In this post, I will go through a demo of using Lucene’s simple API for indexing and searching Tweets. We will be indexing Tweets from the Sentiment140 Tweet corpus. This dataset provides the following data points for each Tweet:

  1. the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
  2. the id of the tweet (2087)
  3. the date of the tweet (Sat May 16 23:58:44 UTC 2009)
  4. the query (lyx). If there is no query, then this value is NO_QUERY.
  5. the user that tweeted (robotickilldozr)
  6. the text of the tweet (Lyx is cool)

Here is an example of the tweets in the file:

"0","1468051743","Mon Apr 06 23:27:33 PDT 2009","NO_QUERY","someone81","Grr not down to go to school today"
"0","1468011579","Tue Dec 26 12:29:11 PDT 2008","NO_QUERY","some1else","Another tweet"

At this point, we can see that these columns cleanly map to Fields that Lucene will end up indexing. See the previous Introduction to Lucene post for more information about Fields.
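Since each line is a quoted, comma-separated record, we first need to split it into its six columns. Here is a minimal sketch (the tweetInfo array name is our own choice; a production parser should use a real CSV library):

// Split one Sentiment140 line into its six columns by stripping the
// outer quotes and splitting on the "," delimiter between fields.
// (Naive: assumes no tweet text contains the "," sequence itself.)
String line = "\"0\",\"1468051743\",\"Mon Apr 06 23:27:33 PDT 2009\",\"NO_QUERY\",\"someone81\",\"Grr not down to go to school today\"";
String[] tweetInfo = line.substring(1, line.length() - 1).split("\",\"");
// tweetInfo[0] = polarity, tweetInfo[1] = id,   tweetInfo[2] = date,
// tweetInfo[3] = query,    tweetInfo[4] = user, tweetInfo[5] = text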

Indexing

Creating the IndexWriter

Let’s start by creating our IndexWriter (see below). In our example, we will use the local file system to store our index. Another option is to store the index in main memory (RAMDirectory); this is suitable for smaller indexes where response time is the highest priority.

IndexWriter indexWriter = new IndexWriter(FSDirectory.open(indexDir),
        new IndexWriterConfig(Version.LUCENE_44, new KeywordAnalyzer()));
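For comparison, here is a sketch of the in-memory variant mentioned above (same configuration, only the Directory changes):

// In-memory index: fast, but lost when the JVM exits.
Directory ramDir = new RAMDirectory();
IndexWriter ramWriter = new IndexWriter(ramDir,
        new IndexWriterConfig(Version.LUCENE_44, new KeywordAnalyzer()));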

Configuring Analyzers

Notice that we also configured our IndexWriter with a KeywordAnalyzer. This Analyzer generates a single token for the entire string. We can choose among many different Analyzers, based on our use case. Other standard Analyzers include the StandardAnalyzer, WhitespaceAnalyzer, StopAnalyzer and SnowballAnalyzer. You could even implement your own Analyzer to suit your use case!
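To see what an Analyzer actually does, here is a minimal sketch that prints the tokens produced for a sample tweet; swapping in a different Analyzer is an easy way to compare their behavior:

// Print the tokens a StandardAnalyzer produces for a sample tweet.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);
TokenStream stream = analyzer.tokenStream("text", new StringReader("Lyx is cool"));
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    System.out.println(term.toString()); // prints "lyx" and "cool"; "is" is dropped as a stop word
}
stream.end();
stream.close();

A KeywordAnalyzer, by contrast, would emit the entire string "Lyx is cool" as a single token.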

Adding Documents

Configuring Fields

The first step in adding Documents is configuring our FieldType. The two main options we care about are storing vs. indexing. Storing the value means that the actual value will also be kept in the index; this is useful when we want to output the value when searching, and these values can be compressed for large documents. The other option is to index the value. We can turn off indexing for a field when we know that our Lucene search queries are not going to be using that field for lookups.

In our case, we want to configure all the fields to be stored and indexed. So here is how we create our FieldType:

FieldType fieldType = new FieldType();
fieldType.setStored(true);
fieldType.setIndexed(true);

Creating Fields and adding to Documents

In this next code snippet, we create our Document object and the Fields that we are interested in. Here, tweetInfo is the String array of six columns produced by the parsing sketch above.

Document document = new Document();

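// DATE, USER and TEXT are String constants naming our fields;
// tweetInfo holds the six parsed columns of one corpus line.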
document.add(new Field(DATE, tweetInfo[2], fieldType));
document.add(new StringField(USER, tweetInfo[4], Field.Store.YES));
document.add(new StringField(TEXT, tweetInfo[5], Field.Store.YES));

indexWriter.addDocument(document);
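One detail the snippet leaves out: documents only become visible to searchers after the writer commits them, so once all tweets have been added we should close (or at least commit) the writer:

// Commits all pending documents and releases the index's write lock.
indexWriter.close();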

At this point, our index is created and we can start querying.

Searching

Lucene provides a very robust API for constructing and executing queries on our index. The documentation provides a good introduction to the search syntax. We will look at building various queries to search our Tweet index.

Creating the IndexSearcher

This section mirrors the “Creating the IndexWriter” section above. In order to run search queries on our index, we need the IndexSearcher class. We point a Directory at the location of our index in the file system (or in memory), open it with a DirectoryReader (which atomically opens and reads the index), and hand that reader to the IndexSearcher. Here is how to do that:

Directory directory = FSDirectory.open(indexDir);
DirectoryReader directoryReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(directoryReader);

Building the query

Once we have our IndexSearcher, we need to create our Query. Here is one example using a TermQuery to find all Tweets with a certain value of a given field. In this query, we are finding all tweets by a certain user:

Term term = new Term(USER, "scotthamilton");
Query query = new TermQuery(term);
TopDocs topDocs = indexSearcher.search(query, numHits);

The TopDocs object represents the results of our query, and its array of ScoreDocs holds each matching document’s ID and relevance score. Once we have our results, we can print out each document’s data using the IndexSearcher:

ScoreDoc[] results = topDocs.scoreDocs;
for (int i = 0; i < results.length; i++) {
  int docId = results[i].doc;
  Document foundDocument = indexSearcher.doc(docId);
  System.out.println(foundDocument.get(USER) + " : " + foundDocument.get(TEXT));
}
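A TermQuery requires an exact match on the field value. Lucene offers many other query types; for example, here is a sketch of a PrefixQuery (the prefix is hypothetical):

// Find all tweets by users whose name starts with "scott".
Query prefixQuery = new PrefixQuery(new Term(USER, "scott"));
TopDocs prefixDocs = indexSearcher.search(prefixQuery, numHits);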

To see the entire source code, visit this repository.

Solr

Apache Solr is an enterprise-level HTTP search server built on top of Lucene. Solr performs all the operations of Lucene and provides additional features that are not available in Lucene alone. Documents (in XML, JSON, CSV or binary format) are added to the index via an HTTP POST; search results (in XML, JSON, CSV or binary format) are returned via an HTTP GET. Here are a few of the features Solr provides (see the sketch after this list):

  • Full text search - allows for complex search queries.
  • Logging - provides logging for debugging and support purposes.
  • Near real time indexing - search results are updated soon after a document has been indexed.
  • Faceted search - allows search results to be categorized into sub-groups.
  • Geo-spatial search - allows searching based on geographic location.
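To make the HTTP workflow concrete, here is a sketch using the SolrJ client library (the localhost URL and the field names are assumptions; they depend on how the server and its schema are set up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class SolrDemo {
    public static void main(String[] args) throws Exception {
        // Assumes a Solr 4.x server running locally.
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        // Add a document (sent to Solr as an HTTP POST under the hood).
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1468051743");
        doc.addField("user", "someone81");
        doc.addField("text", "Grr not down to go to school today");
        server.add(doc);
        server.commit();

        // Search the index (an HTTP GET under the hood).
        QueryResponse response = server.query(new SolrQuery("text:school"));
        System.out.println(response.getResults().getNumFound() + " hit(s)");
    }
}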

SolrCloud

SolrCloud is a distributed version of Solr. It provides distributed indexing and searching for a large-scale, fault-tolerant cluster of Solr servers, and uses Apache ZooKeeper for cluster configuration and management. When the data being indexed is too large for a single server, SolrCloud breaks it up into shards. Shards are portions of the entire index and can be distributed across different servers in a cluster. When a document is added to the index, SolrCloud routes it to the server hosting the shard that document belongs to. SolrCloud provides additional features on top of Solr, including:

  • Automatic failover - if a single node goes down, its portion of the index is recovered from a replica on a different node.
  • Maintaining consistency - updates to the index are directed to the correct shard so that one consistent view of each document is maintained.
  • Automatic shard partitioning - SolrCloud only needs to know the number of shards; it takes care of partitioning the index and forwards each update to the correct shard.
  • Simple configuration - SolrCloud uses ZooKeeper to centralize the configuration for the cluster.

Here is a diagram showing the SolrCloud architecture (source):

Solr Architecture

Elasticsearch vs SolrCloud

Elasticsearch is another enterprise search engine built on top of Apache Lucene. It is a competitor to SolrCloud; both add features to Lucene and provide an HTTP wrapper around Lucene through which documents can be indexed and searched. Here are a few differences between Elasticsearch and SolrCloud:

  • Solr uses ZooKeeper for cluster configuration, while Elasticsearch uses an internal coordination mechanism.
  • Both Elasticsearch and SolrCloud use the concept of sharding (partitioning the Lucene index).
  • Elasticsearch uses a JSON query syntax, while Solr uses simple key/value pair queries.
  • Elasticsearch’s killer feature is the Percolator, which lets the user register queries and be alerted when newly added documents match them. From the documentation: “Instead of sending docs, indexing them, and then running queries, one sends queries, registers them, and then sends docs and finds out which queries match that doc.”

See this for more information about the differences between SolrCloud and Elasticsearch.

Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

What is Lucene?

Lucene is a library that allows the user to index textual data (Word and PDF documents, emails, webpages, tweets, etc.) and add search capabilities to an application. There are two main steps that Lucene performs:

  1. Create an index of documents you want to search.
  2. Parse query, search index, return results.

Indexing

Lucene uses an inverted index (a mapping from each term to its metadata). This metadata includes information about which files contain the term, the number of occurrences, and so on. The fundamental units of indexing in Lucene are the Document and Field classes:

  • A Document is a container that contains one or more Fields.
  • A Field stores the terms we want to index and search on. It stores a mapping of a key (name of the field) and a value (value of the field that we find in the content).

Here is a diagram describing the steps Lucene takes when indexing content (Source: Lucene in Action, Figure 2.1).

Indexing Overview

Extracting Text

In order to be able to index documents of various types, Lucene needs to be able to extract the text from the given document into a format that it can parse. Apache Tika is one framework that parses documents and extracts text content.
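For example, a minimal Tika sketch (the file name is hypothetical):

// Extract plain text from an arbitrary document (PDF, Word, HTML, ...).
Tika tika = new Tika();
String text = tika.parseToString(new File("tweets-report.pdf"));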

Analyze

This process filters and cleans up the text data. The text goes through several steps (for example: extracting words, removing common (stop) words, lowercasing, etc.) that convert it into tokens which can be added to the index. This analysis feeds the indexing process, which results in an inverted index being stored on the underlying filesystem. See below for an example of an inverted index.


Write Index

Lucene uses an inverted index data structure for storing the Fields we want to search on. An inverted index uses the tokens as the lookup key to find the documents that contain that token; it maps the content to its location. The index can be physically stored as part of a Directory (either in a file system or in memory).
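As a toy illustration of the data structure (not Lucene’s actual on-disk format), an inverted index can be modeled as a map from token to the set of documents containing it:

import java.util.*;

public class ToyInvertedIndex {
    // token -> IDs of the documents containing that token
    private final Map<String, Set<Integer>> index = new HashMap<String, Set<Integer>>();

    public void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!index.containsKey(token)) {
                index.put(token, new TreeSet<Integer>());
            }
            index.get(token).add(docId);
        }
    }

    public Set<Integer> lookup(String token) {
        Set<Integer> postings = index.get(token.toLowerCase());
        return postings == null ? Collections.<Integer>emptySet() : postings;
    }

    public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.add(1, "lyx is cool");
        idx.add(2, "school is cool");
        System.out.println(idx.lookup("cool")); // prints [1, 2]
    }
}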

Below is an example of an inverted index. Logically, this represents the result of the indexing process.

Inverted Index

Searching

Once our documents are indexed, we can add search functionality. All queries of the index go through the IndexSearcher. Given a search expression, we create a QueryParser, parse the expression into a Query, and search the index for results. The results are returned as a TopDocs object containing ScoreDocs, which hold the document IDs and the confidence scores of the results that match the query. The fundamental classes for searching are:

  • IndexSearcher - Provides “read-only” access to the index. Exposes several search methods that take a Query object and return the top n “best” results as TopDocs. This class is the counterpart to the IndexWriter class used for creating/updating indexes.
  • Term - The basic unit of searching, and the counterpart to the Field object used in indexing. We create a Field when indexing (for example, “Name” : “Chuck Norris”) and use a Term, which carries the same field-name-to-value mapping, in a TermQuery when searching.
  • Query - Lucene provides several types of Queries, including TermQuery, BooleanQuery, PrefixQuery, WildcardQuery, PhraseQuery, and FuzzyQuery. Each type provides a unique way of searching the index (see the sketch after this list).
  • QueryParser - Parses a human-readable query (for example, “opower AND arlington”) into a Query object that can be used for searching.
  • TopDocs - Container for pointers to the top N search results. Each result is a ScoreDoc holding a document ID and a confidence score.
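To make this concrete, here is a sketch of two equivalent ways to express “opower AND arlington” (using Lucene 4.4’s classic QueryParser; the “text” field name is illustrative):

// Parse a human-readable boolean expression into a Query...
Query parsed = new QueryParser(Version.LUCENE_44, "text",
        new StandardAnalyzer(Version.LUCENE_44)).parse("opower AND arlington");

// ...or build the same query programmatically.
BooleanQuery manual = new BooleanQuery();
manual.add(new TermQuery(new Term("text", "opower")), BooleanClause.Occur.MUST);
manual.add(new TermQuery(new Term("text", "arlington")), BooleanClause.Occur.MUST);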

Here is an overview of the process described above:

Searching Process

This concludes a high-level introduction to Apache Lucene. In future posts, I will explore Solr and give an example of using Lucene in a real application. The inspiration for this series is derived from a meetup of the Washington D.C. Hadoop Users Group in which Douglas Cutting spoke about Lucene.

References

McCandless, Michael; Hatcher, Erik; Gospodnetić, Otis (2010). Lucene in Action, Second Edition. Manning.