In this post, I will go through a demo of using Lucene’s simple API for indexing and searching Tweets. We will be indexing Tweets from the Sentiment140 Tweet corpus. This dataset provides the following data points for each Tweet:
- the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
- the id of the tweet (2087)
- the date of the tweet (Sat May 16 23:58:44 UTC 2009)
- the query (lyx). If there is no query, then this value is NO_QUERY.
- the user that tweeted (robotickilldozr)
- the text of the tweet (Lyx is cool)
Here is an example of the tweets in the file:
"0","1468051743","Mon Apr 06 23:27:33 PDT 2009","NO_QUERY","someone81","Grr not down to go to school today"
"0","1468011579","Tue Dec 26 12:29:11 PDT 2008","NO_QUERY","some1else","Another tweet"
At this point, we can see that these columns cleanly map to Fields that Lucene will end up indexing. See the previous Introduction to Lucene post for more information about Fields.
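Before anything can be indexed, each line has to be split into its six columns. Here is a minimal sketch of that step using only the JDK. It assumes every column is wrapped in double quotes, as in the sample above, and that no tweet contains an escaped quote. The class name TweetLineParser is hypothetical and not part of the original source code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TweetLineParser {
    // Matches one double-quoted column; assumes no escaped quotes inside a column.
    private static final Pattern QUOTED_FIELD = Pattern.compile("\"([^\"]*)\"");

    /** Returns the quoted column values of one line, in file order. */
    public static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        Matcher matcher = QUOTED_FIELD.matcher(line);
        while (matcher.find()) {
            fields.add(matcher.group(1));
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "\"0\",\"1468051743\",\"Mon Apr 06 23:27:33 PDT 2009\","
                + "\"NO_QUERY\",\"someone81\",\"Grr not down to go to school today\"";
        List<String> fields = parse(line);
        // Column order: polarity, id, date, query, user, text
        System.out.println(fields.get(4) + ": " + fields.get(5));
    }
}
```

Because the pattern only looks at what sits between quotes, commas inside the tweet text do not split a column.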
Creating the IndexWriter
Let’s start by creating our IndexWriter (see below). In our example, we will be using the local File System to store our index. Another option is to store our index in main memory (RAMDirectory). This option is suitable for smaller indexes, where response time is the highest priority.
Notice that we also configured our IndexWriter with a KeywordAnalyzer. This Analyzer generates a single token for the entire string. We can choose from many different Analyzers, based on our use case. Other standard Analyzers include the StandardAnalyzer, WhitespaceAnalyzer, StopAnalyzer and SnowballAnalyzer. You could even implement your own Analyzer to suit your use case!
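The setup described above could look roughly like the following sketch. It assumes the Lucene 5 through 8 style API, where FSDirectory.open takes a java.nio.file.Path and IndexWriterConfig takes only an Analyzer (the older 4.x releases also expect a Version argument and a java.io.File). The class name TweetIndexWriterFactory and the "tweet-index" directory are placeholders chosen for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TweetIndexWriterFactory {
    /** Opens (or creates) a file-system index at indexPath, analyzed with KeywordAnalyzer. */
    public static IndexWriter open(Path indexPath) throws IOException {
        // The Directory tells Lucene where the index files live.
        Directory directory = FSDirectory.open(indexPath);
        // KeywordAnalyzer turns each field value into exactly one token.
        IndexWriterConfig config = new IndexWriterConfig(new KeywordAnalyzer());
        return new IndexWriter(directory, config);
    }

    public static void main(String[] args) throws IOException {
        // "tweet-index" is a placeholder location for this sketch.
        try (IndexWriter writer = open(Paths.get("tweet-index"))) {
            System.out.println("Index opened at " + writer.getDirectory());
        }
    }
}
```

Swapping in a different Analyzer only changes the argument passed to IndexWriterConfig. The rest of the setup stays the same.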
The first step in adding Documents is configuring our FieldType. The two main options we care about are storing and indexing. Storing a field means that its original value is kept in the index, which is useful when we want to output the value in search results. We can compress these values for large documents. Indexing a field makes it searchable. We can turn off indexing for a field when we know that our Lucene search queries are not going to use that field for lookups.
In our case, we want to configure all the fields to be stored and indexed. So here is how we create our FieldType:
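A FieldType that is both stored and indexed might be configured as in the sketch below. It assumes Lucene 5 or later, where indexing is switched on through setIndexOptions. The class and method names are hypothetical, and IndexOptions.DOCS is just one reasonable choice for exact-match lookups.

```java
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;

public class TweetFieldTypes {
    /** Builds a FieldType whose values are stored and searchable. */
    public static FieldType storedAndIndexed() {
        FieldType type = new FieldType();
        // Keep the original value so it can be printed with search results.
        type.setStored(true);
        // Index the field; DOCS records only which documents contain a term.
        type.setIndexOptions(IndexOptions.DOCS);
        // Run the value through the Analyzer (one token with KeywordAnalyzer).
        type.setTokenized(true);
        // Prevent accidental changes after the type is shared between fields.
        type.freeze();
        return type;
    }

    public static void main(String[] args) {
        FieldType type = storedAndIndexed();
        System.out.println("stored=" + type.stored() + ", indexOptions=" + type.indexOptions());
    }
}
```

To keep a field out of search while still returning it in results, leave the index options at their default of IndexOptions.NONE and keep setStored(true).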
Creating Fields and adding to Documents
In this next code snippet, we create our Document object and the Fields that we are interested in.
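As a rough sketch, one Tweet could be turned into a Document like this. The field names ("polarity", "id", "date", "query", "user", "text") are assumptions made for this example, as is the class name TweetDocuments. Every field reuses one stored-and-indexed FieldType.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;

public class TweetDocuments {
    // One shared FieldType: every column is stored and indexed.
    private static final FieldType STORED_AND_INDEXED = new FieldType();
    static {
        STORED_AND_INDEXED.setStored(true);
        STORED_AND_INDEXED.setIndexOptions(IndexOptions.DOCS);
        STORED_AND_INDEXED.freeze();
    }

    /** Maps the six columns of one Tweet onto Lucene Fields. */
    public static Document toDocument(String polarity, String id, String date,
                                      String query, String user, String text) {
        Document doc = new Document();
        doc.add(new Field("polarity", polarity, STORED_AND_INDEXED));
        doc.add(new Field("id", id, STORED_AND_INDEXED));
        doc.add(new Field("date", date, STORED_AND_INDEXED));
        doc.add(new Field("query", query, STORED_AND_INDEXED));
        doc.add(new Field("user", user, STORED_AND_INDEXED));
        doc.add(new Field("text", text, STORED_AND_INDEXED));
        return doc;
    }

    public static void main(String[] args) {
        Document doc = toDocument("0", "1468051743", "Mon Apr 06 23:27:33 PDT 2009",
                "NO_QUERY", "someone81", "Grr not down to go to school today");
        System.out.println(doc.get("user") + ": " + doc.get("text"));
    }
}
```

Each Document is then handed to the IndexWriter with addDocument, and closing the writer (or calling commit) makes the additions visible to readers opened afterwards.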
At this point, our index is created and we can start querying.
Lucene provides a very robust API for constructing and executing queries on our index. The documentation provides a good introduction to the search syntax. We will look at building various queries to search our Tweet index.
Creating the IndexSearcher
This section mirrors the above “Creating the IndexWriter” section. To run search queries on our index, we need the IndexSearcher class. Each IndexSearcher needs to know where our index lives, either on the file system or in memory. We give it that through a DirectoryReader, which atomically opens and reads the index at that location. Here is how to do that:
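A minimal sketch of this step, again assuming the Path-based API of Lucene 5 and later. The class name TweetSearcherFactory and the "tweet-index" directory are placeholders. The index must already exist at that location, otherwise DirectoryReader.open fails.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TweetSearcherFactory {
    /** Opens a searcher over an existing file-system index at indexPath. */
    public static IndexSearcher open(Path indexPath) throws IOException {
        // Same location that the IndexWriter wrote to.
        Directory directory = FSDirectory.open(indexPath);
        // The reader sees the index as it was when it was opened.
        DirectoryReader reader = DirectoryReader.open(directory);
        return new IndexSearcher(reader);
    }

    public static void main(String[] args) throws IOException {
        // "tweet-index" is a placeholder location for this sketch.
        IndexSearcher searcher = open(Paths.get("tweet-index"));
        System.out.println("Documents in index: " + searcher.getIndexReader().numDocs());
        searcher.getIndexReader().close();
    }
}
```

Documents added after the reader was opened are not visible to this searcher. A new reader has to be opened to see them.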
Building the query
Once we have our IndexSearcher, we need to create our Query. Here is one example using a TermQuery to find all Tweets based on a certain value of a given field. In this query, we are finding all tweets by a certain user:
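The user lookup can be sketched as follows. The field name "user" and the class name UserTweetSearch are assumptions for this example and must match whatever name was used at indexing time.

```java
import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class UserTweetSearch {
    /** Returns up to maxHits Tweets whose "user" field equals the given user name. */
    public static TopDocs findByUser(IndexSearcher searcher, String user, int maxHits)
            throws IOException {
        // A TermQuery matches one exact term in one field.
        Query query = new TermQuery(new Term("user", user));
        return searcher.search(query, maxHits);
    }
}
```

A TermQuery is not passed through an Analyzer, so the value has to equal the indexed term exactly. With the KeywordAnalyzer the indexed term is the whole field value, which means "someone81" matches but "Someone81" does not.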
The TopDocs represent the results of our query. Each ScoreDoc represents a document that matched the query, together with the score of that match. Once we have our results, we can print out each document’s data using the IndexSearcher:
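Walking the results might look like the sketch below. It assumes the "user" and "text" fields were stored at indexing time and uses IndexSearcher.doc(int), which is available in Lucene 4 through 9 (newer releases move stored-field access elsewhere). The class name TweetResultPrinter is hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class TweetResultPrinter {
    /** Formats each hit as "score | user | text", in ranking order. */
    public static List<String> format(IndexSearcher searcher, TopDocs topDocs)
            throws IOException {
        List<String> lines = new ArrayList<>();
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            // scoreDoc.doc is Lucene's internal document number, not the tweet id.
            Document doc = searcher.doc(scoreDoc.doc);
            lines.add(scoreDoc.score + " | " + doc.get("user") + " | " + doc.get("text"));
        }
        return lines;
    }

    /** Prints the formatted hits to standard output. */
    public static void print(IndexSearcher searcher, TopDocs topDocs) throws IOException {
        for (String line : format(searcher, topDocs)) {
            System.out.println(line);
        }
    }
}
```

Document.get returns null for a field that was indexed but not stored, which is why storing matters for any value we want to display.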
To see the entire source code, visit this repository.