A Deep Dive into the Lucene Analysis Pipeline

Published on Mon Aug 15 2022

Introduction

In Lucene, analysis is fundamental to both indexing content and parsing queries. It is the process by which raw text is converted into a stream of processed tokens suitable for a search index: breaking the text into tokens, removing noise such as stop words, handling synonyms and phonetics, and so on.

Typically, the same analyzer is used for both indexing and searching so that queries are interpreted the same way as the source documents, although the two sometimes intentionally differ (for example, when synonym expansion is applied only at query time).

Anatomy of an Analyzer

An analyzer is a pipeline that processes text. It takes in raw text through a Reader and, after several transformation steps, produces a TokenStream.


An Analyzer pipeline consists of three components, executed in order:

  1. CharFilter(s): Cleans the raw text at character level.
  2. Tokenizer: Breaks the cleaned text into individual tokens.
  3. Token Filter(s): Processes the tokens from the tokenizer.


Let's look at each component.

1. CharFilter: Cleaning the Text

A CharFilter performs operations on the raw text before it reaches the tokenizer. It operates on the character stream itself rather than on tokens. This is useful for tasks like:

  • Stripping out HTML tags like <p>, <b>, etc., so they aren't treated as part of the text.
  • Replacing symbols or characters, for example converting "&" to "and".
  • Applying custom pattern-based replacements.

A CharFilter takes a Reader and produces another Reader with the transformations applied. A CharFilter is itself a subclass of Reader, while Tokenizer and TokenFilter are subclasses of TokenStream.
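To make the Reader-in, Reader-out contract concrete, here is a minimal plain-Java sketch that maps "&" to "and" before the text would reach a tokenizer. The class name AmpersandMappingReader is hypothetical and this is not Lucene's CharFilter class: a real CharFilter transforms the stream lazily and corrects character offsets, whereas this sketch simply buffers the input for brevity.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical, simplified stand-in for a Lucene CharFilter: it takes a
// Reader and returns another Reader with "&" replaced by "and".
// A real CharFilter streams the transformation and tracks offsets.
class AmpersandMappingReader {
    static Reader wrap(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            sb.append((char) c);
        }
        return new StringReader(sb.toString().replace("&", "and"));
    }
}

public class CharFilterSketch {
    public static void main(String[] args) throws IOException {
        Reader raw = new StringReader("fish & chips");
        BufferedReader filtered = new BufferedReader(AmpersandMappingReader.wrap(raw));
        System.out.println(filtered.readLine()); // prints: fish and chips
    }
}
```

The key design point mirrored here is composition: because the output is again a Reader, such filters can be stacked in any order before tokenization.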

2. Tokenizer: Breaking Text into Tokens

Once the text is cleaned by CharFilter(s), the Tokenizer takes over. It is responsible for breaking the text into a stream of tokens.

Lucene provides a variety of tokenizers, each with a different strategy for splitting text. For instance, a WhitespaceTokenizer simply splits the text on whitespace. The most common choice is the StandardTokenizer, which should meet the needs of most users: it splits text on word boundaries as defined by the Unicode Text Segmentation algorithm and intelligently handles things like punctuation and acronyms.
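As a rough illustration of the simplest strategy, whitespace tokenization can be sketched in plain Java as below. This is only a conceptual analogue, not Lucene's WhitespaceTokenizer: real tokenizers emit tokens incrementally through the TokenStream API and record attributes such as start and end offsets.

```java
import java.util.Arrays;
import java.util.List;

// Conceptual analogue of whitespace tokenization. Lucene's tokenizers
// produce tokens one at a time from a TokenStream rather than a list.
public class WhitespaceSketch {
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The quick  brown fox"));
        // prints: [The, quick, brown, fox]
    }
}
```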

3. TokenFilter: Processing Tokens

The TokenFilter is the final and most powerful stage of the analysis pipeline. It operates on a TokenStream, which is a stream of tokens. Tokens emitted by the Tokenizer are passed through a series of token filters, each of which can modify, remove, or add tokens.

Lucene ships with built-in token filters that perform a wide variety of jobs. For example:

  • SynonymGraphFilter: Adds synonyms for tokens (e.g., mapping "car" to "automobile")
  • StopFilter: Removes common, low-value words (e.g., "a", "is", "the")
  • PorterStemFilter: Reduces words to their root form (e.g., "running" becomes "run")

Other filters handle tasks such as trimming tokens, phonetic matching, removing stop words, and producing n-grams.
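The chaining idea behind token filters can be sketched in plain Java as below, combining a lowercasing step and a stop-word removal step (loose analogues of LowerCaseFilter and StopFilter). This is illustrative only: real Lucene filters wrap a TokenStream and transform tokens one at a time, not as a list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Conceptual analogue of chaining a LowerCaseFilter and a StopFilter.
// Each step consumes the previous step's tokens, just as Lucene filters
// each wrap the TokenStream produced by the stage before them.
public class FilterChainSketch {
    static final Set<String> STOP_WORDS = Set.of("a", "is", "the");

    static List<String> analyze(List<String> tokens) {
        return tokens.stream()
                .map(String::toLowerCase)             // lowercasing step
                .filter(t -> !STOP_WORDS.contains(t)) // stop-word removal step
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "Pipeline", "is", "a", "Stream");
        System.out.println(analyze(tokens)); // prints: [pipeline, stream]
    }
}
```

Note the order matters: lowercasing first ensures "The" matches the stop-word entry "the", which is also why LowerCaseFilter conventionally precedes StopFilter in Lucene pipelines.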

Creating a Custom Analyzer

While you can simply use one of Lucene's built-in analyzers, you often need to create your own custom analysis pipeline. This is done by extending the Analyzer class and implementing the createComponents method.

public class MyAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Tokenizer
        final Tokenizer source = new StandardTokenizer();
        // Token filters
        TokenStream result = new LowerCaseFilter(source);
        result = new StopFilter(result, EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
        result = new PorterStemFilter(result);
        // Add more token filters as necessary
        return new TokenStreamComponents(source, result);
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Add CharFilter(s)
        // Strip HTML tags
        Reader newReader = new HTMLStripCharFilter(reader);
        // Map "&" to "and"
        NormalizeCharMap normMap = new NormalizeCharMap.Builder()
            .add("&", "and")
            .build();
        newReader = new MappingCharFilter(normMap, newReader);
        // You can add more CharFilters as necessary
        return newReader;
    }
}

Summary

The Lucene Analyzer is a flexible text-processing pipeline. By understanding and combining its three components (CharFilter, Tokenizer, and TokenFilter), you can precisely control how your text is indexed and searched, leading to more accurate and relevant results.