A Deep Dive into the Lucene Analysis Pipeline
Published on Mon Aug 15 2022
Introduction
In Lucene, analysis is fundamental to both indexing content and parsing queries. It is how raw text is converted into a stream of processed tokens suitable for a search index: breaking the text into tokens, removing noise such as stop words, handling synonyms and phonetics, and so on.
Typically, the same analyzer is used for both indexing and searching so that queries are interpreted the same way as the source documents, although the two sides sometimes differ deliberately (for example, applying synonym expansion only at query time).
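To make this concrete, here is a minimal sketch, assuming Lucene's core jar is on the classpath and using an arbitrary field name "body", that runs a sentence through the built-in StandardAnalyzer and prints each token it produces:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisDemo {
    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = new StandardAnalyzer();
             TokenStream stream = analyzer.tokenStream("body", "The Quick Brown Fox!")) {
            // The CharTermAttribute exposes the text of the current token
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                // Prints lowercased tokens; whether stop words like "the" are
                // dropped depends on the stop set your Lucene version configures
                System.out.println(term);
            }
            stream.end();
        }
    }
}
```

The same tokenStream call is effectively what Lucene performs at index time and at query-parse time, which is why sharing one analyzer keeps the two sides consistent.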
Anatomy of an Analyzer
An analyzer is a pipeline that processes text. It takes in raw text through a Reader and, after several transformation steps, produces a TokenStream.
An Analyzer pipeline consists of three components, executed in order:
- CharFilter(s): Clean the raw text at the character level.
- Tokenizer: Breaks the cleaned text into individual tokens.
- TokenFilter(s): Process the tokens emitted by the tokenizer.
Let's look at each component.
1. CharFilter: Cleaning the Text
A CharFilter can be used to transform the raw text before it reaches the tokenizer. It operates on the character stream itself rather than on tokens. This is useful for tasks like:
- Stripping out HTML tags like <p> and <b> so they aren't treated as part of the text.
- Replacing symbols or characters, for example converting "&" to "and".
- Applying custom pattern-based replacements.
A CharFilter takes a Reader and produces another Reader with the transformation applied. A CharFilter is itself a subclass of Reader, while Tokenizer and TokenFilter are subclasses of TokenStream.
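As a rough sketch of how CharFilters chain together (the HTML snippet here is purely illustrative), the following wraps a StringReader first in an HTMLStripCharFilter and then in a MappingCharFilter that maps "&" to "and", and reads back the cleaned text:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

public class CharFilterDemo {
    public static void main(String[] args) throws IOException {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("&", "and");

        Reader raw = new StringReader("<p>Fish &amp; Chips</p>");
        // Innermost filter runs first: strip tags (and decode entities),
        // then apply the character mapping to the stripped text
        Reader filtered = new MappingCharFilter(builder.build(), new HTMLStripCharFilter(raw));

        StringBuilder out = new StringBuilder();
        char[] buf = new char[256];
        for (int len; (len = filtered.read(buf)) != -1; ) {
            out.append(buf, 0, len);
        }
        System.out.println(out.toString().trim()); // roughly: Fish and Chips
    }
}
```

Because each CharFilter is itself a Reader, they compose by simple wrapping, much like the decorator pattern in java.io streams.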
2. Tokenizer: Breaking Text into Tokens
Once the text is cleaned by CharFilter(s), the Tokenizer takes over. It is responsible for breaking the text into a stream of tokens.
Different tokenizers are available in Lucene, each with its own strategy for splitting text. For instance, a WhitespaceTokenizer simply splits the text on whitespace. The most common choice is the StandardTokenizer, which should suffice for most users. It splits text on word boundaries as defined by the Unicode Text Segmentation algorithm, and it intelligently handles things like punctuation and acronyms.
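As a hedged example (the sample sentence is arbitrary), you can drive a StandardTokenizer directly and watch it split on Unicode word boundaries while keeping "isn't" intact. Note that it does not lowercase; that is left to token filters:

```java
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerDemo {
    public static void main(String[] args) throws IOException {
        try (Tokenizer tokenizer = new StandardTokenizer()) {
            tokenizer.setReader(new StringReader("Lucene is fast, isn't it?"));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                System.out.println(term); // Lucene / is / fast / isn't / it
            }
            tokenizer.end();
        }
    }
}
```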
3. TokenFilter: Processing Tokens
The TokenFilter is the final and most powerful stage of the analysis pipeline. It operates on a TokenStream, i.e., a stream of tokens: the tokens emitted by the Tokenizer are passed through a series of token filters.
Built-in token filters are available that can perform a wide variety of jobs. For example:
- SynonymGraphFilter: Adds synonyms for tokens (e.g., mapping "quick" to "fast")
- StopFilter: Removes common, low-value words (e.g., "a", "is", "the")
- PorterStemFilter: Reduces words to their root form (e.g., "running" becomes "run")
Other filters handle tasks such as trimming tokens, phonetic matching, and producing n-grams.
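To illustrate how filters wrap one another (this chain mirrors the custom analyzer in the next section; the sample text is arbitrary), the sketch below lowercases tokens, drops English stop words, and applies Porter stemming:

```java
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenFilterDemo {
    public static void main(String[] args) throws IOException {
        Tokenizer source = new StandardTokenizer();
        source.setReader(new StringReader("The Runner is Running"));

        // Each filter wraps the previous stream; a single incrementToken()
        // call pulls a token through the whole chain
        TokenStream stream = new LowerCaseFilter(source);
        stream = new StopFilter(stream, EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
        stream = new PorterStemFilter(stream);

        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.println(term); // runner / run ("The" and "is" are dropped)
        }
        stream.end();
        stream.close();
    }
}
```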
Creating a Custom Analyzer
While you can just use one of the built-in analyzers in Lucene, you often need to create your own custom analysis pipeline. This is done by extending the Analyzer class and implementing the createComponents method (and, if you need CharFilters, overriding the initReader method).
```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class MyAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Tokenizer
        final Tokenizer source = new StandardTokenizer();

        // Token filters
        TokenStream result = new LowerCaseFilter(source);
        result = new StopFilter(result, EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
        result = new PorterStemFilter(result);
        // Add more token filters as necessary

        return new TokenStreamComponents(source, result);
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // CharFilter(s)
        // Strip HTML tags
        Reader newReader = new HTMLStripCharFilter(reader);

        // Map "&" to "and"
        NormalizeCharMap normMap = new NormalizeCharMap.Builder()
                .add("&", "and")
                .build();
        newReader = new MappingCharFilter(normMap, newReader);

        // You can add more CharFilters as necessary
        return newReader;
    }
}
```
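As a quick usage sketch for the class above (the field name and sample HTML are illustrative), you can feed text through MyAnalyzer and inspect the output terms:

```java
try (Analyzer analyzer = new MyAnalyzer();
     TokenStream stream = analyzer.tokenStream("body", "<p>Cats &amp; Dogs are running!</p>")) {
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        System.out.println(term); // cat / dog / run
    }
    stream.end();
}
```

In a real application you would typically hand the same analyzer to an IndexWriterConfig at index time and to a QueryParser at query time, which is what keeps indexing and searching consistent.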
Summary
The Lucene Analyzer is a very flexible text-processing pipeline. By understanding and combining the three components (CharFilter, Tokenizer, and TokenFilter), you can precisely control how your text is indexed and searched, leading to more accurate and relevant results.