Wildcard searching with ngrams

medihack edited this page Feb 9, 2011 · 7 revisions

I have read many posts from people who want to run wildcard searches with Sunspot but get stuck because the DisMax query parser does not (yet) support wildcards (e.g. "sun*" should find "sunspot").

A simple solution lies in using an extra filter factory in your schema.xml. Out of the box, your text field is defined as:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

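The filter to add is the EdgeNGramFilterFactory. It goes into the index-time analyzer only; the query-time analyzer keeps the original filter chain, so the search term itself is not split into n-grams: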

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

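A complete explanation of EdgeNGramFilterFactory can be found in the Solr documentation on analyzers and token filters. In short, it takes every token and emits additional tokens called "edge n-grams": the leading substrings of the token (because of side="front") whose length lies between minGramSize and maxGramSize. With the above configuration, "sunspot" is indexed as "su", "sun", "suns", "sunsp", "sunspo", "sunspot".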

This means that if your search term is "sun", then a document containing "sunspot" will be matched, since this word has also generated the token "sun".
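The token expansion and the resulting prefix match can be sketched in plain Ruby. This is only an illustration of the idea, not Solr's or Sunspot's code: `edge_ngrams` is a hypothetical helper, and its keyword defaults simply mirror the minGramSize and maxGramSize values from the configuration above.

```ruby
# Hypothetical helper (not part of Sunspot or Solr): returns the front-side
# edge n-grams of a token, i.e. its prefixes of length min_gram..max_gram.
def edge_ngrams(token, min_gram: 2, max_gram: 15)
  upper = [max_gram, token.length].min
  # An empty range (token shorter than min_gram) yields no grams at all.
  (min_gram..upper).map { |size| token[0, size] }
end

# Index time: the token is lowercased, then expanded into edge n-grams.
index_terms = edge_ngrams("Sunspot".downcase)

# Query time: no n-gram filter, the search term is only lowercased.
query_term = "Sun".downcase

puts index_terms.inspect
puts index_terms.include?(query_term)
```

Because the query term is compared against the stored prefixes as a whole, "sun" matches, while a term such as "spot" does not: it is not a prefix of "sunspot".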

One can also use NGramFilterFactory to get substring search instead of just prefix or suffix search. It replaces the EdgeNGramFilterFactory line in the index-time analyzer:

<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
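To see why this enables substring search, here is a minimal Ruby sketch of what a full n-gram filter produces. `ngrams` is a hypothetical helper written for this page, not a Solr or Sunspot API, and the order in which Solr actually emits the grams is not modelled here.

```ruby
# Hypothetical helper (not part of Sunspot or Solr): returns every substring
# of the token whose length lies between min_gram and max_gram.
def ngrams(token, min_gram: 2, max_gram: 15)
  upper = [max_gram, token.length].min
  (min_gram..upper).flat_map do |size|
    # Slide a window of the given size across the whole token.
    (0..token.length - size).map { |start| token[start, size] }
  end
end

terms = ngrams("sunspot")

puts terms.include?("spot")  # a substring in the middle/end of the word
puts terms.include?("sun")   # prefixes are still covered
puts terms.length
```

The trade-off is index size: for "sunspot" this produces 21 terms instead of the 6 edge n-grams, so the index grows noticeably for long tokens.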

Remember to restart the Solr server and reindex after applying this filter; documents indexed before the change do not contain the n-gram tokens.

2011-01-27: The first post in this discussion is worth noting: https://groups.google.com/d/topic/ruby-sunspot/9yTr00NCbxc/discussion