When and where does tokenization / analysis happen? #3

Open
valencik opened this issue Feb 21, 2023 · 3 comments
Labels
analysis Related to analyzing and tokenizing text

Comments

@valencik
Contributor

valencik commented Feb 21, 2023

I went to write the Codec for MultiIndex and ran into a problem.
Currently MultiIndex looks like:

case class MultiIndex(
    indexes: Map[String, TermIndexArray],
    analyzers: Map[String, Analyzer],
    defaultField: String,
    defaultOR: Boolean = true,
)

The problem is the inclusion of Analyzer.
We can't really serialize Analyzer because it contains a function:

sealed class Analyzer private (
    tokenizer: (String) => Vector[String],
    lowerCase: Boolean,
    stopWords: Set[String],
)

Maybe we could somehow get the serialization to work for the JVM, but I doubt it would be portable to JS and Native.
It's really just a road we don't want to go down.

Why is Analyzer a part of the index at all?
It ended up there during the work to add support for lower casing queries.
In order for "bad" to match "Bad" we use an Analyzer with lowerCase = true at both index time and query time.
This is easy to accomplish if the Analyzer is just baked right into the index.
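
For example (just a sketch, not the real textmogrify API), it's the same lowercasing analysis applied on both sides that makes the match work:

val analyze: String => Vector[String] =
  s => s.split("\\s+").toVector.map(_.toLowerCase)

val indexedTerms = analyze("Bad to the Bone") // Vector("bad", "to", "the", "bone")
val queryTerms   = analyze("bad")             // Vector("bad")

// the query term matches because both sides went through the same analysis
queryTerms.forall(indexedTerms.contains)      // true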

The path forward is likely to:

  • remove Analyzer from the index
  • continue to require an Analyzer at index build time, just don't save it in the index
  • require queries be analyzed outside of the index before executing them

On this last point, requiring queries to be analyzed gets us closer to what Lucene does, which is to require an Analyzer for query parsing. I think this is probably how we should do things as well. The end solution here would mean that Lucille needs some notion of analysis as well. Today it basically just does whitespace tokenization, baked right in: "fast cast" gets parsed as two TermQs.

I think that's good default behaviour for Lucille. So we really just need some way to provide a different analyzer if desired.
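
For illustration (hypothetical stand-ins, not lucille's actual API), the baked-in behaviour and the hook we'd want look roughly like:

sealed trait Query
case class TermQ(term: String) extends Query

// today: whitespace tokenization baked in, so "fast cast" is two TermQs
def parse(queryString: String): Vector[Query] =
  queryString.split("\\s+").toVector.map(TermQ(_))

parse("fast cast") // Vector(TermQ("fast"), TermQ("cast"))

// what we want: a way to plug in a different analyzer at parse time
def parseWith(analyze: String => Vector[String])(queryString: String): Vector[Query] =
  analyze(queryString).map(TermQ(_))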

@valencik
Contributor Author

As an intermediate step I am thinking of writing some sort of QueryAnalyzer that we configure and then it ultimately provides the String => Query function we use at query time.
Something like:

case class QueryAnalyzer(defaultField: String)(
  head: (String, Analyzer),
  tail: (String, Analyzer)*
) {
  def parse(queryString: String): NonEmptyList[Query] = ???
}

So constructing it is very similar to how we're constructing the MultiIndex currently:

  val analyzer = Analyzer.default.withLowerCasing

  val index = MultiIndex.apply[Book](
    ("title", _.title, analyzer),
    ("author", _.author, analyzer),
  )(allBooks)
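
Constructing the QueryAnalyzer would mirror that, something like (a sketch only, since parse above is still unimplemented):

  val queryAnalyzer = QueryAnalyzer("title")(
    ("title", analyzer),
    ("author", analyzer),
  )

  // queryAnalyzer.parse("fast cast") would then give back the analyzed queries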

We should eventually just call that an IndexSchema or something and reuse it in both places.

@samspills
Contributor

Capturing some discussion from discord:

  • we could try following in elasticsearch's path and serialize a description of the analyzer (see the sketch below)
  • with the analyzer builders, users could use the builder to construct the analyzer and then we could serialize the description (so the user doesn't have to write json)
  • we can leverage the work / naming decisions elasticsearch has already made here
    • with some consideration for making clear which keys are special and which ones are just random words
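
As a rough sketch of that first idea (all names here are made up, nothing is decided): the description is plain data, so any codec can handle it, and the analysis function is rebuilt from it on load. It's shown as a bare String => Vector[String] since Analyzer's constructor is private:

case class AnalyzerConfig(
    tokenizer: String,      // e.g. "whitespace", borrowing elasticsearch's naming
    lowerCase: Boolean,
    stopWords: Set[String],
)

def buildTokenizer(name: String): String => Vector[String] =
  name match {
    case "whitespace" => s => s.split("\\s+").toVector
    case other        => sys.error(s"unknown tokenizer: $other")
  }

def buildAnalysis(config: AnalyzerConfig): String => Vector[String] = { s =>
  val tokens = buildTokenizer(config.tokenizer)(s)
  val cased  = if (config.lowerCase) tokens.map(_.toLowerCase) else tokens
  cased.filterNot(config.stopWords.contains)
}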

valencik mentioned this issue Mar 27, 2023
@valencik
Contributor Author

Analyzer was removed from Index and MultiIndex and is now used:

  • At indexing time, but not stored
  • At query time to rewrite queries with QueryAnalyzer

(See #4)

So the bulk of this work is done. What remains is to leverage the analysis from textmogrify with the parsing from lucille.

valencik added the analysis label Mar 30, 2023