When and where does tokenization / analysis happen? #3

Open
valencik opened this issue Feb 21, 2023 · 3 comments
Labels
analysis Related to analyzing and tokenizing text

Comments

@valencik
Contributor

valencik commented Feb 21, 2023

I went to write the Codec for MultiIndex and ran into a problem.
Currently MultiIndex looks like:

case class MultiIndex(
    indexes: Map[String, TermIndexArray],
    analyzers: Map[String, Analyzer],
    defaultField: String,
    defaultOR: Boolean = true,
)

The problem is the inclusion of Analyzer.
We can't really serialize Analyzer because it contains a function:

sealed class Analyzer private (
    tokenizer: (String) => Vector[String],
    lowerCase: Boolean,
    stopWords: Set[String],
)

Maybe we could somehow get the serialization to work for the JVM, but I doubt it would be portable to JS and Native.
It's really just a road we don't want to go down.

Why is Analyzer a part of the index at all?
It ended up there during the work to add support for lower casing queries.
In order for "bad" to match "Bad" we use an Analyzer with lowerCase = true at both index time and query time.
This is easy to accomplish if the Analyzer is just baked right into the index.
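
For example (just a sketch, not the real textmogrify API), it's the same lowercasing analysis applied on both sides that makes the match work:

val analyze: String => Vector[String] =
  s => s.split("\\s+").toVector.map(_.toLowerCase)

val indexedTerms = analyze("Bad to the Bone") // Vector("bad", "to", "the", "bone")
val queryTerms   = analyze("bad")             // Vector("bad")

// the query term matches because both sides went through the same analysis
queryTerms.forall(indexedTerms.contains)      // true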

The path forward is likely to:

  • remove Analyzer from the index
  • continue to require an Analyzer at index build time, just don't save it in the index
  • require queries be analyzed outside of the index before executing them

On this last point, requiring queries to be analyzed gets us closer to what Lucene does, which is to require an Analyzer for query parsing. I think this is probably how we should do things as well. The end solution here would mean that Lucille needs some notion of analysis as well. Today it basically just does whitespace tokenization, baked right in: "fast cast" gets parsed as two TermQs.

I think that's good default behaviour for Lucille. So we really just need some way to provide a different analyzer if desired.
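
For illustration (hypothetical stand-ins, not lucille's actual API), the baked-in behaviour and the hook we'd want look roughly like:

sealed trait Query
case class TermQ(term: String) extends Query

// today: whitespace tokenization baked in, so "fast cast" is two TermQs
def parse(queryString: String): Vector[Query] =
  queryString.split("\\s+").toVector.map(TermQ(_))

parse("fast cast") // Vector(TermQ("fast"), TermQ("cast"))

// what we want: a way to plug in a different analyzer at parse time
def parseWith(analyze: String => Vector[String])(queryString: String): Vector[Query] =
  analyze(queryString).map(TermQ(_))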

@valencik
Contributor Author

As an intermediate step I am thinking of writing some sort of QueryAnalyzer that we configure and then it ultimately provides the String => Query function we use at query time.
Something like:

case class QueryAnalyzer(defaultField: String)(
  head: (String, Analyzer),
  tail: (String, Analyzer)*
) {
  def parse(queryString: String): NonEmptyList[Query] = ???
}

So constructing it is very similar to how we're constructing the MultiIndex currently:

  val analyzer = Analyzer.default.withLowerCasing

  val index = MultiIndex.apply[Book](
    ("title", _.title, analyzer),
    ("author", _.author, analyzer),
  )(allBooks)
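
Constructing the QueryAnalyzer would mirror that, something like (a sketch only, since parse above is still unimplemented):

  val queryAnalyzer = QueryAnalyzer("title")(
    ("title", analyzer),
    ("author", analyzer),
  )

  // queryAnalyzer.parse("fast cast") would then give back the analyzed queries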

We should eventually just call that an IndexSchema or something and reuse it in both places.

@samspills
Contributor

Capturing some discussion from discord:

  • we could try following in elasticsearch's path and serialize a description of the analyzer (see the sketch below)
  • with the analyzer builders, users could use the builder to construct the analyzer and then we could serialize the description (so the user doesn't have to write json)
  • we can leverage the work / naming decisions elasticsearch has already made here
    • with some consideration for making clear which keys are special and which ones are just random words
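
As a rough sketch of that first idea (all names here are made up, nothing is decided): the description is plain data, so any codec can handle it, and the analysis function is rebuilt from it on load. It's shown as a bare String => Vector[String] since Analyzer's constructor is private:

case class AnalyzerConfig(
    tokenizer: String,      // e.g. "whitespace", borrowing elasticsearch's naming
    lowerCase: Boolean,
    stopWords: Set[String],
)

def buildTokenizer(name: String): String => Vector[String] =
  name match {
    case "whitespace" => s => s.split("\\s+").toVector
    case other        => sys.error(s"unknown tokenizer: $other")
  }

def buildAnalysis(config: AnalyzerConfig): String => Vector[String] = { s =>
  val tokens = buildTokenizer(config.tokenizer)(s)
  val cased  = if (config.lowerCase) tokens.map(_.toLowerCase) else tokens
  cased.filterNot(config.stopWords.contains)
}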

valencik mentioned this issue Mar 27, 2023
@valencik
Contributor Author

Analyzer was removed from Index and MultiIndex and is now used:

  • At indexing time, but not stored
  • At query time to rewrite queries with QueryAnalyzer

(See #4)

So the bulk of this work is done. What remains is to leverage the analysis from textmogrify with the parsing from lucille.

valencik added the analysis label Mar 30, 2023