
Enhancement proposal - be permissive about typos when searching #306

Open
rsvoboda opened this issue Jul 18, 2024 · 1 comment
Labels
good first issue Good for newcomers

Comments

@rsvoboda
Member

I have an enhancement proposal to be permissive about typos when searching.

Here is an example: https://quarkus.io/guides/#q=aplication gives "Sorry, no guides matched your search. Please try again."
Same for https://quarkus.io/guides/#q=Configuring+your+application vs. https://quarkus.io/guides/#q=Configuring+your+aplication

Is there a way to tolerate typos because they are quite common, especially for non-native speakers?

Some approximate matching could help (I think Hibernate Search had something for it), or maybe there is an existing mapping for common English misspelled words, ...

@yrodiere yrodiere added the good first issue Good for newcomers label Jul 19, 2024
@yrodiere
Member

Hey,

This would be a nice feature indeed.

> maybe there is an existing mapping for common English misspelled words, ...

I don't think a hard coded list will work, no. Fortunately, there are other solutions :)

We need to consider two things IMO: how to match "approximately", and when to match approximately.

How

Fuzzy queries (which allow terms with one or two typos) are a thing, but I'd personally stay away from them, because:

  1. They reach their limits quite fast, and then you have to switch to a completely different solution.
  2. They are not available everywhere; e.g. I'm not sure we can use them "by default" in the simple query strings we're using right now in search.
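For intuition: the "one or two typos" that fuzzy queries allow is usually defined in terms of Levenshtein edit distance, i.e. the number of single-character insertions, deletions, or substitutions needed to turn one string into the other. A minimal, illustration-only sketch in plain Java (independent of any search library):

```java
public class EditDistance {
    // Classic dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),     // deletion
                                   prev[j - 1] + cost);       // substitution
            }
            prev = curr;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // "aplication" is one edit away from "application",
        // so a fuzziness of 1 or 2 would match it.
        System.out.println(levenshtein("aplication", "application")); // 1
    }
}
```

So a fuzzy query with fuzziness 2 accepts any indexed term within edit distance 2 of the searched term, which is exactly why it stops helping once the typo count grows.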

A better approach is to have dedicated fields using an ngram analyzer, e.g. turn tokens into a list of 3-grams:

  • Searched: aplication => [apl, pli, lic, ica, cat, ati, tio, ion]
  • Indexed: application => [app, ppl, pli, lic, ica, cat, ati, tio, ion]
  • Common tokens: [pli, lic, ica, cat, ati, tio, ion]; that's enough to get a good score!
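To make the overlap concrete, here is an illustration-only sketch that computes the 3-grams and their intersection directly; this is plain Java standing in for what an ngram analyzer does at index/search time, not the analyzer configuration itself:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class Ngrams {
    // Split a token into overlapping 3-grams, as a 3-gram analyzer would.
    static Set<String> trigrams(String token) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + 3 <= token.length(); i++) {
            grams.add(token.substring(i, i + 3));
        }
        return grams;
    }

    public static void main(String[] args) {
        Set<String> searched = trigrams("aplication");
        Set<String> indexed = trigrams("application");
        // The common 3-grams are what the scorer rewards.
        Set<String> common = new LinkedHashSet<>(searched);
        common.retainAll(indexed);
        System.out.println(common); // [pli, lic, ica, cat, ati, tio, ion]
    }
}
```

Most of the misspelled term's 3-grams survive the typo, so the matching document still scores well even though no exact token matches.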

When

We could do an "OR" between the current search criteria and the new "fuzzy" ones, but this means that, when searching without typos, we would return a long tail of potentially irrelevant results.

A perhaps better solution would be to run the search without typo support first and, only if that search matches nothing, run a second search with typo support (more fuzzy) and return the results of that second search.

Resources

I tried to explain how to do ngram search here: https://discourse.hibernate.org/t/slop-does-not-work-for-any-word/9253/6?u=yrodiere

As I mentioned above though, we probably don't want to put all predicates in the same query, but rather do something like this:

```java
var results = doSearchWithoutTypoSupport(params);
// Fall back to the fuzzy, ngram-based search only when the
// strict search matched nothing at all.
if (results.total().hitCountLowerBound() == 0) {
    results = doSearchWithTypoSupportUsingNgrams(params);
}
return results;
```

PRs welcome :)
