
Enhancement proposal - be permissive about typos when searching #306

Open
rsvoboda opened this issue Jul 18, 2024 · 1 comment
Labels
good first issue Good for newcomers

Comments

@rsvoboda
Member

I have an enhancement proposal to be permissive about typos when searching.

Here is an example: https://quarkus.io/guides/#q=aplication gives "Sorry, no guides matched your search. Please try again."
Same for https://quarkus.io/guides/#q=Configuring+your+application vs. https://quarkus.io/guides/#q=Configuring+your+aplication

Is there a way to tolerate typos because they are quite common, especially for non-native speakers?

Some approximate matching could help (I think Hibernate Search had something for it), or maybe there is an existing mapping for common English misspelled words, ...

@yrodiere yrodiere added the good first issue Good for newcomers label Jul 19, 2024
@yrodiere
Member

Hey,

This would be a nice feature indeed.

> maybe there is an existing mapping for common English misspelled words, ...

I don't think a hard coded list will work, no. Fortunately, there are other solutions :)

We need to consider two things IMO: how to match "approximately", and when to match approximately.

How

Fuzzy queries (which allow terms with one or two typos) are a thing, but I'd personally stay away from them, because:

  1. They reach their limits quite fast, and then you have to switch to a completely different solution.
  2. They are not available everywhere; e.g. I'm not sure we can use them "by default" in the simple query strings we're using right now in search.
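For intuition: the "one or two typos" that fuzzy queries allow is usually defined in terms of Levenshtein edit distance, i.e. the number of single-character insertions, deletions, or substitutions needed to turn one string into the other. A minimal, illustration-only sketch in plain Java (independent of any search library):

```java
public class EditDistance {
    // Classic dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),     // deletion
                                   prev[j - 1] + cost);       // substitution
            }
            prev = curr;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // "aplication" is one edit away from "application",
        // so a fuzziness of 1 or 2 would match it.
        System.out.println(levenshtein("aplication", "application")); // 1
    }
}
```

So a fuzzy query with fuzziness 2 accepts any indexed term within edit distance 2 of the searched term, which is exactly why it stops helping once the typo count grows.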

A better approach is to have dedicated fields using an ngram analyzer, e.g. turn tokens into a list of 3-grams:

  • Searched: aplication => [apl, pli, lic, ica, cat, ati, tio, ion]
  • Indexed: application => [app, ppl, pli, lic, ica, cat, ati, tio, ion]
  • Common tokens: [pli, lic, ica, cat, ati, tio, ion]; that's enough to get a good score!
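To make the overlap concrete, here is an illustration-only sketch that computes the 3-grams and their intersection directly; this is plain Java standing in for what an ngram analyzer does at index/search time, not the analyzer configuration itself:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class Ngrams {
    // Split a token into overlapping 3-grams, as a 3-gram analyzer would.
    static Set<String> trigrams(String token) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + 3 <= token.length(); i++) {
            grams.add(token.substring(i, i + 3));
        }
        return grams;
    }

    public static void main(String[] args) {
        Set<String> searched = trigrams("aplication");
        Set<String> indexed = trigrams("application");
        // The common 3-grams are what the scorer rewards.
        Set<String> common = new LinkedHashSet<>(searched);
        common.retainAll(indexed);
        System.out.println(common); // [pli, lic, ica, cat, ati, tio, ion]
    }
}
```

Most of the misspelled term's 3-grams survive the typo, so the matching document still scores well even though no exact token matches.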

When

We could do an "OR" between the current search criteria and the new "fuzzy" ones, but this means that, when searching without typos, we would return a long tail of potentially irrelevant results.

A perhaps better solution would be to run the search without typo support first and, only if that search matches nothing, run a second search with typo support (more fuzzy) and return the results of that second search.

Resources

I tried to explain how to do ngram search here: https://discourse.hibernate.org/t/slop-does-not-work-for-any-word/9253/6?u=yrodiere

As I mentioned above though, we probably don't want to put all predicates in the same query, but rather do something like this:

```java
var results = doSearchWithoutTypoSupport(params);
// Fall back to the fuzzy, ngram-based search only when the
// strict search matched nothing at all.
if (results.total().hitCountLowerBound() == 0) {
    results = doSearchWithTypoSupportUsingNgrams(params);
}
return results;
```

PRs welcome :)
