
[Feature Request] Term boosting #268

Closed
jasonpolites opened this issue Jul 15, 2024 · 10 comments

Comments
@jasonpolites

Hi,

First off.. let me say this library is amazing. I spent some time trying to find a good client-side index, and MiniSearch is by far the best I found.

Now that I've buttered you up...

I am implementing a basic document similarity mechanism that just takes a field of an existing document and re-issues a search with the value of that field.

For example:

Imagine a document with the following:

{
  "description" : "foo bar baz bazooka"
}

Finding similar items would mean issuing a search like this:

const options = {
  fields: ['description'],
  combineWith: 'OR'
};

minisearch.search("foo bar baz bazooka", options);

This works, but in my case, the first few terms are more meaningful than the last few. If you imagine the corpus has two other documents:

[{
  "description" : "foo bar bosh"
},
{
  "description" : "baz bazooka bar"
}]

The second item would (likely) get a higher score from MiniSearch. In reality (in my case) the first document should be higher.

So.. what I really want is:

minisearch.search("foo^2 bar baz bazooka", options);

The foo^2 applies a boost to that term, similar to term boosting in Lucene.

@lucaong
Owner

lucaong commented Jul 15, 2024

Hi @jasonpolites , thanks for your kind words!

I am happy to say that it is already possible to apply term boosting with MiniSearch, although admittedly it is not too clear from the documentation. One way is using the boostDocument search option:

const searchOptions = {
  fields: ['description'],
  combineWith: 'OR',
  // Boost documents that contain the term 'foo'
  boostDocument: (docId, term) => (term === 'foo') ? 2 : 1
}

miniSearch.search('foo bar baz bazooka', searchOptions)

That said, I do agree that it would be nice to have an easier way to specify term boosting. While it won't be implemented as a string query syntax like in Lucene (for reasons outlined in this past discussion), it is still possible to add a search option. I will brainstorm and hopefully come up with a nice implementation for this feature.

@jasonpolites
Author

Thanks for the reply. My example was obviously not a good one, but the premise remains that a field with more matching terms will rank more highly (which generally makes complete sense). It may not be universal across languages, or even across use cases, but in my use case the position of the term in the string carries importance. So the first token is more important than the last one.

How would I implement this in the sample you provided? The callback for the boostDocument property takes a docId and a term, but is there a way to know the index of the term, and how many terms there are? (e.g. (docId, term, termIndex, termCount))

In my example, "foo" is only more important because it's the first term, not because it's specifically "foo". What I want to do is boost the first few terms in the search string. E.g.

first^3 second^2 third fourth

@lucaong
Owner

lucaong commented Jul 16, 2024

I had originally misunderstood your question, so I edited my answer a bit.

Boosting the first terms is not trivial, but feasible with something like this:

// Assuming you use the default tokenizer
const tokenize = MiniSearch.getDefault('tokenize')

// Function to search applying a boost to the first query terms
const searchWithBoost = (query, boostFactors = [], searchOptions = {}) => {
  const queryTerms = tokenize(query)

  const boosts = queryTerms.reduce((boosts, term, i) => {
    boosts[term] = boostFactors[i] || 1
    return boosts
  }, {})

  const searchOptionsWithBoost = {
    ...searchOptions,
    boostDocument: (docId, term) => boosts[term] || 1
  }

  return miniSearch.search(query, searchOptionsWithBoost)
}

// Usage, boosting the first term by a factor of 3 and the second by a factor of 2:
searchWithBoost('foo bar baz bazooka', [3, 2])

Admittedly this is not so simple, I will think about ways to improve this use case.

@jasonpolites
Author

Actually this is quite simple, and much better than what I have now: I am currently using some ugly regex to tokenize the search query, and this is a much more elegant approach.

I am not sure how common "term boosting" is as a requirement, and while a foo^2 style syntax is arguably the simplest (for users of your library), I think just adding a "Term Boosting" section to your docs with some sample code like this would be a quick fix for anyone else looking to do term boosting, ahead of a more "native" implementation.

I will try this out tonight. Thanks!

@jasonpolites
Author

jasonpolites commented Jul 17, 2024

OK.. I had to revise your example a bit, but got it working and it seems to do what I need.

I'm not sure what kind of dark magic you're weaving with the callback to the reduce method on the term array, but this call:

const boosts = queryTerms.reduce({}, (boosts, term, i) => {
  boosts[term] = boostFactors[i] || 1
});

...was a bit of a head-scratcher for me, as it didn't match the common signature of the reduce method.

I rewrote it a little, and I think the outcome is the same:

const tokenize = MiniSearch.getDefault('tokenize');

function searchWithBoost(query, boostFactors = [], options = {}) {

  const queryTerms = tokenize(query);

  const boosts = {};
  let boostIndex = 0;

  // Assign boost factors to the first few non-empty, not-yet-seen terms
  const reducer = (_accumulator, term, _i) => {
    if (boostIndex < boostFactors.length && term && term.length > 0 && !boosts[term.toLowerCase()]) {
      boosts[term.toLowerCase()] = boostFactors[boostIndex++] || 1
    }
  }

  queryTerms.reduce(reducer, queryTerms[0]);

  const searchOptionsWithBoost = {
    ...options,
    boostDocument: (_docId, term) => boosts[term] || 1
  }

  return miniSearch.search(query, searchOptionsWithBoost);
}

I did notice that the default tokenizer produces a lot of "empty string" tokens (i.e. ["foo", "", "", "", "bar", "", ""]) which I don't think harm anything (and the reason for my term.length > 0 check). I suspect this is because my query string often contains duplicate whitespace (\s) characters (I don't control the source text, and it's a mess). Could be a small optimization in the engine to omit/skip these empty tokens.

In my actual implementation I also skip any string tokens which "look like" numbers as they end up not being meaningful, but I left that out of my sample to avoid any confusion.
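For reference, the filtering I describe above looks roughly like this (the number check is only an illustration, not exactly what I use):

// Drop empty tokens and tokens that "look like" numbers before assigning
// boost factors (the regex here is only illustrative)
const looksLikeNumber = (token) => /^\d+([.,]\d+)?$/.test(token)

const meaningfulTerms = tokenize(query)
  .filter((term) => term.length > 0 && !looksLikeNumber(term))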

Thanks again! (feel free to close this issue, unless you want to keep it open as a reminder to look at a native API for this)

Edit:
I'm seeing some unexpected results (results that I think should be ranked more highly), but I need to debug to see if this is a problem with the approach, or just something amiss with my specific implementation

Edit 2
🤦‍♂️ Case sensitivity (lots of my query strings are all upper case.. like I said, it's a mess). Updated the original to match

boosts[term.toLowerCase()] = boostFactors[boostIndex++] || 1

@lucaong
Owner

lucaong commented Jul 17, 2024

I'm not sure what kind of dark magic you're weaving with the callback to the reduce method on the term array, but this call:

@jasonpolites you are right, sorry... I wrote that code in GitHub without testing and got the argument order wrong. I also missed a return. Dangers of not checking code 😬 Here's the correction:

// WRONG:
const boosts = queryTerms.reduce({}, (boosts, term, i) => {
  boosts[term] = boostFactors[i] || 1
})

// CORRECT:
const boosts = queryTerms.reduce((boosts, term, i) => {
  boosts[term] = boostFactors[i] || 1
  return boosts
}, {})

Basically, I am reducing the array of queryTerms into a map of term to boost factor. If we start from queryTerms = ['foo', 'bar', 'baz'] and boostFactors = [3, 2] we should end up with { foo: 3, bar: 2, baz: 1 }. I will go ahead and edit my code example above to avoid confusing other users reading this thread.

I did notice that the default tokenizer produces a lot of "empty string" tokens (i.e. ["foo", "", "", "", "bar", "", ""]) which I don't think harm anything (and the reason for my term.length > 0 check). I suspect this is because my query string often contains duplicate whitespace (\s) characters (I don't control the source text, and it's a mess). Could be a small optimization in the engine to omit/skip these empty tokens.

You actually spotted a bug here, thanks :) basically, a recently merged pull request missed a + in the regular expression. This is being fixed and a new patch release will be published later today.
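To illustrate the effect with a simplified split-based tokenizer (not the exact expression MiniSearch uses): without the +, every separator character produces its own split point, so runs of whitespace yield empty tokens.

// Without the "+", each extra space yields an empty token:
'foo   bar'.split(/[\s,.]/)   // => [ 'foo', '', '', 'bar' ]

// With the "+", contiguous separators collapse into a single split point:
'foo   bar'.split(/[\s,.]+/)  // => [ 'foo', 'bar' ]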

@lucaong
Owner

lucaong commented Jul 18, 2024

Version v7.0.2 is now released on NPM, fixing the regression that caused the tokenizer to produce empty terms when multiple contiguous spaces or punctuation characters are present in the input.

lucaong added a commit that referenced this issue Jul 19, 2024
Term boosting (giving greater or lower importance to specific query
terms) was previously not supported. It was technically possible by
using the `boostDocument` search option (as shown here: #268) but cumbersome and error prone.

This commit adds a new search option, `boostTerm`, which makes it a lot
easier to apply term boosting. The option is a function that is invoked
with each search term, and is expected to return a numeric boost factor.
@lucaong
Owner

lucaong commented Jul 19, 2024

I opened a pull request to introduce term boosting as a convenient search option: #274

Once I am happy with the implementation and the documentation, I will probably merge it and make a new release.

@jasonpolites this feature will make your use case a lot simpler to solve in the near future.
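As a rough sketch of how it would look for your case (the exact callback arguments are defined in the PR and may still change before it is merged, so treat the term index argument here as an assumption):

// Sketch of the proposed boostTerm search option: boost the first query
// term by 3 and the second by 2, leaving all other terms at 1
miniSearch.search('foo bar baz bazooka', {
  fields: ['description'],
  combineWith: 'OR',
  boostTerm: (term, i) => [3, 2][i] || 1
})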

@jasonpolites
Author

PR looks great. Very clean/simple. Thanks!

@lucaong
Owner

lucaong commented Jul 22, 2024

@jasonpolites version v7.1.0 is now published on NPM, including the term boosting feature. I will go ahead and close this issue, as it should now be properly addressed. Thanks a lot for reporting!

lucaong closed this as completed Jul 22, 2024