
[Feature Request] Term boosting #268

Closed
jasonpolites opened this issue Jul 15, 2024 · 10 comments

Comments
@jasonpolites

Hi,

First off.. let me say this library is amazing. I spent some time trying to find a good client-side index, and MiniSearch is by far the best I found.

Now that I've buttered you up...

I am implementing a basic document similarity mechanism that just takes a field of an existing document and re-issues a search with the value of that field.

For example:

Imagine a document with the following:

{
  "description" : "foo bar baz bazooka"
}

Finding similar items would mean issuing a search like this:

const options = {
  fields: ['description'],
  combineWith: 'OR'
};

minisearch.search("foo bar baz bazooka", options);

This works, but in my case, the first few terms are more meaningful than the last few. If you imagine the corpus has two other documents:

[{
  "description" : "foo bar bosh"
},
{
  "description" : "baz bazooka bar"
}]

The second item would (likely) get a higher score from MiniSearch. In reality (in my case) the first document should be higher.

So.. what I really want is:

minisearch.search("foo^2 bar baz bazooka", options);

The foo^2 applies a boost to that term, similar to term boosting in Lucene.

@lucaong
Owner

lucaong commented Jul 15, 2024

Hi @jasonpolites , thanks for your kind words!

I am happy to say that it is already possible to apply term boosting with MiniSearch, although admittedly it is not too clear from the documentation. One way is using the boostDocument search option:

const searchOptions = {
  fields: ['description'],
  combineWith: 'OR',
  // Boost documents that contain the term 'foo'
  boostDocument: (docId, term) => (term === 'foo') ? 2 : 1
}

miniSearch.search('foo bar baz bazooka', searchOptions)

That said, I do agree that it would be nice to have an easier way to specify term boosting. While it won't be implemented as a string query syntax like in Lucene (for reasons outlined in this past discussion), it is still possible to add a search option. I will brainstorm and hopefully come up with a nice implementation for this feature.

@jasonpolites
Author

Thanks for the reply. My example was obviously not a good one, but the premise remains that a field with more matching terms will rank more highly (which generally makes complete sense). It may not be universal across languages, or even across use cases, but in my use case the position of the term in the string carries importance. So the first token is more important than the last one.

How would I implement this in the sample you provided? The callback for the boostDocument property takes a docId and a term, but is there a way to know the index of the term, and how many terms there are? (e.g. (docId, term, termIndex, termCount))

In my example, "foo" is only more important because it's the first term, not because it's specifically "foo". What I want to do is boost the first few terms in the search string. E.g.

first^3 second^2 third fourth

@lucaong
Owner

lucaong commented Jul 16, 2024

I had originally misunderstood your question, so I edited my answer a bit.

Boosting the first terms is not trivial, but feasible with something like this:

// Assuming you use the default tokenizer
const tokenize = MiniSearch.getDefault('tokenize')

// Function to search applying a boost to the first query terms
const searchWithBoost = (query, boostFactors = [], searchOptions = {}) => {
  const queryTerms = tokenize(query)

  const boosts = queryTerms.reduce((boosts, term, i) => {
    boosts[term] = boostFactors[i] || 1
    return boosts
  }, {})

  const searchOptionsWithBoost = {
    ...searchOptions,
    boostDocument: (docId, term) => boosts[term] || 1
  }

  return miniSearch.search(query, searchOptionsWithBoost)
}

// Usage, boosting the first term by a factor of 3 and the second by a factor of 2:
searchWithBoost('foo bar baz bazooka', [3, 2])

Admittedly this is not so simple, I will think about ways to improve this use case.

@jasonpolites
Author

Actually this is quite simple, and much better than what I have now: I am currently using some ugly regex to tokenize the search query, and this is a much more elegant approach.

I am not sure how common "term boosting" is as a requirement, and while a foo^2 style syntax is arguably the simplest (for users of your library), I think just adding a "Term Boosting" section to your docs with some sample code like this would be a quick fix for anyone else looking to do term boosting, ahead of a more "native" implementation.

I will try this out tonight. Thanks!

@jasonpolites
Author

jasonpolites commented Jul 17, 2024

OK.. I had to revise your example a bit, but got it working and it seems to do what I need.

I'm not sure what kind of dark magic you're weaving with the callback to the reduce method on the term array, but this call:

const boosts = queryTerms.reduce({}, (boosts, term, i) => {
  boosts[term] = boostFactors[i] || 1
});

...was a bit of a head-scratcher for me, as it didn't match the common signature of the reduce method.

I rewrote it a little, and I think the outcome is the same:

const tokenize = MiniSearch.getDefault('tokenize');

function searchWithBoost(query, boostFactors = [], options = {}) {

  const queryTerms = tokenize(query);

  const boosts = {};
  let boostIndex = 0;

  // Assign boost factors to the first few non-empty, not-yet-seen terms
  const reducer = (_accumulator, term, _i) => {
    if (boostIndex < boostFactors.length && term && term.length > 0 && !boosts[term.toLowerCase()]) {
      boosts[term.toLowerCase()] = boostFactors[boostIndex++] || 1
    }
  }

  queryTerms.reduce(reducer, queryTerms[0]);

  const searchOptionsWithBoost = {
    ...options,
    boostDocument: (_docId, term) => boosts[term] || 1
  }

  return miniSearch.search(query, searchOptionsWithBoost);
}

I did notice that the default tokenizer produces a lot of "empty string" tokens (i.e. ["foo", "", "", "", "bar", "", ""]) which I don't think harm anything (and the reason for my term.length > 0 check). I suspect this is because my query string often contains duplicate whitespace (\s) characters (I don't control the source text, and it's a mess). Could be a small optimization in the engine to omit/skip these empty tokens.

In my actual implementation I also skip any string tokens which "look like" numbers as they end up not being meaningful, but I left that out of my sample to avoid any confusion.
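For reference, the filtering I describe above looks roughly like this (the number check is only an illustration, not exactly what I use):

// Drop empty tokens and tokens that "look like" numbers before assigning
// boost factors (the regex here is only illustrative)
const looksLikeNumber = (token) => /^\d+([.,]\d+)?$/.test(token)

const meaningfulTerms = tokenize(query)
  .filter((term) => term.length > 0 && !looksLikeNumber(term))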

Thanks again! (feel free to close this issue, unless you want to keep it open as a reminder to look at a native API for this)

Edit:
I'm seeing some unexpected results (results that I think should be ranked more highly), but I need to debug to see if this is a problem with the approach, or just something amiss with my specific implementation

Edit 2
🤦‍♂️ Case sensitivity (lots of my query strings are all upper case.. like I said, it's a mess). Updated the original to match

boosts[term.toLowerCase()] = boostFactors[boostIndex++] || 1

@lucaong
Owner

lucaong commented Jul 17, 2024

I'm not sure what kind of dark magic you're weaving with the callback to the reduce method on the term array, but this call:

@jasonpolites you are right, sorry... I wrote that code in GitHub without testing and got the argument order wrong. I also missed a return. Dangers of not checking code 😬 Here's the correction:

// WRONG:
const boosts = queryTerms.reduce({}, (boosts, term, i) => {
  boosts[term] = boostFactors[i] || 1
})

// CORRECT:
const boosts = queryTerms.reduce((boosts, term, i) => {
  boosts[term] = boostFactors[i] || 1
  return boosts
}, {})

Basically, I am reducing the array of queryTerms into a map of term to boost factor. If we start from queryTerms = ['foo', 'bar', 'baz'] and boostFactors = [3, 2] we should end up with { foo: 3, bar: 2, baz: 1 }. I will go ahead and edit my code example above to avoid confusing other users reading this thread.

I did notice that the default tokenizer produces a lot of "empty string" tokens (i.e. ["foo", "", "", "", "bar", "", ""]) which I don't think harm anything (and the reason for my term.length > 0 check). I suspect this is because my query string often contains duplicate whitespace (\s) characters (I don't control the source text, and it's a mess). Could be a small optimization in the engine to omit/skip these empty tokens.

You actually spotted a bug here, thanks :) basically, a recently merged pull request missed a + in the regular expression. This is being fixed and a new patch release will be published later today.
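To illustrate the effect with a simplified split-based tokenizer (not the exact expression MiniSearch uses): without the +, every separator character produces its own split point, so runs of whitespace yield empty tokens.

// Without the "+", each extra space yields an empty token:
'foo   bar'.split(/[\s,.]/)   // => [ 'foo', '', '', 'bar' ]

// With the "+", contiguous separators collapse into a single split point:
'foo   bar'.split(/[\s,.]+/)  // => [ 'foo', 'bar' ]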

@lucaong
Owner

lucaong commented Jul 18, 2024

Version v7.0.2 is now released on NPM, fixing the regression that caused the tokenizer to produce empty terms when multiple contiguous spaces or punctuation characters are present in the input.

lucaong added a commit that referenced this issue Jul 19, 2024
Term boosting (giving greater or lower importance to specific query
terms) was previously not supported. It was technically possible by
using the `boostDocument` search option (as shown here: #268) but cumbersome and error prone.

This commit adds a new search option, `boostTerm`, which makes it a lot
easier to apply term boosting. The option is a function that is invoked
with each search term, and is expected to return a numeric boost factor.
@lucaong
Owner

lucaong commented Jul 19, 2024

I opened a pull request to introduce term boosting as a convenient search option: #274

Once I am happy with the implementation and the documentation, I will probably merge it and make a new release.

@jasonpolites this feature will make your use case a lot simpler to solve in the near future.
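As a rough sketch of how it would look for your case (the exact callback arguments are defined in the PR and may still change before it is merged, so treat the term index argument here as an assumption):

// Sketch of the proposed boostTerm search option: boost the first query
// term by 3 and the second by 2, leaving all other terms at 1
miniSearch.search('foo bar baz bazooka', {
  fields: ['description'],
  combineWith: 'OR',
  boostTerm: (term, i) => [3, 2][i] || 1
})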

@jasonpolites
Author

PR looks great. Very clean/simple. Thanks!

@lucaong
Owner

lucaong commented Jul 22, 2024

@jasonpolites version v7.1.0 is now published on NPM, including the term boosting feature. I will go ahead and close this issue, as it should now be properly addressed. Thanks a lot for reporting!

lucaong closed this as completed Jul 22, 2024