[Feature Request] Term boosting #268
Comments
Hi @jasonpolites, thanks for your kind words! I am happy to say that it is already possible to apply term boosting with MiniSearch, although admittedly it is not too clear from the documentation. One way is using the `boostDocument` search option:

const searchOptions = {
  fields: ['description'],
  combineWith: 'OR',
  // Boost documents that contain the term 'foo'
  boostDocument: (docId, term) => (term === 'foo') ? 2 : 1
}
miniSearch.search('foo bar baz bazooka', searchOptions)

Finally, I do agree that it would be nice to have an easier way to specify term boosting. While it won't be implemented as a string query syntax like in Lucene (for reasons outlined in this past discussion), it is still possible to add a search option. I will brainstorm and hopefully come up with a nice implementation for this feature.
Thanks for the reply. My example was obviously not a good one, but the premise remains that a field with more matching terms will rank more highly (which generally makes complete sense). It may not be universal across languages, or even across use cases, but in my use case the position of the term in the string carries importance. So the first token is more important than the last one. How would I implement this in the sample you provided? The callback for the In my example, "foo" is only more important because it's the first term, not because it's specifically "foo". What I want to do is boost the first few terms in the search string. E.g.
I had originally misunderstood your question, so I edited my answer a bit. Boosting the first terms is not trivial, but feasible with something like this:

// Assuming you use the default tokenizer
const tokenize = MiniSearch.getDefault('tokenize')

// Function to search applying a boost to the first query terms
const searchWithBoost = (query, boostFactors = [], searchOptions = {}) => {
  const queryTerms = tokenize(query)
  const boosts = queryTerms.reduce((boosts, term, i) => {
    boosts[term] = boostFactors[i] || 1
    return boosts
  }, {})
  const searchOptionsWithBoost = {
    ...searchOptions,
    boostDocument: (docId, term) => boosts[term] || 1
  }
  return miniSearch.search(query, searchOptionsWithBoost)
}

// Usage, boosting the first term by a factor of 3 and the second by a factor of 2:
searchWithBoost('foo bar baz bazooka', [3, 2])

Admittedly this is not so simple, I will think about ways to improve this use case.
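The position-based boost table built in the snippet above can be tried in isolation, without an index. The following is only a minimal sketch: it assumes a whitespace tokenizer as a stand-in for MiniSearch's default one, and lowercases terms on the assumption that indexed terms are lowercased. `simpleTokenize` and `buildBoosts` are hypothetical helper names, not part of MiniSearch.

```javascript
// Stand-in tokenizer (assumption): splits on whitespace only.
const simpleTokenize = (text) => text.split(/\s+/).filter((t) => t.length > 0)

// Map each distinct query term to a boost factor based on its position.
// Terms past the end of boostFactors get the neutral factor 1.
const buildBoosts = (query, boostFactors = []) => {
  const boosts = new Map()
  simpleTokenize(query).forEach((rawTerm, i) => {
    const term = rawTerm.toLowerCase()
    // If a term repeats, keep the boost of its first occurrence
    if (!boosts.has(term)) {
      boosts.set(term, boostFactors[i] || 1)
    }
  })
  return boosts
}

const boosts = buildBoosts('Foo bar baz bazooka', [3, 2])

// Callback with the same (docId, term) shape as the boostDocument option above
const boostDocument = (_docId, term) => boosts.get(term) || 1

console.log(boosts)
```

A `Map` is used instead of a plain object so that query terms such as `constructor` cannot collide with inherited object properties.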
Actually this is quite simple, and much better than what I have now as I am using some ugly regex to tokenize the search query, and this is a much more elegant approach. I am not sure how common "term boosting" is as a requirement, and while a I will try this out tonight. Thanks!
OK.. I had to revise your example a bit, but got it working and it seems to do what I need. I'm not sure what kind of dark magic you're weaving with the callback to the

const boosts = queryTerms.reduce({}, (boosts, term, i) => {
  boosts[term] = boostFactors[i] || 1
});

...was a bit of a head-scratcher for me as it didn't match with the common signature for the reduce method. I rewrote it a little, and I think the outcome is the same:

const tokenize = MiniSearch.getDefault('tokenize');

searchWithBoost(query, boostFactors = [], options = {}) {
  const queryTerms = tokenize(query);
  const boosts = {};
  let boostIndex = 0;
  const reducer = (_accumulator, term, _i) => {
    if (boostIndex < boostFactors.length && term && term.length > 0 && !boosts[term]) {
      boosts[term.toLowerCase()] = boostFactors[boostIndex++] || 1
    }
  }
  queryTerms.reduce(reducer, queryTerms[0]);
  const searchOptionsWithBoost = {
    ...options,
    boostDocument: (_docId, term) => boosts[term] || 1
  }
  return miniSearch.search(query, searchOptionsWithBoost);
}

I did notice that the default In my actual implementation I also skip any string tokens which "look like" numbers as they end up not being meaningful, but I left that out of my sample to avoid any confusion. Thanks again! (feel free to close this issue, unless you want to keep it open as a reminder to look at a native API for this)

Edit: Edit 2
boosts[term.toLowerCase()] = boostFactors[boostIndex++] || 1
@jasonpolites you are right, sorry... I wrote that code in GitHub without testing and got the argument order wrong. I also missed a `return` statement:

// WRONG:
const boosts = queryTerms.reduce({}, (boosts, term, i) => {
  boosts[term] = boostFactors[i] || 1
})

// CORRECT:
const boosts = queryTerms.reduce((boosts, term, i) => {
  boosts[term] = boostFactors[i] || 1
  return boosts
}, {})

Basically, I am reducing the array of
You actually spotted a bug here, thanks :) basically, a recently merged pull request missed a |
Version |
Term boosting (giving greater or lower importance to specific query terms) was previously not supported. It was technically possible by using the `boostDocument` search option (as shown here: #268) but cumbersome and error prone. This commit adds a new search option, `boostTerm`, which makes it a lot easier to apply term boosting. The option is a function that is invoked with each search term, and is expected to return a numeric boost factor.
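For the use case discussed in this thread, a `boostTerm` callback might look like the sketch below. The first variant relies only on what the commit message states: the function is invoked with each search term and returns a numeric factor. The positional variant additionally assumes that the callback receives the term's index in the query as a second argument, which should be checked against the MiniSearch documentation. `makePositionalBoost` is a hypothetical helper name.

```javascript
// Variant 1: boost one specific term, using only the term argument.
const boostFoo = (term) => (term === 'foo' ? 2 : 1)

// Variant 2: boost by position in the query.
// Assumption: the callback is also passed the term's index as a second argument.
const makePositionalBoost = (boostFactors = []) =>
  (_term, i) => boostFactors[i] || 1

const boostTerm = makePositionalBoost([3, 2])

// Direct calls, simulating the query 'foo bar baz bazooka'
const queryTerms = ['foo', 'bar', 'baz', 'bazooka']
const byTerm = queryTerms.map((term) => boostFoo(term))
const byPosition = queryTerms.map((term, i) => boostTerm(term, i))
console.log(byTerm, byPosition)

// With a MiniSearch instance (not created here), the option would be passed as:
//   miniSearch.search('foo bar baz bazooka', { boostTerm })
```

Returning 1 for every term without an explicit factor keeps those terms neutral, so only the leading terms change the ranking.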
I opened a pull request to introduce term boosting as a convenient search option: #274. Once I am happy with the implementation and the documentation, I will probably merge it and make a new release. @jasonpolites this feature will make your use case a lot simpler to solve in the near future.
PR looks great. Very clean/simple. Thanks!
@jasonpolites version |
Hi,
First off.. let me say this library is amazing. I spent some time trying to find a good client-side index, and MiniSearch is by far the best I found.
Now that I've buttered you up...
I am implementing a basic document similarity mechanism that just takes the field of an existing document, and re-issues a search with the value of that field.
For example:
Imagine a document with the following:
Finding similar items would mean issuing a search like this:
This works, but in my case, the first few terms are more meaningful than the last few. If you imagine the corpus has two other documents:
The second item would (likely) get a higher score from MiniSearch. In reality (in my case) the first document should be higher. So.. what I really want is:
The `foo^2` applies a boost to that term, similar to term boosting in Lucene.