[Confluence] Change CQL to remove stop words and search each word individually #494
What's being changed:
This PR modifies the CQL query used by the Confluence provider. It does three main things:
- Removes stop words from the query.
- Searches for each remaining word individually.
- Joins the words with the AND operator first, and resorts to the OR operator if there were no results.
Removing stop words keeps the connector from searching for the most common words in the language, which add no real value to the results. These include words such as "I", "or", "he", "him", "both", "does", "be", etc. There is no value in searching Confluence for these words individually.
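As a rough illustration of the stop-word step, here is a minimal sketch. The `STOP_WORDS` set and the `extract_keywords` helper are hypothetical names, the list is abbreviated to the words mentioned above, and the punctuation stripping is an assumption inferred from the "401(k)" to "401k" example; the PR's actual implementation may differ.

```python
import re

# Hypothetical, abbreviated stop-word list; the list used by the PR is not shown here.
STOP_WORDS = {"i", "or", "he", "him", "both", "does", "be"}


def extract_keywords(query: str) -> list[str]:
    """Lowercase the query, strip punctuation, and drop stop words."""
    keywords = []
    for token in query.lower().split():
        # Assumption: punctuation inside a token is removed, so "401(k)" becomes "401k".
        word = re.sub(r"[^a-z0-9]", "", token)
        if word and word not in STOP_WORDS:
            keywords.append(word)
    return keywords


print(extract_keywords("does walmart match 401(k) contributions"))
# ['walmart', 'match', '401k', 'contributions']
```

Note that this sketch only keeps ASCII letters and digits, so it would need adjusting for non-English content.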
Prior to this PR, the connector used a CQL query like this:
text ~ "does walmart match 401(k) contributions"
This PR changes it to first try:
text ~ "walmart" AND text ~ "match" AND text ~ "401k" AND text ~ "contributions"
Any document containing all the keywords is likely more relevant than one containing only one of them, so the AND operator is prioritized. But if no results come back using the AND operator, it tries the search again using the OR operator between the words, like this:
text ~ "walmart" OR text ~ "match" OR text ~ "401k" OR text ~ "contributions"
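The AND-then-OR strategy could be sketched roughly as follows. The names `build_cql`, `search_with_fallback`, and `run_query` are hypothetical; `run_query` stands in for whatever function sends a CQL string to Confluence and returns a list of results, which is not shown in this description.

```python
from typing import Callable


def build_cql(keywords: list[str], operator: str) -> str:
    """Join one `text ~ "word"` clause per keyword with AND or OR."""
    clauses = [f'text ~ "{word}"' for word in keywords]
    return f" {operator} ".join(clauses)


def search_with_fallback(
    keywords: list[str], run_query: Callable[[str], list]
) -> list:
    """Try the stricter AND query first; retry with OR only if it finds nothing."""
    if not keywords:
        return []
    results = run_query(build_cql(keywords, "AND"))
    if not results:
        results = run_query(build_cql(keywords, "OR"))
    return results


def fake_run_query(cql: str) -> list:
    # Stand-in for the Confluence call: pretend only the OR query matches.
    return ["401 (k) plan overview"] if " OR " in cql else []


keywords = ["walmart", "match", "401k", "contributions"]
print(build_cql(keywords, "AND"))
print(search_with_fallback(keywords, fake_run_query))
```

With the fake query function, the AND query returns nothing, so the second call uses the OR query and returns the one stand-in document.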
This query is an interesting example, because in the test document I have in Confluence that I am expecting to match, the retirement plan is referred to as "401 (k)". The test document does not match the first query, so the search falls back to using OR and finds the relevant document. However, in many other cases I would expect documents to be found with the first CQL query using AND.
I believe that using AND will improve the results when there is more data in the user's Confluence account. With just a handful of test documents, using OR tends to return the relevant results, but when there is a lot of real-world data in the account, using OR would likely return more noise.
How did you test this change (include any code snippets, API requests, screenshots, or gifs):
I have been testing in Coral with a custom connector defined in the Cohere Dashboard, which uses an ngrok tunnel to reach the connector running in my local environment. I have also made numerous requests to the connector with an HTTP client so I could see more directly what the connector returns.