[Confluence] Change CQL to remove stop words and search each word individually #494

Merged (4 commits) on Oct 8, 2024


scottmx81
Contributor

What's being changed:

This PR modifies the CQL query used by the Confluence provider. It does three main things:

  • Remove stop words from the query
  • Search for each word individually instead of as a single phrase
  • Try an AND operator first, and fall back to an OR operator if there are no results

Removing stop words keeps the connector from searching for the most common words in the language, which add little or nothing to the results. These include words such as "I", "or", "he", "him", "both", "does", "be", etc. There is no value in searching Confluence for these words individually.
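As a rough illustration of the stop-word step, here is a minimal Python sketch. The `STOP_WORDS` set and the `extract_keywords` helper are hypothetical names, the list is deliberately tiny, and the punctuation handling (which turns "401(k)" into "401k") is an assumption inferred from the example queries in this description, not necessarily what the connector does.

```python
import re

# Hypothetical, deliberately small stop-word list for illustration only;
# the list actually used by the connector is not shown in this description.
STOP_WORDS = {"i", "or", "he", "him", "both", "does", "be"}


def extract_keywords(query: str) -> list[str]:
    """Lowercase the query, strip punctuation, and drop stop words."""
    # Assumption: punctuation is removed even inside a word,
    # so "401(k)" collapses to "401k".
    cleaned = re.sub(r"[^\w\s]", "", query.lower())
    return [word for word in cleaned.split() if word not in STOP_WORDS]


if __name__ == "__main__":
    print(extract_keywords("does walmart match 401(k) contributions"))
```

With these assumptions, the example query used in this description reduces to the four keywords walmart, match, 401k, and contributions.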

Prior to this PR, the connector used a CQL query like this:

text ~ "does walmart match 401(k) contributions"

This PR changes it to first try:

text ~ "walmart" AND text ~ "match" AND text ~ "401k" AND text ~ "contributions"

A document that contains all of the keywords is likely more relevant than one that contains only one of them, so the AND operator is tried first. If the AND query returns no results, the search is retried with an OR operator between the words, like this:

text ~ "walmart" OR text ~ "match" OR text ~ "401k" OR text ~ "contributions"
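The AND-then-OR strategy can be sketched in Python roughly as follows. `build_cql`, `search_with_fallback`, and the `run_search` callable are hypothetical names used only for illustration; `run_search` stands in for whatever function sends a CQL string to Confluence and returns a list of results. This is a sketch under those assumptions, not the connector's actual code.

```python
from typing import Callable


def build_cql(keywords: list[str], operator: str) -> str:
    """Join one `text ~ "word"` clause per keyword with AND or OR."""
    clauses = []
    for keyword in keywords:
        # Precaution: escape backslashes and double quotes so a keyword
        # cannot terminate the quoted CQL string early.
        escaped = keyword.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'text ~ "{escaped}"')
    return f" {operator} ".join(clauses)


def search_with_fallback(
    keywords: list[str],
    run_search: Callable[[str], list],
) -> list:
    """Try the stricter AND query first; retry with OR if it finds nothing."""
    if not keywords:
        return []
    results = run_search(build_cql(keywords, "AND"))
    if results:
        return results
    return run_search(build_cql(keywords, "OR"))


if __name__ == "__main__":
    # Stand-in search function that only "finds" something for OR queries.
    def fake_search(cql: str) -> list:
        return ["doc-1"] if " OR " in cql else []

    print(search_with_fallback(["walmart", "match", "401k"], fake_search))
```

Passing the search function in as a parameter keeps the fallback logic testable without a live Confluence instance.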

This query is an interesting example: in the Confluence test document that I expect it to match, the retirement plan is written as "401 (k)". That document does not match the first query, so the search falls back to OR, which does find it. In many other cases, however, I would expect the first CQL query, using AND, to find the relevant documents.

I believe that using AND will improve the results when there is more data in the user's Confluence account. With just a handful of test documents, OR tends to return the relevant results, but with a lot of real-world data in the account, OR would likely return more noise.

How did you test this change (include any code snippets, API requests, screenshots, or gifs):

I have been testing in Coral with a custom connector defined in the Cohere Dashboard, which uses an ngrok tunnel to reach the connector running in my local environment. I have also made numerous requests to the connector with an HTTP client so that I could see more directly what it returns.

scottmx81 requested a review from a team as a code owner on September 27, 2024 18:18
Collaborator

walterbm-cohere left a comment

Nice! Good addition

walterbm-cohere merged commit d8c57fa into cohere-ai:main on Oct 8, 2024
2 checks passed