Adds preprocessor for incoming primo searches #143

Merged
JPrevost merged 5 commits into main from tco-113-primo-preprocessor on Dec 5, 2024

Conversation

JPrevost (Member):

Why are these changes being introduced:

  • Primo has a more complex keyword search capability than tacos is able to understand
  • Rather than adding complexity to what tacos understands to match primo, we are adding a normalization step via source preprocessors

Relevant ticket(s):

How does this address that need:

  • This adds our first source preprocessor, handling incoming searches from the Primo UI (a sketch of its entry point follows this list)
  • It does not yet handle complex searches (targeted field searching or multiple keywords combined with boolean logic)
  • When a complex search is detected, we attach it to a shared term 'unhandled complex primo query' so we can better understand the volume of these more complex queries. This should provide us with some initial data to help us prioritize handling more complex queries within tacos.
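
A minimal sketch of the preprocessor's entry point, reconstructed from the diff hunks quoted in the review below (the class name PreprocessorPrimo is a placeholder, not necessarily the name used in the repo):

```ruby
# Sketch only: PreprocessorPrimo is a placeholder name; the body is reconstructed
# from the diff hunks quoted later in this conversation.
class PreprocessorPrimo
  # Primo sends queries like "any,contains,this is a keyword search"; multiple
  # clauses arrive joined by an agreed separator (";;" in this PR, later ";;;").
  def self.to_tacos(query)
    split_query = query.split(';;')

    if split_query.count > 1
      Rails.logger.debug('Multipart primo query detected')
      # Boolean / multi-clause searches are not parsed yet; they map to a shared
      # term so we can measure their volume.
      return 'unhandled complex primo query'
    end

    # extract_keyword (shown in the diffs below) pulls the keywords out of the
    # single remaining clause.
    extract_keyword(split_query.first)
  end
end
```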

Summary of changes (please refer to commit messages for full details)

Developer

Accessibility

  • ANDI or Wave has been run in accordance with our guide and
    all issues introduced by these changes have been resolved or opened
    as new issues (link to those issues in the Pull Request details above)
  • There are no accessibility implications to this change

Documentation

  • Project documentation has been updated, and yard output previewed
  • No documentation changes are needed

ENV

  • All new ENV is documented in README.
  • All new ENV has been added to Heroku Pipeline, Staging and Prod.
  • ENV has not changed.

Stakeholders

  • Stakeholder approval has been confirmed
  • Stakeholder approval is not needed

Dependencies and migrations

NO dependencies are updated

NO migrations are included

Reviewer

Code

  • I have confirmed that the code works as intended.
  • Any CodeClimate issues have been fixed or confirmed as
    added technical debt.

Documentation

  • The commit message is clear and follows our guidelines
    (not just this pull request message).
  • The documentation has been updated or is unnecessary.
  • New dependencies are appropriate or there were no changes.

Testing

  • There are appropriate tests covering any new functionality.
  • No additional test coverage is required.

@mitlib temporarily deployed to tacos-api-pipeline-pr-143 November 20, 2024 16:43 Inactive
@JPrevost temporarily deployed to tacos-api-pipeline-pr-143 November 21, 2024 13:49 Inactive
@matt-bernhardt self-assigned this Nov 22, 2024
@matt-bernhardt left a comment (Member):

Most of my comments here are to confirm my assumptions about the change, but there are two things that I suspect might need a tweak - but I'm open to pushback or hearing that I'm off base.

  1. I don't think the check for three elements in the keyword? method is doing anything, given how we are calling that method
  2. I think the comment in the comma_handler method isn't accurate in its example

Everything else seems fine, and I particularly like the structure of the extract_phrase method in SearchLogger - that feels nicely extensible for other systems to start contributing, while also managing non-production environments.

My last comment about re-organizing the GraphQL controller test file is non-blocking - if you agree that it seems worth doing, then I'm happy to make a ticket in the backlog for it.
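
For readers outside this PR, the per-source dispatch praised here presumably looks something like the sketch below; only the extract_phrase name and the existence of a case statement with a default branch come from this conversation, the rest is an assumption:

```ruby
# Hypothetical sketch of per-source dispatch in SearchLogger; the signature and
# the PreprocessorPrimo constant are assumptions, not quoted from the code.
def extract_phrase(source, raw_query)
  case source
  when 'primo'
    PreprocessorPrimo.to_tacos(raw_query)
  else
    # Default branch: sources without a preprocessor pass the query through as-is.
    raw_query
  end
end
```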

@@ -214,4 +214,30 @@ class GraphqlControllerTest < ActionDispatch::IntegrationTest
assert_equal 'Transactional', json['data']['lookupTerm']['categories'].first['name']
assert_in_delta 0.95, json['data']['lookupTerm']['categories'].first['confidence']
end

test 'primo searches use the preprocessor to extract actual keywords' do
Member:

Non-blocking: Would it make sense to have a ticket to re-organize the tests in this file? The two tests added here seem fine, and I see a related test on line 128 that hits the default part of the case statement. Most of the tests are concerned with the logSearchEvent function, but a handful in the middle check the lookupTerm function.

Member Author:

This feels like a ticket that would sit in a backlog forever and then someone might delete it in a few years after realizing we either already fixed it along the way or it wasn't worth it. I'm not saying you can't open a ticket, I just wouldn't encourage it. This feels like something that someone working on this suite who feels like cleaning it up will get to anyway, more readily than someone picking it up just because we have a backlog ticket. I'd be happy to chat in a team meeting about whether this type of work should get backlog tickets though, as I'm open to different perspectives.

Member:

Works for me - if it starts bugging me more, I can always set up a quick re-organization of the test file separate from other work, at which point a ticket would be relevant.

split_query = query.split(';;')

if split_query.count > 1
Rails.logger.debug('Multipart primo query detected')
Member:

I see that we are returning the presumably-unique string on line 18, which will allow us to get event counts within TACOS. Is there anything we can (or should) do with these debug messages? I don't see them enabled in the review app, and I doubt they'd be present in production.

I suspect it's no problem to leave them in, and they'll just show up for local development work - but in case I'm wrong I wanted to check.

Member Author:

They will show up in environments in which the log level is set to DEBUG, which is the default in development in this app.

In production we default to INFO, but can swap to DEBUG with an ENV change to see something in a PR build (or even in prod) for a bit:
https://github.com/MITLibraries/tacos/blob/main/config/environments/production.rb#L65

More info on Rails log levels

In some of our apps we have put ENV control of the log level in development, for when we have a few too many debug logs for normal use. I believe in at least one app we run prod at the debug log level. Consistency would probably be nice :)

Member:

Yeah, consistency would be good - but lack of it here isn't a reason for anything to change yet. Thanks for confirming that these won't cause an issue for us in prod.
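
As an aside for readers, the usual Rails pattern for the ENV-controlled log level described above looks roughly like this; the variable name RAILS_LOG_LEVEL is an assumption, see the linked production.rb for the actual setting:

```ruby
# config/environments/production.rb (sketch; the ENV variable name is an assumption)
config.log_level = ENV.fetch('RAILS_LOG_LEVEL', 'info').to_sym
```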

# after we separate the incoming string into an array based on commas
def self.comma_handler(query_part_array)
# Join the third to the end of the into a string and join by commas
# ex: any,contains,I,am,a,search,with,lots,of,commas -> I am a search with lots of commas
Member:

I don't think this comment is accurate? When I submit the string any,contains,I,am,a,search,with,lots,of,commas I get back I,am,a,search,with,lots,of,commas from the API - which is the value I expected based on line 67 and the tests down below.

Member Author:

I'll take a look and correct the example as appropriate. This is likely because I added a second commit to change where the collapse of a multi-comma'd search gets handled and didn't update the comment. Good catch!
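
For reference, a comma_handler matching the behaviour observed above would look roughly like this (a sketch, not the code under review):

```ruby
# Everything after the field and operator is re-joined with commas, not spaces.
def self.comma_handler(query_part_array)
  # ex: ["any", "contains", "I", "am", "a", "search"] -> "I,am,a,search"
  query_part_array[2..].join(',')
end
```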

def self.extract_keyword(query_part)
query_part_array = query_part.split(',')

return 'invalid primo query' unless query_part_array.count >= 3
Member:

We are returning a different value from this clause because we expect this condition to be erroneous, rather than an expected behavior that we need to gauge frequencies for?

Member Author:

Yes. And similarly to the feedback I provided to you on your open PR, maybe this means we log an exception to Sentry as we don't expect this to happen... it's literally exceptional (unless it isn't, because it happens too much, at which point we don't understand the system and need to fix it).

Member Author:

I'll update this to have a Sentry log before merging.

Member Author:

I've updated this to send an event to Sentry if this unexpected condition happens so we can better understand it.
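
Presumably something along these lines (the message text and exact call site are assumptions; Sentry.capture_message is the standard sentry-ruby call):

```ruby
# Sketch of the guard with the Sentry event described above.
unless query_part_array.count >= 3
  Sentry.capture_message("Invalid primo query: #{query_part}")
  return 'invalid primo query'
end
```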


the_keywords = comma_handler(query_part_array)

return 'unhandled complex primo query' unless keyword?([query_part_array[0], query_part_array[1], the_keywords])
Member:

Something about this feels awkward to me. Explicitly composing the array in this way while calling keyword?, and then checking there to confirm that there are only three elements, doesn't seem like it accomplishes anything? If the_keywords doesn't get parsed correctly, the argument is still only a three-element array - but that third element might not be a single string - maybe ['foo', 'bar', ['baz', 'none']] and not ['foo', 'bar', 'baz', 'none'], since the_keywords would only ever be one thing?

Member Author:

I'll take a look. There was a refactor involving this that may have been a bit lazy.

Member Author:

Yeah... I'm super confused what's going on and how anything works at all. I'll fix the behavior and/or tests and/or docs and let you know when this is ready... sorry for whatever this mess is!

Member Author:

I think I updated the docs and method names to make it clearer what is going on. If not, please let me know.

Member:

Thanks - I think the names now work a little better. I've got other thoughts about the test, but will share them below.

test 'keyword? returns false for input with more than 3 array elements' do
# NOTE: this query entering tacos would work... but it would have been cleaned up prior to running
# keyword? in our application via the normal flow
input = 'any,contains,popcorn anomoly: why life on the moon is complex, and other cat facts'.split(',')
Member:

See above comment about the calling of this method never being able to match the input of this test - we're explicitly composing the argument as a three-element array, if I follow the logic above?

Member Author:

We're composing it in one place that way...but if the method is reused elsewhere or there is a refactor that changes how we are calling it, it's important to have tests about our expectations for the method independently (unless it's a private method) to ensure we don't break something by changing this behavior.

I'll take a look at this to see if there is a way to make it seem less weird though. I did a sort of late refactor and it's possible I only halfway completed what was in my brain and thus caused this (if you look at the first commit alone and the changes in the second, you can probably see how it may have made sense to do it this way at first but might not after the change... I'll check).

Member:

Good point, and that's a fair reason to have a test like this.
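
For completeness, a speculative reading of what keyword? checks, based on the call site and this test; only the three-element check is confirmed by the discussion above, and the 'any'/'contains' comparison is an assumption:

```ruby
# Speculative sketch of keyword?.
def self.keyword?(query_part_array)
  return false unless query_part_array.count == 3

  # A plain keyword search in Primo uses the "any" field with "contains" precision;
  # anything else is treated as a complex (targeted) search.
  query_part_array[0] == 'any' && query_part_array[1] == 'contains'
end
```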

@JPrevost temporarily deployed to tacos-api-pipeline-pr-143 November 22, 2024 21:29 Inactive
@JPrevost force-pushed the tco-113-primo-preprocessor branch from 6a4ff24 to e173db0 on November 26, 2024 20:11
@JPrevost temporarily deployed to tacos-api-pipeline-pr-143 November 26, 2024 20:11 Inactive
@JPrevost (Member Author):

@matt-bernhardt I've updated the documentation and made some other small changes. Please take a look when you can to see if this fully addresses your concerns.

@mitlib temporarily deployed to tacos-api-pipeline-pr-143 December 2, 2024 20:40 Inactive
@matt-bernhardt left a comment (Member):

Thanks for talking through my questions, for adding a Sentry call on the invalid query branch, and for updating the code comments and method names. I think all my concerns have been addressed at this point.

Looking forward to seeing this in production, and continuing to iterate on how to receive traffic from multiple sources. This puts us on a good foundation for that sort of work.

# @param query [String] example `any,contains,this is a keyword search`
def self.to_tacos(query)
# split on agreed upon joiner `;;`
split_query = query.split(';;')
Member:

Given the filtering that happens before this, I don't think anything needs an adjustment - but I just peeked at our data and there are about a dozen examples of terms already that have ;; in them. All of these are obviously not from Primo, and that traffic would not get routed to this model, but it might be a pathway to keep in mind as we investigate anything marked as "unhandled complex primo query".

Member Author:

Hmmm. I'll take a closer look and see if there is a better separator to choose that is less likely to exist in an incoming search (I chose this; it's not something Primo provides by default, so we can adjust it). Thanks for noticing that.

Member Author:

;;; seems fine. I've asked AdamS to update the Primo integration to use that and will push that change here as well.
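
To illustrate the pathway flagged above, a hypothetical search that itself contains the old separator would be split into multiple clauses and land in the unhandled-complex bucket:

```ruby
# Invented example, not from the production data.
'any,contains,range anxiety;; charging'.split(';;')
# => ["any,contains,range anxiety", " charging"]  (count > 1, so treated as multipart)
```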


Why are these changes being introduced:

* Primo has a more complex keyword search capability than tacos is
  able to understand
* Rather than adding complexity to what tacos understands to match
  primo, we are adding a normalization step via source preprocessors

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TCO-113

How does this address that need:

* This adds our first source preprocessor to handle incoming searches
  from primo ui
* It does not yet handle complex searches (targeted field searching or
  multiple keywords being combined with boolean logic)
* When a complex search is detected, we attach it to a shared term
  'unhandled complex primo query' so we can better understand the
  volume of these more complex queries. This should provide us with some
  initial data to help us prioritize handling more complex queries
  within tacos.
Adds some additional methods and documentation to clarify why some things work as they do
and where a few future pitfalls exist

Removes some unnecessary comments that were unintentionally included previously
@JPrevost force-pushed the tco-113-primo-preprocessor branch from e173db0 to ff94e13 on December 5, 2024 20:52
@JPrevost temporarily deployed to tacos-api-pipeline-pr-143 December 5, 2024 20:52 Inactive
@JPrevost merged commit 5fbfd28 into main on Dec 5, 2024 (6 checks passed)
@JPrevost deleted the tco-113-primo-preprocessor branch December 5, 2024 21:01