Adds preprocessor for incoming primo searches #143
Conversation
Most of my comments here are to confirm my assumptions about the change, but there are two things that I suspect might need a tweak (I'm open to pushback or hearing that I'm off base):
- I don't think the check for three elements in the `keyword?` method is doing anything, given how we are calling that method
- I think the comment in the `comma_handler` method isn't accurate in its example
Everything else seems fine, and I particularly like the structure of the `extract_phrase` method in SearchLogger - that feels nicely extensible for other systems to start contributing, while also managing non-production environments.
My last comment about re-organizing the GraphQL controller test file is non-blocking - if you agree that it seems worth doing, then I'm happy to make a ticket in the backlog for it.
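For reference, the extensible structure praised above is a per-source dispatch ahead of logging. A rough sketch of that shape - with the method signature and case branches inferred from this review rather than taken from the actual tacos code - might look like:

```ruby
# Hypothetical sketch of a per-source preprocessing step before a search is
# logged. `PreprocessorPrimo.to_tacos` appears in the diff under review; the
# surrounding method shape here is an assumption.
def extract_phrase(source, raw_query)
  case source
  when 'primo'
    # Normalize Primo's field/operator syntax down to the bare keywords.
    PreprocessorPrimo.to_tacos(raw_query)
  else
    # Default branch: other sources pass through unchanged.
    raw_query
  end
end
```

New sources can then contribute their own preprocessor class and a branch in the case statement without touching the logging logic itself.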
@@ -214,4 +214,30 @@ class GraphqlControllerTest < ActionDispatch::IntegrationTest
    assert_equal 'Transactional', json['data']['lookupTerm']['categories'].first['name']
    assert_in_delta 0.95, json['data']['lookupTerm']['categories'].first['confidence']
  end

  test 'primo searches use the preprocessor to extract actual keywords' do
Non-blocking: Would it make sense to have a ticket to re-organize the tests in this file? The two tests added here seem fine, and I see a related test on line 128 that hits the default part of the case statement. Most of the tests are concerned with the logSearchEvent function, but a handful in the middle check the lookupTerm function.
This feels like a ticket that would sit in a backlog forever, and then someone might delete it in a few years after realizing we either already fixed it along the way or it wasn't worth doing. I'm not saying you can't open a ticket, I just wouldn't encourage it. This feels like the kind of thing someone will clean up when they're already working in this suite and it bothers them, more readily than someone would pick it up just because we have a backlog ticket. I'd be happy to chat in a team meeting about whether this type of work should get backlog tickets though, as I'm open to different perspectives.
Works for me - if it starts bugging me more, I can always set up a quick re-organization of the test file separate from other work, at which point a ticket would be relevant.
split_query = query.split(';;')

if split_query.count > 1
  Rails.logger.debug('Multipart primo query detected')
I see that we are returning the presumably-unique string on line 18, which will allow us to get event counts within TACOS. Is there anything we can (or should) do with these debug messages? I don't see them enabled in the review app, and I doubt they'd be present in production.
I suspect it's no problem to leave them in, and they'll just show up for local development work - but in case I'm wrong I wanted to check.
They will show up in environments in which the log level is set to DEBUG, which is the default in development for this app.
In production we default to INFO, but can swap to DEBUG with an ENV change to see something in a PR build (or even in prod) for a bit:
https://github.com/MITLibraries/tacos/blob/main/config/environments/production.rb#L65
More info on Rails log levels
In some of our apps we have also put the log level under ENV control in development, for when we have a few too many debug logs for normal use. I believe in at least one app we run prod in debug log level mode. Consistency would probably be nice :)
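The ENV-controlled pattern described here is typically a one-liner in the environment config. This is a generic sketch; the ENV variable name is an assumption, not necessarily what tacos uses:

```ruby
# config/environments/production.rb (sketch)
Rails.application.configure do
  # Default to :info, but allow an ENV override so a PR build (or prod)
  # can be flipped to :debug temporarily without a code change.
  config.log_level = ENV.fetch('RAILS_LOG_LEVEL', 'info').downcase.to_sym
end
```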
Yeah, consistency would be good - but lack of it here isn't a reason for anything to change yet. Thanks for confirming that these won't cause an issue for us in prod.
app/models/preprocessor_primo.rb (Outdated)
# after we separate the incoming string into an array based on commas
def self.comma_handler(query_part_array)
  # Join the third to the end of the into a string and join by commas
  # ex: any,contains,I,am,a,search,with,lots,of,commas -> I am a search with lots of commas
I don't think this comment is accurate? When I submit the string `any,contains,I,am,a,search,with,lots,of,commas` I get back `I,am,a,search,with,lots,of,commas` from the API - which is the value I expected based on line 67 and the tests down below.
I'll take a look and correct the example as appropriate. This is likely because I added a second commit to change where the collapse of a multi-comma'd search gets handled and didn't update the comment. Good catch!
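The behavior observed above - the field and operator being dropped and the remainder rejoined with commas - would correspond roughly to the following sketch. This is a reconstruction from the review comments, not the actual tacos implementation:

```ruby
module PreprocessorPrimoSketch
  # Hypothetical reconstruction: drop the field ("any") and the operator
  # ("contains"), then rejoin whatever is left with commas.
  def self.comma_handler(query_part_array)
    query_part_array.drop(2).join(',')
  end
end

PreprocessorPrimoSketch.comma_handler(
  'any,contains,I,am,a,search,with,lots,of,commas'.split(',')
)
# => "I,am,a,search,with,lots,of,commas"
```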
app/models/preprocessor_primo.rb (Outdated)
def self.extract_keyword(query_part)
  query_part_array = query_part.split(',')

  return 'invalid primo query' unless query_part_array.count >= 3
We are returning a different value from this clause because we expect this condition to be erroneous, rather than an expected behavior that we need to gauge frequencies for?
Yes. And similarly to the feedback I provided on your open PR, maybe this means we log an exception to Sentry, as we don't expect this to happen... it's literally exceptional (unless it isn't, because it happens too much, at which point we don't understand the system and need to fix it).
I'll update this to have a Sentry log before merging.
I've updated this to send an event to Sentry if this unexpected condition happens, so we can better understand it.
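As a sketch of what that guard might look like with the Sentry call added - the message text and the surrounding method shape are assumptions; only `Sentry.capture_message` is a standard sentry-ruby call:

```ruby
# app/models/preprocessor_primo.rb (sketch of the guard clause only)
def self.extract_keyword(query_part)
  query_part_array = query_part.split(',')

  # We never expect Primo to hand us fewer than field,operator,keywords,
  # so record the surprise in Sentry before returning the shared term.
  unless query_part_array.count >= 3
    Sentry.capture_message("Invalid primo query: #{query_part}")
    return 'invalid primo query'
  end

  # ... continue extracting the keywords as shown above ...
end
```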
the_keywords = comma_handler(query_part_array)

return 'unhandled complex primo query' unless keyword?([query_part_array[0], query_part_array[1], the_keywords])
Something about this feels awkward to me. Explicitly composing the array in this way while calling `keyword?`, and then checking there to confirm that there are only three elements, doesn't seem like it accomplishes anything? If `the_keywords` doesn't get parsed correctly, the argument is still only a three-element array - but that third element might not be a single string - maybe `['foo', 'bar', ['baz', 'none']]` and not `['foo', 'bar', 'baz', 'none']`, since `the_keywords` would only ever be one thing?
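To illustrate the point being made here with hypothetical values: because the argument is composed as a literal three-element array, the count check inside `keyword?` can never fail on this call path.

```ruby
# Hypothetical walk-through of the call site quoted above.
parts        = 'any,contains,foo,bar'.split(',')  # ["any", "contains", "foo", "bar"]
the_keywords = parts.drop(2).join(',')            # "foo,bar" -- always a single string
argument     = [parts[0], parts[1], the_keywords] # always exactly three elements
argument.count # => 3, no matter how messy the original query was
```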
I'll take a look. There was a refactor involving this that may have been a bit lazy.
Yeah... I'm super confused what's going on and how anything works at all. I'll fix the behavior and/or tests and/or docs and let you know when this is ready... sorry for whatever this mess is!
I think I updated the docs and method names to make it more clear what is going on. If not, please let me know.
Thanks - I think the names now work a little better. I've got other thoughts about the test, but will share them below.
test 'keyword? returns false for input with more than 3 array elements' do
  # NOTE: this query entering tacos would work... but it would have been cleaned up prior to running
  # keyword? in our application via the normal flow
  input = 'any,contains,popcorn anomoly: why life on the moon is complex, and other cat facts'.split(',')
See above comment about the calling of this method never being able to match the input of this test - we're explicitly composing the argument as a three-element array, if I follow the logic above?
We're composing it in one place that way...but if the method is reused elsewhere or there is a refactor that changes how we are calling it, it's important to have tests about our expectations for the method independently (unless it's a private method) to ensure we don't break something by changing this behavior.
I'll take a look at this to see if there is a way to make it seem less weird though. I did a sort of late refactor, and it's possible I only halfway completed what was in my brain and thus caused this (if you look at the first commit alone and the changes in the second, you can probably see how it may have made sense to do it this way at first but might not after the change... I'll check).
Good point, and that's a fair reason to have a test like this.
Force-pushed from 6a4ff24 to e173db0.
@matt-bernhardt I've updated the documentation and made some other small changes. Please take a look when you can to see if this fully addresses your concerns.
Thanks for talking through my questions, for adding a Sentry call on the invalid query branch, and updating the code comments / method names. I think all my concerns have been addressed by this point.
Looking forward to seeing this in production, and continuing to iterate on how to receive traffic from multiple sources. This puts us on a good foundation for that sort of work.
app/models/preprocessor_primo.rb (Outdated)
# @param query [String] example `any,contains,this is a keyword search`
def self.to_tacos(query)
  # split on agreed upon joiner `;;`
  split_query = query.split(';;')
Given the filtering that happens before this, I don't think anything needs an adjustment - but I just peeked at our data and there are about a dozen examples of terms already that have `;;` in them. All of these are obviously from not-Primo, and that traffic would not get routed to this model, but it might be a pathway to keep in mind as we investigate anything marked as "unhandled complex primo query".
Hmmm. I'll take a closer look and see if there is a better separator to choose that is less likely to exist in an incoming search (I chose this; it's not something Primo provides by default, so we can adjust it). Thanks for noticing that.
`;;;` seems fine. I've asked AdamS to update the primo integration to use that and will push that change here as well.
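For illustration, with the revised joiner the multipart split would look like this. The query values are made up; only the `;;;` separator comes from the discussion above:

```ruby
# Hypothetical multipart Primo query using the agreed ';;;' joiner.
query = 'any,contains,cheese mites;;;any,contains,cat facts'

split_query = query.split(';;;')
# => ["any,contains,cheese mites", "any,contains,cat facts"]

split_query.count > 1 # => true, so this would be logged as a multipart query
```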
Why are these changes being introduced:
* Primo has a more complex keyword search capability than tacos is able to understand
* Rather than adding complexity to what tacos understands to match primo, we are adding a normalization step via source preprocessors

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TCO-113

How does this address that need:
* This adds our first source preprocessor to handle incoming searches from primo ui
* It does not yet handle complex searches (targeted field searching or multiple keywords being combined with boolean logic)
* When a complex search is detected, we attach it to a shared term 'unhandled complex primo query' so we can better understand the volume of these more complex queries. This should provide us with some initial data to help us prioritize handling more complex queries within tacos.
Adds some additional methods and documentation to clarify why some things work as they do and where a few future pitfalls exist.

Removes some unnecessary comments that were unintentionally included previously.
Force-pushed from e173db0 to ff94e13.
Summary of changes (please refer to commit messages for full details)

Developer
- Accessibility: all issues introduced by these changes have been resolved or opened as new issues (link to those issues in the Pull Request details above)
- Documentation
- ENV
- Stakeholders
- Dependencies and migrations
  - NO dependencies are updated
  - NO migrations are included

Reviewer
- Code
- Documentation
- Testing