-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds preprocessor for incoming primo searches #143
Changes from all commits
b2ea9d2
e41f3f2
ab886fb
6cace05
ff94e13
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
# frozen_string_literal: true | ||
|
||
# PreprocessorPrimo handles manipulating incoming data from the Primo UI into a structure that TACOS can work with | ||
class PreprocessorPrimo | ||
# to_tacos processes raw incoming query from Primo, looks at each part to see if it is a keyword anywhere search | ||
# Any portion that is not a keyword anywhere search drops the entire search from TACOS, logging | ||
# as the shared Term `unhandled complex primo query` to allow us to track how frequently we are | ||
# dropping terms so we can come back later to build out more complex handing if this is common enough | ||
# to warrant the additional work. | ||
# @param query [String] example `any,contains,this is a keyword search` | ||
def self.to_tacos(query) | ||
# Primo and TACOS agreed upon joiner is `;;;` | ||
split_query = query.split(';;;') | ||
|
||
if split_query.count > 1 | ||
Rails.logger.debug('Multipart primo query detected') | ||
|
||
# As we are not currently handling complex queries, always set the value to something we can track frequency of | ||
'unhandled complex primo query' | ||
else | ||
Rails.logger.debug('Simple primo query detected') | ||
|
||
extract_keyword(query) | ||
end | ||
end | ||
|
||
# keyword? confirms whether a portion of a primo query is a keyword search | ||
# Note: we expect only 3 elements to this array for simple keyword searches and that arrays created from the Primo | ||
# input to be collapsed so commas in the original search have been handled via the comma_handler method | ||
# @param query_part_array [Array] example ['any', 'contains', 'this is a keyword search'] | ||
# @return [Boolean] | ||
def self.keyword?(query_part_array) | ||
return false unless query_part_array.count == 3 | ||
return false unless query_part_array[0] == 'any' | ||
|
||
# For now, we are allowing all variants of the second portion of the primo query input | ||
# The expected values are: contains, exact, begins_with, equals | ||
# Uncommenting the following statement would allow us to restrict to just the default 'contains' if desireable | ||
# | ||
# return false unless query_part_array[1] == 'contains' | ||
|
||
true | ||
end | ||
|
||
# extract_keyword works at the level of a single keyword query input coming from primo and | ||
# returns a string with just that keyword with the operators removed | ||
# @param query_part [String] example `any,contains,this is a keyword search` | ||
# @return [String] the extracted keyword phrase | ||
def self.extract_keyword(query_part) | ||
query_part_array = query_part.split(',') | ||
|
||
# We don't anticipate this being a normal state so we are tracking it under the Term `invalid primo query` as well | ||
# as sending an exception to Sentry so we can understand the context in which this happens if it does | ||
if query_part_array.count < 3 | ||
Sentry.capture_message('PreprocessorPrimo: Invalid Primo query during keyword extraction') | ||
return 'invalid primo query' | ||
end | ||
|
||
the_keywords = join_keyword_and_drop_extra_parts(query_part_array) | ||
|
||
return 'unhandled complex primo query' unless keyword?([query_part_array[0], query_part_array[1], the_keywords]) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Something about this feels awkward to me. Explicitly composing the array in this way while calling There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll take a look. There was a refactor involving this that may have been a bit lazy. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah... I'm super confused what's going on and how anything works at all. I'll fix the behavior and/or tests and/or docs and let you know when this is ready... sorry for whatever this mess is! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think I updated the docs and names of methods to make this more clear what is going on. If not, please let me know. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks - I think the names now work a little better. I've got other thoughts about the test, but will share them below. |
||
|
||
the_keywords | ||
end | ||
|
||
# join_keyword_and_drop_extra_parts handles the logic necessary to join searches that contain commas into a single ruby string | ||
# after we separate the incoming string into an array based on commas | ||
# @param query_part [String] example `['any', 'contains', 'this', 'is', 'a', 'keyword', 'search']` | ||
# @return [String] example 'this,is,a,keyword,search' | ||
def self.join_keyword_and_drop_extra_parts(query_part_array) | ||
# For complex queries, which we are not handling yet, we'll need to determine how TACOS should handle the final | ||
# element of the input which will be a boolean operator. For now, we will have stopped processing those by this | ||
# point during the initial logic in `to_tacos` that splits on `;;` and returns if the result is more than one query | ||
query_part_array.slice(2..).join(',') | ||
end | ||
end |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -214,4 +214,30 @@ class GraphqlControllerTest < ActionDispatch::IntegrationTest | |
assert_equal 'Transactional', json['data']['lookupTerm']['categories'].first['name'] | ||
assert_in_delta 0.95, json['data']['lookupTerm']['categories'].first['confidence'] | ||
end | ||
|
||
test 'primo searches use the preprocessor to extract actual keywords' do | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Non-blocking: Would it make sense to have a ticket to re-organize the tests in this file? The two tests added here seem fine, and I see a related test on line 128 that hits the default part of the case statement. Most of the tests are concerned with the logSearchEvent function, but a handful in the middle check the lookupTerm function. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This feels like a ticket that would sit in a backlog forever and then someone might delete it in a few years after realizing we either already fixed it along the way or it wasn't worth it. I'm not saying you can't open a ticket, I just wouldn't encourage it. This feels like something that when someone is working on this suite and feels like cleaning it up they will more than someone would work on it just because we have a backlog ticket. I'd be happy to chat in a team meeting about whether this type of work should get backlog tickets though as I'm open to different perspectives. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Works for me - if it starts bugging me more, I can always set up a quick re-organization of the test file separate from other work, at which point a ticket would be relevant. |
||
post '/graphql', params: { query: '{ | ||
logSearchEvent(sourceSystem: "primo-test", | ||
searchTerm: "any,contains,Super cool search") { | ||
phrase | ||
} | ||
}' } | ||
|
||
json = response.parsed_body | ||
|
||
assert_equal 'Super cool search', json['data']['logSearchEvent']['phrase'] | ||
end | ||
|
||
test 'primo searches use the preprocessor and logs complex queries to a specific term' do | ||
post '/graphql', params: { query: '{ | ||
logSearchEvent(sourceSystem: "primo-test", | ||
searchTerm: "any,contains,Super cool search;;;any,contains,uh oh this is getting complicated") { | ||
phrase | ||
} | ||
}' } | ||
|
||
json = response.parsed_body | ||
|
||
assert_equal 'unhandled complex primo query', json['data']['logSearchEvent']['phrase'] | ||
end | ||
end |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,102 @@ | ||
# frozen_string_literal: true | ||
|
||
# | ||
require 'test_helper' | ||
|
||
class PreprocessorPrimoTest < ActiveSupport::TestCase | ||
test 'to_tacos returns unhandled for complex queries' do | ||
input = 'any,contains,space;;;any,contains,madness' | ||
|
||
assert_equal('unhandled complex primo query', PreprocessorPrimo.to_tacos(input)) | ||
end | ||
|
||
test 'to_tacos returns unhandled for targeted field queries' do | ||
input = 'title,contains,space' | ||
|
||
assert_equal('unhandled complex primo query', PreprocessorPrimo.to_tacos(input)) | ||
end | ||
|
||
test 'to_tacos returns phrase ready for tacos for simple keyword input' do | ||
input = 'any,contains,space' | ||
|
||
assert_equal('space', PreprocessorPrimo.to_tacos(input)) | ||
end | ||
|
||
test 'to_tacos returns phrase ready for complex keyword input' do | ||
input = 'any,contains,Yan, F., Krantz, P., Sung, Y., Kjaergaard, M., Campbell, D.L., Orlando, T.P., Gustavsson, S. and Oliver, W.D., 2018. Tunable coupling scheme for implementing high-fidelity two-qubit gates. Physical Review Applied, 10(5), p.054062.' | ||
expected = 'Yan, F., Krantz, P., Sung, Y., Kjaergaard, M., Campbell, D.L., Orlando, T.P., Gustavsson, S. and Oliver, W.D., 2018. Tunable coupling scheme for implementing high-fidelity two-qubit gates. Physical Review Applied, 10(5), p.054062.' | ||
|
||
assert_equal(expected, PreprocessorPrimo.to_tacos(input)) | ||
end | ||
|
||
test 'keyword? returns true for any contains phrase pattern' do | ||
input = 'any,contains,popcorn anomoly'.split(',') | ||
|
||
assert(PreprocessorPrimo.keyword?(input)) | ||
end | ||
|
||
test 'keyword? returns false for input with more than 3 array elements' do | ||
# NOTE: this query entering tacos would work... but it would have been cleaned up prior to running | ||
# keyword? in our application via the normal flow | ||
input = 'any,contains,popcorn anomoly: why life on the moon is complex, and other cat facts'.split(',') | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See above comment about the calling of this method never being able to match the input of this test - we're explicitly composing the argument as a three-element array, if I follow the logic above? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We're composing it in one place that way...but if the method is reused elsewhere or there is a refactor that changes how we are calling it, it's important to have tests about our expectations for the method independently (unless it's a private method) to ensure we don't break something by changing this behavior. I'll take a look at this to see if there is a way to make it seem less weird though. I did a sort of late refactor and it's possibly I only half way completed what was in my brain and thus caused this (if you look at the first commit alone and the changes in the second you can probably see how it may have made sense to do it this way at first but might not after the change... I'll check). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good point, and that's a fair reason to have a test like this. |
||
|
||
assert_not(PreprocessorPrimo.keyword?(input)) | ||
end | ||
|
||
test 'keyword? returns false for input with less than 3 array elements' do | ||
input = 'any,contains'.split(',') | ||
|
||
assert_not(PreprocessorPrimo.keyword?(input)) | ||
end | ||
|
||
test 'keyword? returns false for non-any input' do | ||
input = 'title,contains,popcorn anomoly'.split(',') | ||
|
||
assert_not(PreprocessorPrimo.keyword?(input)) | ||
end | ||
|
||
test 'keyword? returns true for non-contains inputs' do | ||
# NOTE: this portion of they primo query focuses on how to handle the phrase. All the words, any of the words, | ||
# the exact phrase, begins_with. For now we treat them all the same as standard keyword queries. | ||
input = 'any,exact,popcorn anomoly'.split(',') | ||
|
||
assert(PreprocessorPrimo.keyword?(input)) | ||
end | ||
|
||
test 'extract keyword returns keyword for simple keywords' do | ||
input = 'any,contains,popcorn anomoly' | ||
|
||
assert_equal('popcorn anomoly', PreprocessorPrimo.extract_keyword(input)) | ||
end | ||
|
||
test 'extract keyword returns keyword for simple non-contains keywords' do | ||
input = 'any,exact,popcorn anomoly' | ||
|
||
assert_equal('popcorn anomoly', PreprocessorPrimo.extract_keyword(input)) | ||
end | ||
|
||
test 'extract keyword returns unhandled complex primo query for non-any searches' do | ||
input = 'title,contains,popcorn anomoly' | ||
|
||
assert_equal('unhandled complex primo query', PreprocessorPrimo.extract_keyword(input)) | ||
end | ||
|
||
test 'extract keyword returns keyword for keywords with punctuation' do | ||
input = 'any,contains,popcorn anomoly: a cats! life. on & mars!' | ||
|
||
assert_equal('popcorn anomoly: a cats! life. on & mars!', PreprocessorPrimo.extract_keyword(input)) | ||
end | ||
|
||
test 'extract keyword returns keyword for keywords with commas' do | ||
input = 'any,contains,popcorn anomoly, and so can you' | ||
|
||
assert_equal('popcorn anomoly, and so can you', PreprocessorPrimo.extract_keyword(input)) | ||
end | ||
|
||
test 'extract keyword returns keyword for keywords with multiple commas and other punctuation' do | ||
input = 'any,contains,popcorn anomoly: a cats! life. on & mars!, words, of {truth} (and, also not,)' | ||
|
||
assert_equal('popcorn anomoly: a cats! life. on & mars!, words, of {truth} (and, also not,)', | ||
PreprocessorPrimo.extract_keyword(input)) | ||
end | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that we are returning the presumably-unique string on line 18, which will allow us to get event counts within TACOS. Is there anything we can (or should) do with these debug messages? I don't see them enabled in the review app, and I doubt they'd be present in production.
I suspect it's no problem to leave them in, and they'll just show up for local development work - but in case I'm wrong I wanted to check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They will show up in environments in which the log level is set to DEBUG which is the default in development in this app.
In production we default to INFO, but can swap to DEBUG with an ENV change to see something in a PR build (or even in prod) for a bit:
https://github.com/MITLibraries/tacos/blob/main/config/environments/production.rb#L65
More info on Rails log levels
Some of our apps we have put the ENV control of log level in development for when we have a few too many debug logs for normal use. I believe in at least one app we run prod in debug log level mode. Consistency would probably be nice :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, consistency would be good - but lack of it here isn't a reason for anything to change yet. Thanks for confirming that these won't cause an issue for us in prod.