Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds preprocessor for incoming primo searches #143

Merged
merged 5 commits into from
Dec 5, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
76 changes: 76 additions & 0 deletions app/models/preprocessor_primo.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
# frozen_string_literal: true

# PreprocessorPrimo handles manipulating incoming data from the Primo UI into a structure that TACOS can work with
class PreprocessorPrimo
# to_tacos processes raw incoming query from Primo, looks at each part to see if it is a keyword anywhere search
# Any portion that is not a keyword anywhere search drops the entire search from TACOS, logging
# as the shared Term `unhandled complex primo query` to allow us to track how frequently we are
# dropping terms so we can come back later to build out more complex handing if this is common enough
# to warrant the additional work.
# @param query [String] example `any,contains,this is a keyword search`
def self.to_tacos(query)
# Primo and TACOS agreed upon joiner is `;;;`
split_query = query.split(';;;')

if split_query.count > 1
Rails.logger.debug('Multipart primo query detected')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that we are returning the presumably-unique string on line 18, which will allow us to get event counts within TACOS. Is there anything we can (or should) do with these debug messages? I don't see them enabled in the review app, and I doubt they'd be present in production.

I suspect it's no problem to leave them in, and they'll just show up for local development work - but in case I'm wrong I wanted to check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They will show up in environments in which the log level is set to DEBUG which is the default in development in this app.

In production we default to INFO, but can swap to DEBUG with an ENV change to see something in a PR build (or even in prod) for a bit:
https://github.com/MITLibraries/tacos/blob/main/config/environments/production.rb#L65

More info on Rails log levels

Some of our apps we have put the ENV control of log level in development for when we have a few too many debug logs for normal use. I believe in at least one app we run prod in debug log level mode. Consistency would probably be nice :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, consistency would be good - but lack of it here isn't a reason for anything to change yet. Thanks for confirming that these won't cause an issue for us in prod.


# As we are not currently handling complex queries, always set the value to something we can track frequency of
'unhandled complex primo query'
else
Rails.logger.debug('Simple primo query detected')

extract_keyword(query)
end
end

# keyword? confirms whether a portion of a primo query is a keyword search
# Note: we expect only 3 elements to this array for simple keyword searches and that arrays created from the Primo
# input to be collapsed so commas in the original search have been handled via the comma_handler method
# @param query_part_array [Array] example ['any', 'contains', 'this is a keyword search']
# @return [Boolean]
def self.keyword?(query_part_array)
return false unless query_part_array.count == 3
return false unless query_part_array[0] == 'any'

# For now, we are allowing all variants of the second portion of the primo query input
# The expected values are: contains, exact, begins_with, equals
# Uncommenting the following statement would allow us to restrict to just the default 'contains' if desireable
#
# return false unless query_part_array[1] == 'contains'

true
end

# extract_keyword works at the level of a single keyword query input coming from primo and
# returns a string with just that keyword with the operators removed
# @param query_part [String] example `any,contains,this is a keyword search`
# @return [String] the extracted keyword phrase
def self.extract_keyword(query_part)
query_part_array = query_part.split(',')

# We don't anticipate this being a normal state so we are tracking it under the Term `invalid primo query` as well
# as sending an exception to Sentry so we can understand the context in which this happens if it does
if query_part_array.count < 3
Sentry.capture_message('PreprocessorPrimo: Invalid Primo query during keyword extraction')
return 'invalid primo query'
end

the_keywords = join_keyword_and_drop_extra_parts(query_part_array)

return 'unhandled complex primo query' unless keyword?([query_part_array[0], query_part_array[1], the_keywords])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something about this feels awkward to me. Explicitly composing the array in this way while calling keyword?, and then checking there to confirm that there are only three elements, doesn't seem like it accomplishes anything? If the_keywords doesn't get parsed correctly, the argument is still only a three-element array - but that third element might not be a single string - maybe ['foo', 'bar', ['baz', 'none']] and not ['foo', 'bar', 'baz', 'none'], since the_keywords would only ever be one thing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a look. There was a refactor involving this that may have been a bit lazy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... I'm super confused what's going on and how anything works at all. I'll fix the behavior and/or tests and/or docs and let you know when this is ready... sorry for whatever this mess is!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I updated the docs and names of methods to make this more clear what is going on. If not, please let me know.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - I think the names now work a little better. I've got other thoughts about the test, but will share them below.


the_keywords
end

# join_keyword_and_drop_extra_parts handles the logic necessary to join searches that contain commas into a single ruby string
# after we separate the incoming string into an array based on commas
# @param query_part [String] example `['any', 'contains', 'this', 'is', 'a', 'keyword', 'search']`
# @return [String] example 'this,is,a,keyword,search'
def self.join_keyword_and_drop_extra_parts(query_part_array)
# For complex queries, which we are not handling yet, we'll need to determine how TACOS should handle the final
# element of the input which will be a boolean operator. For now, we will have stopped processing those by this
# point during the initial logic in `to_tacos` that splits on `;;` and returns if the result is more than one query
query_part_array.slice(2..).join(',')
end
end
18 changes: 17 additions & 1 deletion app/models/search_logger.rb
Original file line number Diff line number Diff line change
Expand Up @@ -6,8 +6,24 @@ class SearchLogger
# Receives a phrase and source and creates a search event. Will find or create a term as needed.
# @return [SearchEvent] the newly created SearchEvent
def self.logevent(phrase, source)
term = Term.create_or_find_by!(phrase:)
term = Term.create_or_find_by!(phrase: extract_phrase(phrase, source))
term.calculate_categorizations
term.search_events.create!(source:)
end

# Coordinates `phrase` extraction from incoming data from each `source`. If no `source` is matched,
# passes through incoming `phrase`.
# Note: as it may become useful to test in a production environment, we match on patterns of sources
# rather than exact string matches. Example: `primo`, `primo-testing`, `primo-playground` are all handled
# with the same case.
def self.extract_phrase(phrase, source)
case source
when /primo/
Rails.logger.debug('Primo case detected')
PreprocessorPrimo.to_tacos(phrase)
else
Rails.logger.debug('default case detected')
phrase
end
end
end
26 changes: 26 additions & 0 deletions test/controllers/graphql_controller_test.rb
Original file line number Diff line number Diff line change
Expand Up @@ -214,4 +214,30 @@ class GraphqlControllerTest < ActionDispatch::IntegrationTest
assert_equal 'Transactional', json['data']['lookupTerm']['categories'].first['name']
assert_in_delta 0.95, json['data']['lookupTerm']['categories'].first['confidence']
end

test 'primo searches use the preprocessor to extract actual keywords' do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking: Would it make sense to have a ticket to re-organize the tests in this file? The two tests added here seem fine, and I see a related test on line 128 that hits the default part of the case statement. Most of the tests are concerned with the logSearchEvent function, but a handful in the middle check the lookupTerm function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like a ticket that would sit in a backlog forever and then someone might delete it in a few years after realizing we either already fixed it along the way or it wasn't worth it. I'm not saying you can't open a ticket, I just wouldn't encourage it. This feels like something that when someone is working on this suite and feels like cleaning it up they will more than someone would work on it just because we have a backlog ticket. I'd be happy to chat in a team meeting about whether this type of work should get backlog tickets though as I'm open to different perspectives.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me - if it starts bugging me more, I can always set up a quick re-organization of the test file separate from other work, at which point a ticket would be relevant.

post '/graphql', params: { query: '{
logSearchEvent(sourceSystem: "primo-test",
searchTerm: "any,contains,Super cool search") {
phrase
}
}' }

json = response.parsed_body

assert_equal 'Super cool search', json['data']['logSearchEvent']['phrase']
end

test 'primo searches use the preprocessor and logs complex queries to a specific term' do
post '/graphql', params: { query: '{
logSearchEvent(sourceSystem: "primo-test",
searchTerm: "any,contains,Super cool search;;;any,contains,uh oh this is getting complicated") {
phrase
}
}' }

json = response.parsed_body

assert_equal 'unhandled complex primo query', json['data']['logSearchEvent']['phrase']
end
end
102 changes: 102 additions & 0 deletions test/models/preprocessor_primo_test.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,102 @@
# frozen_string_literal: true

#
require 'test_helper'

class PreprocessorPrimoTest < ActiveSupport::TestCase
test 'to_tacos returns unhandled for complex queries' do
input = 'any,contains,space;;;any,contains,madness'

assert_equal('unhandled complex primo query', PreprocessorPrimo.to_tacos(input))
end

test 'to_tacos returns unhandled for targeted field queries' do
input = 'title,contains,space'

assert_equal('unhandled complex primo query', PreprocessorPrimo.to_tacos(input))
end

test 'to_tacos returns phrase ready for tacos for simple keyword input' do
input = 'any,contains,space'

assert_equal('space', PreprocessorPrimo.to_tacos(input))
end

test 'to_tacos returns phrase ready for complex keyword input' do
input = 'any,contains,Yan, F., Krantz, P., Sung, Y., Kjaergaard, M., Campbell, D.L., Orlando, T.P., Gustavsson, S. and Oliver, W.D., 2018. Tunable coupling scheme for implementing high-fidelity two-qubit gates. Physical Review Applied, 10(5), p.054062.'
expected = 'Yan, F., Krantz, P., Sung, Y., Kjaergaard, M., Campbell, D.L., Orlando, T.P., Gustavsson, S. and Oliver, W.D., 2018. Tunable coupling scheme for implementing high-fidelity two-qubit gates. Physical Review Applied, 10(5), p.054062.'

assert_equal(expected, PreprocessorPrimo.to_tacos(input))
end

test 'keyword? returns true for any contains phrase pattern' do
input = 'any,contains,popcorn anomoly'.split(',')

assert(PreprocessorPrimo.keyword?(input))
end

test 'keyword? returns false for input with more than 3 array elements' do
# NOTE: this query entering tacos would work... but it would have been cleaned up prior to running
# keyword? in our application via the normal flow
input = 'any,contains,popcorn anomoly: why life on the moon is complex, and other cat facts'.split(',')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above comment about the calling of this method never being able to match the input of this test - we're explicitly composing the argument as a three-element array, if I follow the logic above?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're composing it in one place that way...but if the method is reused elsewhere or there is a refactor that changes how we are calling it, it's important to have tests about our expectations for the method independently (unless it's a private method) to ensure we don't break something by changing this behavior.

I'll take a look at this to see if there is a way to make it seem less weird though. I did a sort of late refactor and it's possibly I only half way completed what was in my brain and thus caused this (if you look at the first commit alone and the changes in the second you can probably see how it may have made sense to do it this way at first but might not after the change... I'll check).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, and that's a fair reason to have a test like this.


assert_not(PreprocessorPrimo.keyword?(input))
end

test 'keyword? returns false for input with less than 3 array elements' do
input = 'any,contains'.split(',')

assert_not(PreprocessorPrimo.keyword?(input))
end

test 'keyword? returns false for non-any input' do
input = 'title,contains,popcorn anomoly'.split(',')

assert_not(PreprocessorPrimo.keyword?(input))
end

test 'keyword? returns true for non-contains inputs' do
# NOTE: this portion of they primo query focuses on how to handle the phrase. All the words, any of the words,
# the exact phrase, begins_with. For now we treat them all the same as standard keyword queries.
input = 'any,exact,popcorn anomoly'.split(',')

assert(PreprocessorPrimo.keyword?(input))
end

test 'extract keyword returns keyword for simple keywords' do
input = 'any,contains,popcorn anomoly'

assert_equal('popcorn anomoly', PreprocessorPrimo.extract_keyword(input))
end

test 'extract keyword returns keyword for simple non-contains keywords' do
input = 'any,exact,popcorn anomoly'

assert_equal('popcorn anomoly', PreprocessorPrimo.extract_keyword(input))
end

test 'extract keyword returns unhandled complex primo query for non-any searches' do
input = 'title,contains,popcorn anomoly'

assert_equal('unhandled complex primo query', PreprocessorPrimo.extract_keyword(input))
end

test 'extract keyword returns keyword for keywords with punctuation' do
input = 'any,contains,popcorn anomoly: a cats! life. on & mars!'

assert_equal('popcorn anomoly: a cats! life. on & mars!', PreprocessorPrimo.extract_keyword(input))
end

test 'extract keyword returns keyword for keywords with commas' do
input = 'any,contains,popcorn anomoly, and so can you'

assert_equal('popcorn anomoly, and so can you', PreprocessorPrimo.extract_keyword(input))
end

test 'extract keyword returns keyword for keywords with multiple commas and other punctuation' do
input = 'any,contains,popcorn anomoly: a cats! life. on & mars!, words, of {truth} (and, also not,)'

assert_equal('popcorn anomoly: a cats! life. on & mars!, words, of {truth} (and, also not,)',
PreprocessorPrimo.extract_keyword(input))
end
end
Loading