Adds preprocessor for incoming primo searches #143

Merged
JPrevost merged 5 commits into main from tco-113-primo-preprocessor on Dec 5, 2024

Conversation

JPrevost (Member):

Why are these changes being introduced:

  • Primo has a more complex keyword search capability than tacos is able to understand
  • Rather than adding complexity to what tacos understands to match primo, we are adding a normalization step via source preprocessors

Relevant ticket(s):

How does this address that need:

  • This adds our first source preprocessor, handling incoming searches from the Primo UI (a sketch of its entry point follows this list)
  • It does not yet handle complex searches (targeted field searching or multiple keywords combined with boolean logic)
  • When a complex search is detected, we attach it to a shared term 'unhandled complex primo query' so we can better understand the volume of these more complex queries. This should provide us with some initial data to help us prioritize handling more complex queries within tacos.
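
A minimal sketch of the preprocessor's entry point, reconstructed from the diff hunks quoted in the review below (the class name PreprocessorPrimo is a placeholder, not necessarily the name used in the repo):

```ruby
# Sketch only: PreprocessorPrimo is a placeholder name; the body is reconstructed
# from the diff hunks quoted later in this conversation.
class PreprocessorPrimo
  # Primo sends queries like "any,contains,this is a keyword search"; multiple
  # clauses arrive joined by an agreed separator (";;" in this PR, later ";;;").
  def self.to_tacos(query)
    split_query = query.split(';;')

    if split_query.count > 1
      Rails.logger.debug('Multipart primo query detected')
      # Boolean / multi-clause searches are not parsed yet; they map to a shared
      # term so we can measure their volume.
      return 'unhandled complex primo query'
    end

    # extract_keyword (shown in the diffs below) pulls the keywords out of the
    # single remaining clause.
    extract_keyword(split_query.first)
  end
end
```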

Summary of changes (please refer to commit messages for full details)

Developer

Accessibility

  • ANDI or Wave has been run in accordance with our guide and
    all issues introduced by these changes have been resolved or opened
    as new issues (link to those issues in the Pull Request details above)
  • There are no accessibility implications to this change

Documentation

  • Project documentation has been updated, and yard output previewed
  • No documentation changes are needed

ENV

  • All new ENV is documented in README.
  • All new ENV has been added to Heroku Pipeline, Staging and Prod.
  • ENV has not changed.

Stakeholders

  • Stakeholder approval has been confirmed
  • Stakeholder approval is not needed

Dependencies and migrations

NO dependencies are updated

NO migrations are included

Reviewer

Code

  • I have confirmed that the code works as intended.
  • Any CodeClimate issues have been fixed or confirmed as
    added technical debt.

Documentation

  • The commit message is clear and follows our guidelines
    (not just this pull request message).
  • The documentation has been updated or is unnecessary.
  • New dependencies are appropriate or there were no changes.

Testing

  • There are appropriate tests covering any new functionality.
  • No additional test coverage is required.

@mitlib temporarily deployed to tacos-api-pipeline-pr-143 November 20, 2024 16:43 Inactive
@JPrevost temporarily deployed to tacos-api-pipeline-pr-143 November 21, 2024 13:49 Inactive
@matt-bernhardt self-assigned this Nov 22, 2024
@matt-bernhardt left a comment (Member):

Most of my comments here are to confirm my assumptions about the change, but there are two things that I suspect might need a tweak - but I'm open to pushback or hearing that I'm off base.

  1. I don't think the check for three elements in the keyword? method is doing anything, given how we are calling that method
  2. I think the comment in the comma_handler method isn't accurate in its example

Everything else seems fine, and I particularly like the structure of the extract_phrase method in SearchLogger - that feels nicely extensible for other systems to start contributing, while also managing non-production environments.

My last comment about re-organizing the GraphQL controller test file is non-blocking - if you agree that it seems worth doing, then I'm happy to make a ticket in the backlog for it.
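
For readers outside this PR, the per-source dispatch praised here presumably looks something like the sketch below; only the extract_phrase name and the existence of a case statement with a default branch come from this conversation, the rest is an assumption:

```ruby
# Hypothetical sketch of per-source dispatch in SearchLogger; the signature and
# the PreprocessorPrimo constant are assumptions, not quoted from the code.
def extract_phrase(source, raw_query)
  case source
  when 'primo'
    PreprocessorPrimo.to_tacos(raw_query)
  else
    # Default branch: sources without a preprocessor pass the query through as-is.
    raw_query
  end
end
```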

@@ -214,4 +214,30 @@ class GraphqlControllerTest < ActionDispatch::IntegrationTest
assert_equal 'Transactional', json['data']['lookupTerm']['categories'].first['name']
assert_in_delta 0.95, json['data']['lookupTerm']['categories'].first['confidence']
end

test 'primo searches use the preprocessor to extract actual keywords' do
Member:

Non-blocking: Would it make sense to have a ticket to re-organize the tests in this file? The two tests added here seem fine, and I see a related test on line 128 that hits the default part of the case statement. Most of the tests are concerned with the logSearchEvent function, but a handful in the middle check the lookupTerm function.

Member Author:

This feels like a ticket that would sit in a backlog forever and then someone might delete it in a few years after realizing we either already fixed it along the way or it wasn't worth it. I'm not saying you can't open a ticket, I just wouldn't encourage it. This feels like something that someone working on this suite who feels like cleaning it up will get to anyway, more readily than someone picking it up just because we have a backlog ticket. I'd be happy to chat in a team meeting about whether this type of work should get backlog tickets though, as I'm open to different perspectives.

Member:

Works for me - if it starts bugging me more, I can always set up a quick re-organization of the test file separate from other work, at which point a ticket would be relevant.

split_query = query.split(';;')

if split_query.count > 1
Rails.logger.debug('Multipart primo query detected')
Member:

I see that we are returning the presumably-unique string on line 18, which will allow us to get event counts within TACOS. Is there anything we can (or should) do with these debug messages? I don't see them enabled in the review app, and I doubt they'd be present in production.

I suspect it's no problem to leave them in, and they'll just show up for local development work - but in case I'm wrong I wanted to check.

Member Author:

They will show up in environments in which the log level is set to DEBUG, which is the default in development in this app.

In production we default to INFO, but can swap to DEBUG with an ENV change to see something in a PR build (or even in prod) for a bit:
https://github.com/MITLibraries/tacos/blob/main/config/environments/production.rb#L65

More info on Rails log levels

In some of our apps we have put ENV control of the log level in development, for when we have a few too many debug logs for normal use. I believe in at least one app we run prod at the debug log level. Consistency would probably be nice :)

Member:

Yeah, consistency would be good - but lack of it here isn't a reason for anything to change yet. Thanks for confirming that these won't cause an issue for us in prod.
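
As an aside for readers, the usual Rails pattern for the ENV-controlled log level described above looks roughly like this; the variable name RAILS_LOG_LEVEL is an assumption, see the linked production.rb for the actual setting:

```ruby
# config/environments/production.rb (sketch; the ENV variable name is an assumption)
config.log_level = ENV.fetch('RAILS_LOG_LEVEL', 'info').to_sym
```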

# after we separate the incoming string into an array based on commas
def self.comma_handler(query_part_array)
# Join the third to the end of the into a string and join by commas
# ex: any,contains,I,am,a,search,with,lots,of,commas -> I am a search with lots of commas
Member:

I don't think this comment is accurate? When I submit the string any,contains,I,am,a,search,with,lots,of,commas I get back I,am,a,search,with,lots,of,commas from the API - which is the value I expected based on line 67 and the tests down below.

Member Author:

I'll take a look and correct the example as appropriate. This is likely because I added a second commit to change where the collapse of a multi-comma'd search gets handled and didn't update the comment. Good catch!
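
For reference, a comma_handler matching the behaviour observed above would look roughly like this (a sketch, not the code under review):

```ruby
# Everything after the field and operator is re-joined with commas, not spaces.
def self.comma_handler(query_part_array)
  # ex: ["any", "contains", "I", "am", "a", "search"] -> "I,am,a,search"
  query_part_array[2..].join(',')
end
```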

def self.extract_keyword(query_part)
query_part_array = query_part.split(',')

return 'invalid primo query' unless query_part_array.count >= 3
Member:

We are returning a different value from this clause because we expect this condition to be erroneous, rather than an expected behavior that we need to gauge frequencies for?

Member Author:

Yes. And similarly to the feedback I provided to you on your open PR, maybe this means we log an exception to Sentry as we don't expect this to happen... it's literally exceptional (unless it isn't, because it happens too much, at which point we don't understand the system and need to fix it).

Member Author:

I'll update this to have a Sentry log before merging.

Member Author:

I've updated this to send an event to Sentry if this unexpected condition happens so we can better understand it.
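
Presumably something along these lines (the message text and exact call site are assumptions; Sentry.capture_message is the standard sentry-ruby call):

```ruby
# Sketch of the guard with the Sentry event described above.
unless query_part_array.count >= 3
  Sentry.capture_message("Invalid primo query: #{query_part}")
  return 'invalid primo query'
end
```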


the_keywords = comma_handler(query_part_array)

return 'unhandled complex primo query' unless keyword?([query_part_array[0], query_part_array[1], the_keywords])
Member:

Something about this feels awkward to me. Explicitly composing the array in this way while calling keyword?, and then checking there to confirm that there are only three elements, doesn't seem like it accomplishes anything? If the_keywords doesn't get parsed correctly, the argument is still only a three-element array - but that third element might not be a single string - maybe ['foo', 'bar', ['baz', 'none']] and not ['foo', 'bar', 'baz', 'none'], since the_keywords would only ever be one thing?

Member Author:

I'll take a look. There was a refactor involving this that may have been a bit lazy.

Member Author:

Yeah... I'm super confused what's going on and how anything works at all. I'll fix the behavior and/or tests and/or docs and let you know when this is ready... sorry for whatever this mess is!

Member Author:

I think I updated the docs and method names to make it clearer what is going on. If not, please let me know.

Member:

Thanks - I think the names now work a little better. I've got other thoughts about the test, but will share them below.

test 'keyword? returns false for input with more than 3 array elements' do
# NOTE: this query entering tacos would work... but it would have been cleaned up prior to running
# keyword? in our application via the normal flow
input = 'any,contains,popcorn anomoly: why life on the moon is complex, and other cat facts'.split(',')
Member:

See above comment about the calling of this method never being able to match the input of this test - we're explicitly composing the argument as a three-element array, if I follow the logic above?

Member Author:

We're composing it in one place that way...but if the method is reused elsewhere or there is a refactor that changes how we are calling it, it's important to have tests about our expectations for the method independently (unless it's a private method) to ensure we don't break something by changing this behavior.

I'll take a look at this to see if there is a way to make it seem less weird though. I did a sort of late refactor and it's possible I only halfway completed what was in my brain and thus caused this (if you look at the first commit alone and the changes in the second, you can probably see how it may have made sense to do it this way at first but might not after the change... I'll check).

Member:

Good point, and that's a fair reason to have a test like this.
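
For completeness, a speculative reading of what keyword? checks, based on the call site and this test; only the three-element check is confirmed by the discussion above, and the 'any'/'contains' comparison is an assumption:

```ruby
# Speculative sketch of keyword?.
def self.keyword?(query_part_array)
  return false unless query_part_array.count == 3

  # A plain keyword search in Primo uses the "any" field with "contains" precision;
  # anything else is treated as a complex (targeted) search.
  query_part_array[0] == 'any' && query_part_array[1] == 'contains'
end
```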

@JPrevost temporarily deployed to tacos-api-pipeline-pr-143 November 22, 2024 21:29 Inactive
@JPrevost force-pushed the tco-113-primo-preprocessor branch from 6a4ff24 to e173db0 on November 26, 2024 20:11
@JPrevost temporarily deployed to tacos-api-pipeline-pr-143 November 26, 2024 20:11 Inactive
@JPrevost (Member Author):

@matt-bernhardt I've updated the documentation and made some other small changes. Please take a look when you can to see if this fully addresses your concerns.

@mitlib temporarily deployed to tacos-api-pipeline-pr-143 December 2, 2024 20:40 Inactive
@matt-bernhardt left a comment (Member):

Thanks for talking through my questions, for adding a Sentry call on the invalid query branch, and for updating the code comments and method names. I think all my concerns have been addressed at this point.

Looking forward to seeing this in production, and continuing to iterate on how to receive traffic from multiple sources. This puts us on a good foundation for that sort of work.

# @param query [String] example `any,contains,this is a keyword search`
def self.to_tacos(query)
# split on agreed upon joiner `;;`
split_query = query.split(';;')
Member:

Given the filtering that happens before this, I don't think anything needs an adjustment - but I just peeked at our data and there are about a dozen examples of terms already that have ;; in them. All of these are obviously not from Primo, and that traffic would not get routed to this model, but it might be a pathway to keep in mind as we investigate anything marked as "unhandled complex primo query".

Member Author:

Hmmm. I'll take a closer look and see if there is a better separator to choose that is less likely to exist in an incoming search (I chose this; it's not something Primo provides by default, so we can adjust it). Thanks for noticing that.

Member Author:

;;; seems fine. I've asked AdamS to update the Primo integration to use that and will push that change here as well.
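
To illustrate the pathway flagged above, a hypothetical search that itself contains the old separator would be split into multiple clauses and land in the unhandled-complex bucket:

```ruby
# Invented example, not from the production data.
'any,contains,range anxiety;; charging'.split(';;')
# => ["any,contains,range anxiety", " charging"]  (count > 1, so treated as multipart)
```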


Why are these changes being introduced:

* Primo has a more complex keyword search capability than tacos is
  able to understand
* Rather than adding complexity to what tacos understands to match
  primo, we are adding a normalization step via source preprocessors

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TCO-113

How does this address that need:

* This adds our first source preprocessor to handle incoming searches
  from primo ui
* It does not yet handle complex searches (targeted field searching or
  multiple keywords being combined with boolean logic)
* When a complex search is detected, we attach it to a shared term
  'unhandled complex primo query' so we can better understand the
  volume of these more complex queries. This should provide us with some
  initial data to help us prioritize handling more complex queries
  within tacos.
Adds some additional methods and documentation to clarify why some things work as they do
and where a few future pitfalls exist

Removes some unnecessary comments that were unintentionally included previously
@JPrevost force-pushed the tco-113-primo-preprocessor branch from e173db0 to ff94e13 on December 5, 2024 20:52
@JPrevost temporarily deployed to tacos-api-pipeline-pr-143 December 5, 2024 20:52 Inactive
@JPrevost merged commit 5fbfd28 into main on Dec 5, 2024 (6 checks passed)
@JPrevost deleted the tco-113-primo-preprocessor branch December 5, 2024 21:01