Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Detections model #107

Merged
merged 4 commits into from
Sep 20, 2024
Merged

Implement Detections model #107

merged 4 commits into from
Sep 20, 2024

Conversation

matt-bernhardt
Copy link
Member

@matt-bernhardt matt-bernhardt commented Sep 18, 2024

This defines the Detections model, which links our Term and Detector records. The details have gotten a bit messier than I expected, and some changes to the class diagram may be necessary. This PR updates that diagram to help make it clear what's changing:

First, the confidence value is no longer stored as part of the Detections record. Because a Detector might be mapped to multiple Categories, each with its own confidence value, but the Detections record has no relationship to Category, there's no way to comprehensively cache the appropriate confidence value here. The values will still be used during the Categorization process, but we'll have to look them up in the graph rather than cache them with the Detection.

Second, the various record_* methods are defined in the Term model, and not in the Detection model.

I've tried to write thorough tests confirming the various constraints, and walking through the intended uses of this model. I'm not sure that I've gotten everything, but hopefully nothing is glaringly wrong.

Developer

Ticket(s)

https://mitlibraries.atlassian.net/browse/TCO-33

Accessibility

  • ANDI or Wave has been run in accordance to our guide and
    all issues introduced by these changes have been resolved or opened
    as new issues (link to those issues in the Pull Request details above)
  • There are no accessibility implications to this change

Documentation

  • Project documentation has been updated, and yard output previewed
  • No documentation changes are needed

ENV

  • All new ENV is documented in README.
  • All new ENV has been added to Heroku Pipeline, Staging and Prod.
  • ENV has not changed.

Stakeholders

  • Stakeholder approval has been confirmed
  • Stakeholder approval is not needed

Dependencies and migrations

NO dependencies are updated

YES migrations are included

Reviewer

Code

  • I have confirmed that the code works as intended.
  • Any CodeClimate issues have been fixed or confirmed as
    added technical debt.

Documentation

  • The commit message is clear and follows our guidelines
    (not just this pull request message).
  • The documentation has been updated or is unnecessary.
  • New dependencies are appropriate or there were no changes.

Testing

  • There are appropriate tests covering any new functionality.
  • No additional test coverage is required.

** Why are these changes being introduced:

* The application needs a way to record the outcome of the application's
  Detectors when they review the Terms we receive. These outcomes will
  be needed in a subsequent step, the application tries to place these
  Terms into the various Categories.

** Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/tco-33

** How does this address that need:

This defines a Detections model, which is a linkage between a Term and
a Detector. In order to allow the application's behavior to improve over
time, we also define a DETECTOR_VERSION value in the environment, which
will be updated as the application becomes more capable.

Detections are the unique combination of a Term, a Detector, for a given
DETECTOR_VERSION.

The Detection model includes several scopes, which are consulted during
the before_create lifecycle hook to prevent duplicate records from being
created. I've attempted to write tests to indicate how this works.

** Document any side effects to this change:

This change deviates from our earlier plans in two notable ways:

* The confidence value which we originally intended to store with the
  Detection has been abandoned. Because a Detector might be linked
  with multiple Categories, each with a differing confidence value,
  there is no way to record a Term-Detector confidence value without
  losing information. The confidence values will still be used during
  Categorization, but the values cannot be cached in the Detection
  records.

* The various record_* methods are no longer defined as part of the
  Detection class. Because a single Term will be joined to multiple
  Detection records, it made more sense for me to define the record_*
  methods on the Term class instead.
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-107 September 19, 2024 14:48 Inactive
@matt-bernhardt matt-bernhardt marked this pull request as ready for review September 19, 2024 15:01
app/models/detection.rb Outdated Show resolved Hide resolved
app/models/term.rb Outdated Show resolved Hide resolved
Jeremy pointed out during code review that it might be worth using
find_or_create_by in the Term methods, rather than a combination of the
create method and a guard method in the Detection model.

This makes that change, adopting find_or_create_by inside Term and
removing the check_for_current method from the Detection model.

Along the way we update the specific assertion being looked for in the
Detection tests. Please note that there are tests in the Term model that
confirm the record_* methods an be called repeatedly without either
creating duplicate records, or throwing errors.
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-107 September 19, 2024 19:07 Inactive
@matt-bernhardt
Copy link
Member Author

Okay, @JPrevost - I think this is ready for another review. I've elected to keep the scopes in place - I think there's going to be a clear need for the current and for_term scopes during categorization. I'm less certain about the for_detector scope, but the symmetry with for_term makes a certain sense to me at this point.

@JPrevost JPrevost self-assigned this Sep 19, 2024
Copy link
Member

@JPrevost JPrevost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I believe this works. I have two comments (one small and one potential refactor). Neither are blocking so I am approving. If you choose to make any changes please dismiss my review when you are ready for a re-review.

app/models/term.rb Outdated Show resolved Hide resolved
app/models/term.rb Outdated Show resolved Hide resolved
app/models/detection.rb Show resolved Hide resolved
app/models/detection.rb Outdated Show resolved Hide resolved
app/models/detector.rb Outdated Show resolved Hide resolved
test/models/detection_test.rb Outdated Show resolved Hide resolved
- Explicitly return nil in Term methods where no return is needed
- Update code comments for Detection and Detector classes
- Use ClimateControl to modify environment when testing
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-107 September 19, 2024 22:13 Inactive
** Why are these changes being introduced:

The original proposal for implementing the Detection workflow involved
a series of record_* methods in the Term model, which then called
methods across the application to coordinate the creation of needed
records.

During code review, a more modular arrangement was suggested, where the
record_* methods would be distributed in the Detector subclasses, where
the work was actually being done. Those subclasses would then each
manage their own Detection record creation.

** Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/tco-33

** How does this address that need:

This makes that change, moving the work of recording Detections to the
Detector subclasses. The tests are refactored as well, to match this
arrangement.

** Document any side effects to this change:

Some of the record methods feel very close to being repetetive, which
makes me wonder whether we should add them to the base Detector class,
and allow overrides when a specific detector needs a different logic
(for example, the Journal and SuggestedResource look up against a list,
while StandardIdentifiers might provide its own regex-based record
method).
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-107 September 20, 2024 15:11 Inactive
Copy link
Member

@JPrevost JPrevost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. record will likely always hurt my brain as I'm also thinking of that as a noun and not a verb. Hopefully we get used to it and if not we can change it.

@matt-bernhardt matt-bernhardt merged commit 9fca519 into main Sep 20, 2024
3 checks passed
@matt-bernhardt matt-bernhardt deleted the tco-33 branch September 20, 2024 18:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants