Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basic Duplicate Detection #2749

Merged
merged 23 commits into from
Mar 4, 2024

Conversation

kshepherd
Copy link
Member

@kshepherd kshepherd commented Jan 18, 2024

References

The initial work on this feature was supported by TU Berlin. This work was supported by FHNW and ZHAW.

Description

This feature introduces basic Duplicate Detection into DSpace submission and item links, using Solr's ability to search by levenshtein distance.

A new solr field item_signature is indexed for in-progress and archived items, with a normalised version of the title to allow it to be treated as a single term using the ~n operator, where n is a configurable edit distance.

The potential duplicates detected by this search will be displayed to submitters or reviewers as a submission step, for their information (they could then choose to ignore the report if they know this is a new different item, or cancel their submission if there is a legitimate duplicate already present).

If a potential duplicate is in workspace, only the submitter of an item can see this information.

Workflow reviewers are given a small hint indicating how many potential duplicates are detected for a pooled or claimed task, and directing them to the 'Edit' (form) where the step will show details. If the potential duplicate is in workflow, only a reviewer or admin will be able to see this information.

If versioning is used, only the latest version is included in comparison.

Instructions for Reviewers

  • Make sure you are using the backend PR Basic Duplicate Detection in submission and workflow DSpace#9265
  • Configure submission to include the new duplicate step
  • Enable duplication in dspace/config/modules/duplicate-detection.cfg and configure as desired
  • Add items with the exact same (or play around with edit distance) titles, and see what shows up in the Duplicate section

Checklist

This checklist provides a reminder of what we are going to look for when reviewing your PR. You need not complete this checklist prior to creating your PR (draft PRs are always welcome). If you are unsure about an item in the checklist, don't hesitate to ask. We're here to help!

  • My PR is small in size (e.g. less than 1,000 lines of code, not including comments & specs/tests), or I have provided reasons as to why that's not possible.
  • My PR passes ESLint validation using yarn lint
  • My PR doesn't introduce circular dependencies (verified via yarn check-circ-deps)
  • My PR includes TypeDoc comments for all new (or modified) public methods and classes. It also includes TypeDoc for large or complex private methods.
  • My PR passes all specs/tests and includes new/updated specs or tests based on the Code Testing Guide.
  • If my PR includes new features or configurations, I've provided basic technical documentation in the PR itself.
  • If my PR fixes an issue ticket, I've linked them together.

@tdonohue tdonohue requested a review from MW3000 February 1, 2024 15:48
Copy link
Contributor

@MW3000 MW3000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works as intended also using the new search endpoint.

@kshepherd
Copy link
Member Author

@MW3000 thank you for the further review. I have fixed the issues identified.

Copy link

Hi @kshepherd,
Conflicts have been detected against the base branch.
Please resolve these conflicts as soon as you can. Thanks!

@pnbecker pnbecker force-pushed the DSpace_detect_duplicates_PR branch from 3a34a16 to a362c6b Compare February 15, 2024 15:14
Copy link
Contributor

@atarix83 atarix83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kshepherd for this PR.

I've reviewed this PR and found out this is not ported as is from implemented on DSpace-CRIS.

I agree to keep it simple than the most comprehensive functionality that is present on DSpace-CRIS, but i've couple of suggestions i'd like to be ported here in this PR:

Moreover in console there is a repetitive error which occurs when the submission is deposited due to the detect duplicate step

Schermata del 2024-02-15 19-55-21

@pnbecker
Copy link
Member

@atarix83 thank you for the review! Yes, this isn‘t a port anymore. As the port didn‘t got reviewed for several DSpace versions, we reimplemented this to get much smaller PRs. We hoped to get it reviewed this time by reducing the PR to roughly 20% of the original size.

@pnbecker
Copy link
Member

@atarix83 we addressed your feedback. Can you please confirm?

@pnbecker pnbecker requested a review from atarix83 February 19, 2024 06:02
@kshepherd
Copy link
Member Author

@atarix83 thanks for your review, I have addressed your feedback as below:

  1. Warning alert panel for "potential duplicates" message in submission section, claimed task and pooled task

screenshot_2024-02-19_16:49:02_selection

  1. Section data is not assumed to exist in the component template so those errors reading from missing section data will not occur
  2. I made the default behaviour of the submission section to hide the duplicate section (at form load and after parsing save response, using very similar approach to the CRIS code you linked) when there are no duplicates to display.

However, if the angular env config submission.duplicateDetection.alwaysShowSection is set to true (default false) then the section will show the "no duplicates" message, in case that fits the user requirements. The message is displayed in an info alert panel:

screenshot_2024-02-19_16:48:52_selection
(ignore that black border, that is my screenshot utility, not the actual web display)

Copy link
Contributor

@atarix83 atarix83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pnbecker @kshepherd

i'm sorry i didn't realize that the rest part is fully changed and my thought was that only the Angular interface has been simplified.

also based on what @abollini pointed out in this comment DSpace/RestContract#252 (review) I'd not approve this PR.

There are any changes the old porting PR is resumed?

@tdonohue tdonohue self-requested a review February 23, 2024 14:41
@kshepherd
Copy link
Member Author

kshepherd commented Mar 3, 2024

force pushed a large rebase (lots of conflicts with modules now that other submission PRs are pushed and this one has so many commits), and includes changes to use searchBy endpoint again as per last reversion/change in backend PR

@kshepherd kshepherd force-pushed the DSpace_detect_duplicates_PR branch from 5f1e082 to 1206d61 Compare March 3, 2024 22:52
@pnbecker pnbecker requested a review from tdonohue March 4, 2024 01:36
@kshepherd
Copy link
Member Author

@tdonohue although code cov looked ok after i made my last changes, i also added new tests for the data service and the object reducer / actions, to keep things in line with general good practice and get coverage up a bit more, hope that works well

Copy link
Member

@tdonohue tdonohue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Thank you @kshepherd ! I retested this today and it works perfectly from the Submission form & the Workflow tasks page. I've also verified that errors no longer appear in DevTools when this feature is disabled. The new specs/tests also look good & bring this up to our CodeCov requirements

@tdonohue tdonohue dismissed atarix83’s stale review March 4, 2024 15:28

Verified feedback has been addressed

@tdonohue tdonohue merged commit f6f3962 into DSpace:main Mar 4, 2024
13 checks passed
@tdonohue
Copy link
Member

tdonohue commented Mar 4, 2024

Flagging as needs documentation until Documentation is added for this new feature to the pre-8.0 docs: https://wiki.lyrasis.org/display/DSDOC8x/Submission+User+Interface

@tdonohue tdonohue added the needs documentation PR is missing documentation. All new features and config changes require documentation. label Mar 4, 2024
@tdonohue
Copy link
Member

Removing "needs documentation" flag as docs now exist in https://wiki.lyrasis.org/pages/viewpage.action?pageId=328958055

@tdonohue tdonohue removed the needs documentation PR is missing documentation. All new features and config changes require documentation. label May 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

6 participants