Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Harvest: map publisher tag to distributorName #9013

Open
wants to merge 14 commits into
base: develop
Choose a base branch
from

Conversation

tcoupin
Copy link
Member

@tcoupin tcoupin commented Oct 3, 2022

What this PR does / why we need it:

Which issue(s) this PR closes:

@mreekie mreekie added the bk2211 label Nov 1, 2022
@mreekie mreekie removed the bk2211 label Jan 11, 2023
@pdurbin pdurbin added the Size: 10 A percentage of a sprint. 7 hours. label Feb 28, 2024
@scolapasta
Copy link
Contributor

If you are still interested in this PR, can you please merge and resolve any merge conflicts with the latest from develop? If so, we can prioritize reviewing and QAing the changes. If we don’t hear from you by May 22, 2024, we’ll go ahead and close this PR (it can always be reopened after that date, if there is still interest).

@plecor plecor force-pushed the 8739-publisher-during-harvesting branch from 95d9c69 to d92d048 Compare May 3, 2024 08:10
@plecor
Copy link
Contributor

plecor commented May 3, 2024

The branch is now up to date with develop and the sql migration file was renamed to match the latest version - please tell me if there is any convention I'm missing.

@pdurbin
Copy link
Member

pdurbin commented May 3, 2024

@plecor can you please add a release note snippet that describes what this pull request does? Please see https://guides.dataverse.org/en/6.2/developers/version-control.html#writing-release-note-snippets

Also, it would be nice to add something to the guides but I'm not sure where. Maybe we should add a changelog to https://guides.dataverse.org/en/6.2/admin/harvestserver.html ? Would that make sense?

Finally, tests are always nice but not strictly required. There are some in HarvestingServerIT.java.

@plecor
Copy link
Contributor

plecor commented May 14, 2024

Thanks @pdurbin.

I created a release note and added the relevant test to HarvestingClientsIT.
I'm really not sure however where this would fit in the docs.

@pdurbin
Copy link
Member

pdurbin commented May 14, 2024

@plecor I went ahead and added a new changelog for harvesting clients: https://dataverse-guide--9013.org.readthedocs.build/en/9013/admin/harvestclients.html#harvesting-client-changelog

In review we'll see what others think.

@coveralls
Copy link

coveralls commented May 14, 2024

Coverage Status

coverage: 22.572%. remained the same
when pulling c8499ba on tcoupin:8739-publisher-during-harvesting
into a36db2d on IQSS:develop.

@pdurbin pdurbin added the Type: Feature a feature request label Oct 9, 2024
@cmbz cmbz added GREI 3 Search and Browse Feature: Harvesting FY25 Sprint 11 FY25 Sprint 11 (2024-11-20 - 2024-12-04) labels Nov 18, 2024
@pdurbin pdurbin self-assigned this Nov 25, 2024
Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some feedback.

doc/sphinx-guides/source/admin/harvestclients.rst Outdated Show resolved Hide resolved
@@ -299,6 +299,11 @@ private void harvestingClientRun(boolean allowHarvestingMissingCVV) throws Inte
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it's been so long that https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-9013/7/display/redirect shows a failure but that job is now a 404 so I can't see any details.

After you merge the latest from develop let's keep an eye on the new Jenkins run.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still seeing failures. I just kicked off another run. Fingers crossed: https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-9013/12/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last time all checks were green except for "continuous-integration/jenkins/pr-merge " that was pending. I'm not clear on what it is doing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! Now that the app is being deployed following the SQL script renaming, I can see that a couple harvesting-related tests are failing:

  • edu.harvard.iq.dataverse.api.HarvestingClientsIT.testHarvestingClientRun_AllowHarvestingMissingCVV_False(HarvestingClientsIT.java:187)
  • edu.harvard.iq.dataverse.api.HarvestingClientsIT.testHarvestingClientRun_AllowHarvestingMissingCVV_True(HarvestingClientsIT.java:191)

@plecor can you please take a look? Do you need help with how to run these tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @pdurbin ,

I've re-run the tests from scratch and I wonder if it could be a chicken and egg situation. The new assertion fails if I run the tests from this branch because the sql migration file it's adding hasn't been run. If I manually update the db then the tests run fine. Could it be something similar going on with Jenkins?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, the issue is that in a new install, the corresponding data is initialized by afterMigrate__1-7256-upsert-referenceData.sql after the other migrations, so the data we're trying to update doesn't exist yet.

I updated this file with the same change as the sql migration file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Thanks.

@poikilotherm judging from https://github.com/IQSS/dataverse/blame/c8499ba9553ac46cf3adc56d1b9e56f0c781d30f/src/main/resources/db/migration/afterMigrate__1-7256-upsert-referenceData.sql you added that "after migrate" file. What do you think? Any risk in changing it? 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qqmyers took a look and had this to say:

"that afterMigrate script appears to run after everything else, every time you restart, rather than being one of the scripts tracked by hash value in the flyway_schema_history table. So - it looks OK to me to update it. (Mostly from looking at https://documentation.red-gate.com/fd/callback-concept-184127466.html and related and verifying that I don't see it in the flyway_schema_history table)."

doc/release-notes/8739-publisher-during-harvesting.md Outdated Show resolved Hide resolved
@@ -0,0 +1 @@
update foreignmetadatafieldmapping set datasetfieldname = 'distributorName' where foreignfieldxpath = ':publisher';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about datasets that are already harvested? The docs say, "This change affects all newly harvested datasets."

The plan is to have a mix?

Or should we add to the release note that all datasets should be re-harvested? Is that crazy?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, the fix isn't retroactive, unless all datasets are re-harvested. This is not great but I have no idea how we could do this another way.

Maybe we could only suggest re-harvesting only if people are affected by/care about the discrepancy it fixes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@plecor for now, can you please add something to the release note. "All datasets should be re-harvested to pick up this change." Or whatever makes sense to you. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. To reflect this, the full text of the release note is now:

The publisher value of harvested datasets is now attributed to the dataset's distributor instead of its producer. This improves the citation associated with these datasets, but the change only affects newly harvested datasets. All datasets should be re-harvested if you wish to pick up this change on already harvested datasets. For more information, see the guides, #8739, and #9013.

@cmbz cmbz added the FY25 Sprint 12 FY25 Sprint 12 (2024-12-04 - 2024-12-18) label Dec 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature: Harvesting FY25 Sprint 11 FY25 Sprint 11 (2024-11-20 - 2024-12-04) FY25 Sprint 12 FY25 Sprint 12 (2024-12-04 - 2024-12-18) GREI 3 Search and Browse Size: 10 A percentage of a sprint. 7 hours. Type: Feature a feature request
Projects
Status: Ready for QA ⏩
Development

Successfully merging this pull request may close these issues.

Publisher tag in oai_dc format
8 participants