Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Harvest: map publisher tag to distributorName #9013

Open
wants to merge 14 commits into
base: develop
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/release-notes/8739-publisher-during-harvesting.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
The publisher value of harvested datasets is now attributed to the dataset's distributor instead of its producer. This change affects all newly harvested datasets. For more information, see #8739
plecor marked this conversation as resolved.
Show resolved Hide resolved
5 changes: 5 additions & 0 deletions doc/sphinx-guides/source/admin/harvestclients.rst
Original file line number Diff line number Diff line change
Expand Up @@ -47,3 +47,8 @@ What if a Run Fails?
Each harvesting client run logs a separate file per run to the app server's default logging directory (``/usr/local/payara6/glassfish/domains/domain1/logs/`` unless you've changed it). Look for filenames in the format ``harvest_TARGET_YYYY_MM_DD_timestamp.log`` to get a better idea of what's going wrong.

Note that you'll want to run a minimum of Dataverse Software 4.6, optimally 4.18 or beyond, for the best OAI-PMH interoperability.

Harvesting Client Changelog
---------------------------

- As of Dataverse 6.3, the publisher value of harvested datasets is now attributed to the dataset's distributor instead of its producer. This change affects all newly harvested datasets. For more information, see https://github.com/IQSS/dataverse/pull/9013
plecor marked this conversation as resolved.
Show resolved Hide resolved
pdurbin marked this conversation as resolved.
Show resolved Hide resolved
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
update foreignmetadatafieldmapping set datasetfieldname = 'distributorName' where foreignfieldxpath = ':publisher';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about datasets that are already harvested? The docs say, "This change affects all newly harvested datasets."

The plan is to have a mix?

Or should we add to the release note that all datasets should be re-harvested? Is that crazy?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, the fix isn't retroactive, unless all datasets are re-harvested. This is not great but I have no idea how we could do this another way.

Maybe we could only suggest re-harvesting only if people are affected by/care about the discrepancy it fixes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@plecor for now, can you please add something to the release note. "All datasets should be re-harvested to pick up this change." Or whatever makes sense to you. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. To reflect this, the full text of the release note is now:

The publisher value of harvested datasets is now attributed to the dataset's distributor instead of its producer. This improves the citation associated with these datasets, but the change only affects newly harvested datasets. All datasets should be re-harvested if you wish to pick up this change on already harvested datasets. For more information, see the guides, #8739, and #9013.

Original file line number Diff line number Diff line change
Expand Up @@ -299,6 +299,11 @@ private void harvestingClientRun(boolean allowHarvestingMissingCVV) throws Inte
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it's been so long that https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-9013/7/display/redirect shows a failure but that job is now a 404 so I can't see any details.

After you merge the latest from develop let's keep an eye on the new Jenkins run.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still seeing failures. I just kicked off another run. Fingers crossed: https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-9013/12/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last time all checks were green except for "continuous-integration/jenkins/pr-merge " that was pending. I'm not clear on what it is doing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! Now that the app is being deployed following the SQL script renaming, I can see that a couple harvesting-related tests are failing:

  • edu.harvard.iq.dataverse.api.HarvestingClientsIT.testHarvestingClientRun_AllowHarvestingMissingCVV_False(HarvestingClientsIT.java:187)
  • edu.harvard.iq.dataverse.api.HarvestingClientsIT.testHarvestingClientRun_AllowHarvestingMissingCVV_True(HarvestingClientsIT.java:191)

@plecor can you please take a look? Do you need help with how to run these tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @pdurbin ,

I've re-run the tests from scratch and I wonder if it could be a chicken and egg situation. The new assertion fails if I run the tests from this branch because the sql migration file it's adding hasn't been run. If I manually update the db then the tests run fine. Could it be something similar going on with Jenkins?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, the issue is that in a new install, the corresponding data is initialized by afterMigrate__1-7256-upsert-referenceData.sql after the other migrations, so the data we're trying to update doesn't exist yet.

I updated this file with the same change as the sql migration file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Thanks.

@poikilotherm judging from https://github.com/IQSS/dataverse/blame/c8499ba9553ac46cf3adc56d1b9e56f0c781d30f/src/main/resources/db/migration/afterMigrate__1-7256-upsert-referenceData.sql you added that "after migrate" file. What do you think? Any risk in changing it? 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qqmyers took a look and had this to say:

"that afterMigrate script appears to run after everything else, every time you restart, rather than being one of the scripts tracked by hash value in the flyway_schema_history table. So - it looks OK to me to update it. (Mostly from looking at https://documentation.red-gate.com/fd/callback-concept-184127466.html and related and verifying that I don't see it in the flyway_schema_history table)."

// verify count after collecting global ids
assertEquals(expectedNumberOfSetsHarvested, jsonPath.getInt("data.total_count"));

// ensure the publisher name is present in the harvested dataset citation
Response harvestedDataverse = given().get(ARCHIVE_URL + "/api/dataverses/1");
String harvestedDataverseName = harvestedDataverse.getBody().jsonPath().getString("data.name");
assertTrue(jsonPath.getString("data.items[0].citation").contains(harvestedDataverseName));

// Fail if it hasn't completed in maxWait seconds
assertTrue(i < maxWait);
Expand Down
Loading