Spike: Review if inability to harvest all records from certain repositories will be or has been resolved when other harvesting-related issues are addressed #92

Closed
jggautier opened this issue Nov 16, 2020 · 18 comments
Labels: bug (Something isn't working); Feature: Harvesting; NIH GREI (General work related to any of the NIH GREI aims)

Comments

@jggautier
Collaborator

According to the superuser dashboard, when Harvard Dataverse tries to harvest from 6 clients, it fails to harvest a number of datasets. These are scheduled runs using dataverse_json as the metadata format. Here are details for two:

| client | last run | last results |
|---|---|---|
| cifor | Sun Nov 15 05:00:00 EST 2020 | SUCCESS; 0 harvested, 0 deleted, 117 failed |
| icarda | Sat Nov 14 02:00:00 EST 2020 | SUCCESS; 0 harvested, 0 deleted, 215 failed |

Harvard Dataverse users would like to link datasets from the CIFOR repository into their dataverse, but we're not able to until Harvard Dataverse is able to harvest the newer datasets. (See https://help.hmdc.harvard.edu/Ticket/Display.html?id=295542)
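
The "last results" strings in the table above report SUCCESS even though nothing was harvested, so the status word alone is not a useful signal. As a minimal sketch for anyone triaging such strings, the following parses the counts and flags runs that harvested nothing. The function names and the regular expression are illustrative assumptions based only on the two strings quoted here, not part of Dataverse.

```python
import re

# Pattern for the dashboard's "last results" text, e.g.
# "SUCCESS; 0 harvested, 0 deleted, 117 failed".
# Assumption: the format is inferred from the two examples in this issue.
RESULT_RE = re.compile(
    r"^(?P<status>\w+); (?P<harvested>\d+) harvested, "
    r"(?P<deleted>\d+) deleted, (?P<failed>\d+) failed$"
)


def parse_last_result(text):
    """Return the status word and the three counts from one result string."""
    match = RESULT_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized result string: {text!r}")
    counts = {key: int(match[key]) for key in ("harvested", "deleted", "failed")}
    return {"status": match["status"], **counts}


def harvested_nothing(result):
    """True when a run harvested no records but failed on at least one."""
    return result["harvested"] == 0 and result["failed"] > 0


results = {
    "cifor": parse_last_result("SUCCESS; 0 harvested, 0 deleted, 117 failed"),
    "icarda": parse_last_result("SUCCESS; 0 harvested, 0 deleted, 215 failed"),
}
for client, result in results.items():
    print(client, result["failed"], "failed; harvested nothing:", harvested_nothing(result))
```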

@jggautier jggautier added the bug Something isn't working label Nov 16, 2020
@djbrooke djbrooke self-assigned this Nov 18, 2020
@djbrooke
Contributor

@jggautier - I'm hoping that whatever resolves IQSS/dataverse#7398 resolves this as well. I can look in the logs and see the timers being set:

Setting timer for harvesting client cifor, initial expiration: Sun Nov 22 05:00:00 EST 2020]]

@jggautier
Collaborator Author

jggautier commented Nov 18, 2020

The harvests are running when they're set to run, but Dataverse reports that dataset metadata failed to be retrieved. This also happens when I try to run the harvest "manually".

@jggautier jggautier changed the title from "Make scheduled harvests succeed again" to "Make harvests retrieve all datasets" Nov 18, 2020
@jggautier jggautier changed the title from "Make harvests retrieve all datasets" to "Make harvests retrieve all dataset metadata" Nov 19, 2020
@djbrooke
Contributor

@jggautier Got it. I'm seeing this in the logs, for CIFOR at least:

[SEVERE] [] [] [tid: _ThreadID=399 _ThreadName=__ejb-thread-pool16] [timeMillis: 1605434485070] [levelValue: 1000] [[
edu.harvard.iq.dataverse.api.imports.ImportException: Failed to import harvested dataset: class edu.harvard.iq.dataverse.util.json.ControlledVocabularyException (Value 'Climate Change, Energy and low carbon development (CCE)' does not exist in type 'subject')

It looks like they've customized the subjects at https://data.cifor.org/dataverse which I assume is leading to the failure?
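
To illustrate the kind of check that appears to be failing, here is a minimal sketch that tests subject values against the default subject vocabulary. The list of 14 terms is written from memory and should be verified against the citation metadata block of the installation in question; the function name is hypothetical, and this is not how Dataverse itself implements the validation.

```python
# Subject terms believed to ship with Dataverse.
# Assumption: listed from memory; verify against your installation's
# citation metadata block before relying on it.
DEFAULT_SUBJECTS = frozenset({
    "Agricultural Sciences",
    "Arts and Humanities",
    "Astronomy and Astrophysics",
    "Business and Management",
    "Chemistry",
    "Computer and Information Science",
    "Earth and Environmental Sciences",
    "Engineering",
    "Law",
    "Mathematical Sciences",
    "Medicine, Health and Life Sciences",
    "Physics",
    "Social Sciences",
    "Other",
})


def unknown_subjects(subjects):
    """Return the subject values that are not in the default vocabulary."""
    return [subject for subject in subjects if subject not in DEFAULT_SUBJECTS]


# The value from the CIFOR log entry above is rejected; "Other" is accepted.
sample = ["Other", "Climate Change, Energy and low carbon development (CCE)"]
print(unknown_subjects(sample))
```

A dataset whose subjects all pass this check would not be expected to trigger the ControlledVocabularyException shown in the log, although other fields could still fail for different reasons.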

@djbrooke
Contributor

@jggautier

Not sure what's going on with ICARDA, but I am seeing a lot of this in the logs:

[SEVERE] [] [] [tid: _ThreadID=393 _ThreadName=__ejb-thread-pool10] [timeMillis: 1605337269685] [levelValue: 1000] [[
edu.harvard.iq.dataverse.harvest.client.oai.OaiHandlerException: IOException executing GetRecord: Failed to download extended metadata.

ICARDA is on 4.14, and I know there have been a lot of improvements in harvesting in recent releases. Are we seeing these failures when harvesting from more recently updated Dataverse installations?
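
Since the CIFOR and ICARDA failures have different causes, it may help to tally the exception classes in the server log before looking at individual records. The following is a rough sketch that assumes each exception appears at the start of its own line, as in the two excerpts quoted in this thread; the regular expression and function name are illustrative and not part of Dataverse.

```python
import re
from collections import Counter

# Matches a line that starts with a fully qualified exception class name,
# followed by ": " and a message.
# Assumption: this layout is inferred from the log excerpts in this issue.
EXCEPTION_RE = re.compile(r"^(?P<cls>(?:\w+\.)+\w*Exception): (?P<msg>.*)$")


def tally_exceptions(log_text):
    """Count log lines by the short name of the exception class they start with."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EXCEPTION_RE.match(line.strip())
        if match:
            short_name = match["cls"].rsplit(".", 1)[-1]
            counts[short_name] += 1
    return counts


# The two exception lines quoted in this thread.
sample_log = """\
edu.harvard.iq.dataverse.api.imports.ImportException: Failed to import harvested dataset: class edu.harvard.iq.dataverse.util.json.ControlledVocabularyException (Value 'Climate Change, Energy and low carbon development (CCE)' does not exist in type 'subject')
edu.harvard.iq.dataverse.harvest.client.oai.OaiHandlerException: IOException executing GetRecord: Failed to download extended metadata.
"""
for name, count in tally_exceptions(sample_log).most_common():
    print(name, count)
```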

@jggautier
Collaborator Author

jggautier commented Nov 19, 2020

Unfortunately no (or fortunately). All of the failing harvests are from repositories running 4.20 or earlier:

  • CIFOR
  • ICARDA
  • Scholars Portal
  • Data INRAE
  • Libre Data (U of V)

In case this helps, too, Harvard Dataverse was able to harvest about 3600 of Data INRAE's datasets. One dataset is using "Other" for Subject, which is one of the 14 "Subject" terms Dataverse ships with, and it looks like the rest of the harvested datasets are using Subject terms that don't ship with Dataverse, like the dataset at https://doi.org/10.15454/1.4938215048007249E12.

@jggautier jggautier changed the title from "Make harvests retrieve all dataset metadata" to "Make harvests retrieve all dataset records" Mar 15, 2021
@jggautier
Collaborator Author

jggautier commented Nov 17, 2022

Just updating this issue. @philippconzett reported that the metadata that Harvard Dataverse harvests from DataverseNO is out of date. The superuser's harvesting clients page (e.g. https://dataverse.harvard.edu/harvestclients.xhtml?dataverseId=1) continues to report - after each scheduled attempt - that it's failing to get or update records from 15 harvesting jobs, including one from DataverseNO.

Here's a list I made of harvesting jobs that are failing. I excluded things on the harvesting clients page that aren't being updated weekly, like ICPSR (#63)

[Screenshot, 2022-11-17: list of failing harvesting jobs]

OAI-PMH harvesting in Dataverse is being improved, but I'm not sure how the planned improvements will affect these failing harvesting jobs.

@sbarbosadataverse

What's the likelihood that this issue will be fixed by the harvesting updates in progress? @mreekie @siacus
We don't want to add this to the Dataverse Backlog for Harvard Dataverse if it may get fixed by the harvesting updates.

Thanks

@jggautier
Collaborator Author

jggautier commented Dec 11, 2023

Just an update: Harvard Dataverse is now harvesting from Recherche Data Gouv, which has replaced Data INRAE. See #236.

That harvesting job got only 37 of 2325 records.

@cmbz cmbz added pm.GREI-d-2.4.1 NIH, yr2, aim4, task1: Implement packaging standards based on working group feedback Feature: Harvesting NIH GREI General work related to any of the NIH GREI aims and removed pm.GREI-d-2.4.1 NIH, yr2, aim4, task1: Implement packaging standards based on working group feedback labels Dec 18, 2023
@cmbz
Collaborator

cmbz commented Dec 18, 2023

2023/12/18

  • Create spike to investigate whether or not the use cases here will be or have been resolved when other Harvest-related issues are addressed.
  • @jggautier will create the spike

@jggautier jggautier changed the title from "Make harvests retrieve all dataset records" to "Spike: Make harvests retrieve all dataset records" Dec 19, 2023
@jggautier jggautier changed the title from "Spike: Make harvests retrieve all dataset records" to "Spike: Review if inability to harvest all records from certain repositories will be or has been resolved when other Harvest-related issues are addressed" Dec 19, 2023
@jggautier
Collaborator Author

Thanks @cmbz, I renamed this issue like we decided over Slack.

@jggautier jggautier changed the title from "Spike: Review if inability to harvest all records from certain repositories will be or has been resolved when other Harvest-related issues are addressed" to "Spike: Review if inability to harvest all records from certain repositories will be or has been resolved when other harvesting-related issues are addressed" Dec 20, 2023
@jggautier
Collaborator Author

jggautier commented Jan 2, 2024

What I wrote earlier in this GitHub issue - about cases where Harvard Dataverse isn't able to harvest all records from certain Dataverse repositories - is three years old now. So I wonder if this spike should include:

  1. Someone creating a new list of the Dataverse repositories Harvard Dataverse isn't able to harvest completely
  2. Someone trying to re-harvest from these installations using the DDI Codebook metadata instead of the Dataverse JSON metadata. This is what @landreev proposed in "Re-harvest from Borealis Repository" (#172), the issue about re-harvesting metadata from Borealis
  3. Someone taking a closer look at why all records aren't harvested when DDI-C metadata is used (if harvesting DDI-C metadata doesn't work)
  4. Someone evaluating how harvesting DDI-C metadata might affect data discovery, since DDI-C metadata will never include all metadata, including metadata from fields that Dataverse ships with

@cmbz and @landreev, does that all make sense? What do you think? I could help with 1, 2 and 4, and maybe 3, though it's been tough for me to tell why harvesting has failed and I could use help with figuring out why.
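
For step 2, the difference between the two harvests comes down to the `metadataPrefix` argument of the OAI-PMH request. The sketch below only builds the request URLs so the two can be compared by hand; it makes no network calls. The base URL is a hypothetical placeholder, and `oai_ddi` is assumed to be the prefix a Dataverse OAI server uses for DDI Codebook (`dataverse_json` is the format named at the top of this issue). Both should be confirmed with a `ListMetadataFormats` request against the remote repository.

```python
from urllib.parse import urlencode

# Hypothetical OAI-PMH endpoint; replace with the remote repository's own.
BASE_URL = "https://repository.example.org/oai"


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL from a verb and its arguments."""
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Ask the server which metadata formats it actually offers.
formats_url = oai_request_url(BASE_URL, "ListMetadataFormats")

# The same listing request with each of the two formats discussed above.
# Assumption: "oai_ddi" is the DDI Codebook prefix on the remote server.
ddi_url = oai_request_url(BASE_URL, "ListIdentifiers", metadataPrefix="oai_ddi")
json_url = oai_request_url(BASE_URL, "ListIdentifiers", metadataPrefix="dataverse_json")

for url in (formats_url, ddi_url, json_url):
    print(url)
```

Comparing the number of identifiers each listing returns with the number of records that end up harvested would be one way to tell listing problems apart from import failures.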

@cmbz
Collaborator

cmbz commented Jan 16, 2024

@jggautier I think that the three of us (at a minimum) should get together to discuss how best to rescope, redefine, and consolidate recommendations for work needed for the harvesting feature, overall. Currently, these conversations and ideas are scattered across multiple issues and spikes making it difficult, imo, to keep tabs on them all.

The fact that we're approaching GREI Year 3 also (see the related Search & Browse project management epic: IQSS/dataverse-pm#117) makes it a good time to review our work and update our plans.

Your thoughts?

Which other core team members should be involved in this conversation and effort? @landreev please also share your thoughts.

@jggautier
Collaborator Author

Getting together to discuss sounds good to me!

@cmbz
Collaborator

cmbz commented Feb 5, 2024

2024/02/05
Status set to waiting while we organize a meeting to discuss.

@cmbz cmbz moved this to Waiting ⌛ in IQSS Dataverse Project Feb 5, 2024
@landreev
Collaborator

landreev commented Feb 5, 2024

I'm happy to participate in discussions on this. I'm not sure there is any better response to this than to continue addressing the known problems one at a time. There may not be much potential for addressing them in any definable chunks; unfortunately, we may not be able to say "once this issue is addressed, we'll be able to close this whole class of open issues: ...". I may be wrong of course, but it seems like with every problematic remote archive we've looked into, there was some new and unique issue discovered that needed to be fixed.

@cmbz
Collaborator

cmbz commented Mar 12, 2024

2024/03/13
@landreev and @jggautier Should we close this issue now that we've discussed next steps in our meeting today?

@jggautier
Collaborator Author

Yes I think so.

@cmbz
Collaborator

cmbz commented Mar 12, 2024

2024/03/13

@cmbz cmbz closed this as completed Mar 12, 2024
@cmbz cmbz moved this from Waiting ⌛ to Done 🧹 in IQSS Dataverse Project Mar 12, 2024