Spike: Review if inability to harvest all records from certain repositories will be or has been resolved when other harvesting-related issues are addressed #92
Comments
@jggautier - I'm hoping that whatever resolves IQSS/dataverse#7398 resolves this as well. I can look in the logs and see the timers being set:
The harvests are running when they're set to run, but Dataverse reports that dataset metadata failed to be retrieved. This also happens when I try to run the harvest "manually".
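To rule out the OAI-PMH side, one quick check is to hit the remote repository's endpoint directly and see whether ListRecords succeeds at all, outside of the harvesting client. A rough sketch, assuming the remote installation exposes the standard Dataverse OAI-PMH endpoint at /oai and supports the oai_dc prefix; the base URL below is a placeholder:

```python
# Minimal sketch: probe a remote Dataverse installation's OAI-PMH endpoint
# to see whether ListRecords succeeds outside of the harvesting client.
# Assumes the standard Dataverse OAI-PMH endpoint at <site>/oai and the
# oai_dc metadata prefix; the base URL below is a placeholder.
import requests
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
base_url = "https://data.example.org/oai"  # placeholder remote repository

resp = requests.get(
    base_url,
    params={"verb": "ListRecords", "metadataPrefix": "oai_dc"},
    timeout=60,
)
resp.raise_for_status()
root = ET.fromstring(resp.content)

# An <error> element means the endpoint itself rejected the request
# (badArgument, noRecordsMatch, etc.) before any per-record failures.
error = root.find(f"{OAI_NS}error")
if error is not None:
    print("OAI-PMH error:", error.get("code"), (error.text or "").strip())
else:
    records = root.findall(f"{OAI_NS}ListRecords/{OAI_NS}record")
    print(f"First page returned {len(records)} records")
```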
@jggautier Got it. I'm seeing this in the logs, for CIFOR at least: [SEVERE] [] [] [tid: _ThreadID=399 _ThreadName=__ejb-thread-pool16] [timeMillis: 1605434485070] [levelValue: 1000] [[ It looks like they've customized the subjects at https://data.cifor.org/dataverse, which I assume is leading to the failure?
Not sure what's going on with ICARDA, but I am seeing a lot of this in the logs: [SEVERE] [] [] [tid: _ThreadID=393 _ThreadName=__ejb-thread-pool10] [timeMillis: 1605337269685] [levelValue: 1000] [[ ICARDA is on 4.14, and I know there have been a lot of improvements in harvesting in recent releases. Are we seeing these failures when harvesting from more recently updated Dataverse installations?
Unfortunately no (or fortunately). All of the failing harvests are from repositories running 4.20 or earlier:
In case this helps, too, Harvard Dataverse was able to harvest about 3600 of Data INRAE's datasets. One dataset is using "Other" for Subject, which is one of the 14 "Subject" terms Dataverse ships with, and it looks like the rest of the harvested datasets are using Subject terms that don't ship with Dataverse, like the dataset at https://doi.org/10.15454/1.4938215048007249E12.
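If it would help to confirm the custom-subject theory, a script along these lines could pull a dataset's JSON export and flag Subject values that aren't in the controlled vocabulary Dataverse ships with. This is only a sketch: it assumes the native export API (/api/datasets/export?exporter=dataset-json&persistentId=...), the usual layout of that export, and that I have the 14 shipped Subject terms right; the base URL is a placeholder and the DOI is the Data INRAE example above.

```python
# Hedged sketch: fetch a dataset's JSON export from a remote Dataverse
# installation and flag Subject values outside the controlled vocabulary
# that Dataverse ships with. Assumes the /api/datasets/export endpoint and
# the usual citation-block layout of the dataset-json export.
import requests

SHIPPED_SUBJECTS = {
    "Agricultural Sciences", "Arts and Humanities", "Astronomy and Astrophysics",
    "Business and Management", "Chemistry", "Computer and Information Science",
    "Earth and Environmental Sciences", "Engineering", "Law",
    "Mathematical Sciences", "Medicine, Health and Life Sciences",
    "Physics", "Social Sciences", "Other",
}

site = "https://data.example.org"  # placeholder base URL of the remote repository
pid = "doi:10.15454/1.4938215048007249E12"  # the Data INRAE example above

resp = requests.get(
    f"{site}/api/datasets/export",
    params={"exporter": "dataset-json", "persistentId": pid},
    timeout=60,
)
resp.raise_for_status()
fields = resp.json()["datasetVersion"]["metadataBlocks"]["citation"]["fields"]

# The "subject" field's value is a list of strings in this export format.
subjects = next((f["value"] for f in fields if f["typeName"] == "subject"), [])
custom = [s for s in subjects if s not in SHIPPED_SUBJECTS]
print("Subjects:", subjects)
print("Not in the shipped vocabulary:", custom or "none")
```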
Just updating this issue. @philippconzett reported that the metadata that Harvard Dataverse harvests from DataverseNO is out of date. The superuser's harvesting clients page (e.g. https://dataverse.harvard.edu/harvestclients.xhtml?dataverseId=1) continues to report, after each scheduled attempt, that it's failing to get or update records from 15 harvesting jobs, including one from DataverseNO. Here's a list I made of the harvesting jobs that are failing. I excluded things on the harvesting clients page that aren't being updated weekly, like ICPSR (#63). OAI-PMH harvesting in Dataverse is being improved, but I'm not sure how the planned improvements will affect these failing harvesting jobs.
Just an update: Harvard Dataverse is now harvesting from Recherche Data Gouv, which has replaced Data INRAE. See #236. That harvesting job got only 37 of 2325 records. |
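For comparison against that 37-of-2325 figure, something like the following could page through ListIdentifiers with resumption tokens to count how many record identifiers the remote endpoint actually exposes. This is a sketch: the base URL is a placeholder for whatever OAI URL the harvesting client is configured with, and the metadata prefix (and any set) may need adjusting to match the client's settings.

```python
# Hedged sketch: count the record identifiers a remote OAI-PMH endpoint
# exposes by paging ListIdentifiers with resumption tokens.
import requests
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
base_url = "https://data.example.org/oai"  # placeholder remote endpoint
params = {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc"}

count = 0
while True:
    root = ET.fromstring(requests.get(base_url, params=params, timeout=60).content)
    headers = root.findall(f"{OAI_NS}ListIdentifiers/{OAI_NS}header")
    count += len(headers)
    token = root.find(f"{OAI_NS}ListIdentifiers/{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        break
    # Per the OAI-PMH spec, follow-up requests carry only the verb and token.
    params = {"verb": "ListIdentifiers", "resumptionToken": token.text.strip()}

print(f"Remote endpoint exposes {count} record identifiers")
```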
2023/12/18
Thanks @cmbz, I renamed this issue as we decided over Slack.
What I wrote earlier in this GitHub issue, about cases where Harvard Dataverse isn't able to harvest all records from certain Dataverse repositories, is three years old now. So I wonder if this spike should include:
@cmbz and @landreev, does that all make sense? What do you think? I could help with 1, 2 and 4, and maybe 3, though it's been tough for me to tell why harvesting has failed and I could use help figuring out why.
@jggautier I think that the three of us (at a minimum) should get together to discuss how best to rescope, redefine, and consolidate recommendations for the work needed on the harvesting feature overall. Currently, these conversations and ideas are scattered across multiple issues and spikes, making it difficult, in my opinion, to keep tabs on them all. The fact that we're approaching GREI Year 3 (see the related Search & Browse project management epic: IQSS/dataverse-pm#117) also makes this a good time to review our work and update our plans. Your thoughts? Which other core team members should be involved in this conversation and effort? @landreev, please also share your thoughts.
Getting together to discuss sounds good to me!
2024/02/05
I'm happy to participate in discussions on this. I'm not sure there is any better response than to continue addressing the known problems one at a time. Unfortunately, there may not be much potential for addressing them in definable chunks, such as being able to say "once this issue is addressed, we'll be able to close this whole class of open issues: ...". I may be wrong, of course, but it seems like with every problematic remote archive we've looked into, there was some new and unique issue discovered that needed to be fixed.
2024/03/13
Yes, I think so.
2024/03/13
According to the superuser dashboard, Harvard Dataverse fails to harvest a number of datasets from 6 of its harvesting clients. These are scheduled runs using dataverse_json as the metadata format. Here are details for two:
Harvard Dataverse users would like to link datasets from the CIFOR repository into their dataverse, but we're not able to until Harvard Dataverse is able to harvest the newer datasets. (See https://help.hmdc.harvard.edu/Ticket/Display.html?id=295542)
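For anyone trying to check this without access to the superuser dashboard page, the Harvesting Clients API should show roughly the same per-client status. A hedged sketch: it assumes a superuser API token and the /api/harvest/clients endpoint, and the exact response keys may differ between Dataverse versions, so it just prints whatever status-looking fields come back.

```python
# Hedged sketch: list a Dataverse installation's harvesting clients and
# their most recent results via the Harvesting Clients API. Assumes a
# superuser API token in the DATAVERSE_API_TOKEN environment variable;
# response field names may vary by Dataverse version.
import json
import os
import requests

server = "https://dataverse.harvard.edu"
api_token = os.environ["DATAVERSE_API_TOKEN"]  # assumed superuser token

resp = requests.get(
    f"{server}/api/harvest/clients",
    headers={"X-Dataverse-key": api_token},
    timeout=60,
)
resp.raise_for_status()

data = resp.json().get("data", {})
# The list key has varied across versions; try the likely names.
clients = data.get("clients") or data.get("harvestingClients") or []
for client in clients:
    name = client.get("nickName", "<unknown>")
    # Print any fields that look like last-run status (lastHarvest, lastResult, ...).
    status = {k: v for k, v in client.items() if k.lower().startswith("last")}
    print(name, json.dumps(status))
```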