What would Data INRA like Harvard Dataverse to harvest? #236

Closed
jggautier opened this issue Dec 1, 2023 · 10 comments

Comments

@jggautier
Collaborator

jggautier commented Dec 1, 2023

The harvesting job that used to harvest metadata into the collection at https://dataverse.harvard.edu/dataverse/inra_harvested has been failing for a while. It's one of several failing jobs. See #92.

So clicking on the title of each dataset no longer leads to the dataset but to a 404 page, and Harvard Dataverse hasn't been updating the metadata it's harvested.

The installation's URL, https://data.inra.fr, now redirects to https://entrepot.recherche.data.gouv.fr/dataverse/inrae. Dimitri Szabo let us know about this and we updated the Dataverse map (IQSS/dataverse-installations#162), but we didn't adjust the harvesting job settings, which are still trying to use the OAI-PMH endpoint https://data.inra.fr/oai, which isn't working anymore.

We might need to update the settings so that Harvard Dataverse is harvesting from https://entrepot.recherche.data.gouv.fr/oai instead, and possibly adjust the URL, name and description of the collection at https://dataverse.harvard.edu/dataverse/inra_harvested depending on what the folks from INRAE would like Harvard Dataverse to harvest.

Their harvesting sets are listed at https://entrepot.recherche.data.gouv.fr/oai?verb=ListSets and the list includes a set called INRAE. Whoever works on this issue might ask Dimitri Szabo ([email protected]) what they'd like Harvard Dataverse to harvest, then adjust the settings to re-harvest their metadata and ensure that it stays up to date.
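
For anyone who wants to check the available sets by script rather than in a browser, here is a minimal sketch of parsing an OAI-PMH `ListSets` response with Python's standard library. The element names and namespace come from the OAI-PMH 2.0 specification. The sample XML is a made-up, shortened stand-in for what the endpoint returns, and the function name `parse_list_sets` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_list_sets(xml_text):
    """Return a list of (setSpec, setName) pairs from a ListSets response."""
    root = ET.fromstring(xml_text)
    sets = []
    for set_el in root.findall("./oai:ListSets/oai:set", OAI_NS):
        spec = set_el.findtext("oai:setSpec", default="", namespaces=OAI_NS)
        name = set_el.findtext("oai:setName", default="", namespaces=OAI_NS)
        sets.append((spec.strip(), name.strip()))
    return sets


# Hypothetical, shortened sample; the real response lists more sets
# and the set names shown here are assumptions.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <request verb="ListSets">https://entrepot.recherche.data.gouv.fr/oai</request>
  <ListSets>
    <set><setSpec>ALL</setSpec><setName>ALL</setName></set>
    <set><setSpec>INRAE</setSpec><setName>INRAE</setName></set>
  </ListSets>
</OAI-PMH>"""

available_sets = parse_list_sets(SAMPLE_RESPONSE)
```

To run this against the live endpoint, the response body fetched from the `ListSets` URL above could be passed to `parse_list_sets` in place of the sample.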

@pdurbin
Member

pdurbin commented Dec 1, 2023

I sent @DS-INRA (Dimitri) a link to this issue.

@DS-INRAE
Member

DS-INRAE commented Dec 7, 2023

Hello, I'm only reacting today; thanks for raising the issue!
Please use the set ALL instead of INRAE from https://entrepot.recherche.data.gouv.fr/oai as other organizations are now also in the repository :) .
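
As a rough illustration of what harvesting the ALL set means at the protocol level, the sketch below builds a selective-harvest request URL using the standard OAI-PMH query parameters (`verb`, `metadataPrefix`, `set`). The helper name `build_oai_request` is hypothetical, and `oai_dc` is assumed here only because every OAI-PMH repository must support it; the format Harvard Dataverse actually requests for this client is not stated in this thread.

```python
from urllib.parse import urlencode


def build_oai_request(base_url, verb, metadata_prefix=None, set_spec=None):
    """Build an OAI-PMH request URL from a base URL and standard arguments."""
    params = {"verb": verb}
    if metadata_prefix:
        params["metadataPrefix"] = metadata_prefix
    if set_spec:
        # Restricts the request to records in the named set.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Assumed metadata format (oai_dc); the set name ALL comes from the comment above.
all_set_url = build_oai_request(
    "https://entrepot.recherche.data.gouv.fr/oai",
    "ListIdentifiers",
    metadata_prefix="oai_dc",
    set_spec="ALL",
)
```

Opening the resulting URL in a browser is a quick way to confirm that the set name is accepted before configuring a harvesting client.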

@jggautier
Collaborator Author

jggautier commented Dec 7, 2023

Perfect. Thanks @DS-INRA! We'll use that set then.

I'll also change the name of the collection at https://dataverse.harvard.edu/dataverse/inra_harvested to "Recherche Data Gouv Harvested Dataverse", change the collection's URL to https://dataverse.harvard.edu/dataverse/recherchedatagouv, and remove the description.

Let me know if you'd like anything done differently with those changes, too.

@jggautier
Collaborator Author

I told Dataverse to delete the harvesting client "inra". The records were removed from https://dataverse.harvard.edu/dataverse/inra_harvested within minutes, but the client remains in the table at https://dataverse.harvard.edu/harvestclients.xhtml?dataverseId=1, with "DELETE IN PROGRESS" in the "Last Results" column. This sounds similar to what was reported in the GitHub issue at IQSS/dataverse#7052, although I'm not sure if Harvard Dataverse's server was rebooted while the records were being deleted.

I'll check tomorrow to see if the client has been deleted. If it has been, I'll create a new harvesting client to harvest the ALL set from https://entrepot.recherche.data.gouv.fr/oai.

If the client hasn't been deleted by tomorrow, I'll ask my developer colleagues whether the server was rebooted this afternoon, whether that's why the client hasn't been deleted, and whether the client can be deleted another way so that I can create a new client and harvest records from the ALL set of https://entrepot.recherche.data.gouv.fr/oai.

@jggautier
Collaborator Author

jggautier commented Dec 8, 2023

Client was deleted 🎉, and records in the ALL set of https://entrepot.recherche.data.gouv.fr/oai are being harvested into https://dataverse.harvard.edu/dataverse/recherchedatagouv. I'll close this issue and open another if there are any problems.

Thanks @DS-INRA!

@DS-INRAE
Member

DS-INRAE commented Dec 8, 2023

Great, many thanks!

@jggautier
Collaborator Author

Just an update here. 37 records were harvested into https://dataverse.harvard.edu/dataverse/recherchedatagouv and Dataverse reports that it couldn't harvest 2325 records.

I'll update the issue at #92, where I've been writing about failing harvests.

@DS-INRAE
Member

> I'll update the issue at #92, where I've been writing about failing harvests.

Thanks, feel free to tag me there so that I don't miss it!

@pdurbin
Member

pdurbin commented Jul 17, 2024

@DS-INRA you can also click "subscribe" on the issue.

@jggautier
Collaborator Author

jggautier commented Jul 17, 2024

@DS-INRA, #92 was closed instead, a few months after I left that comment about updating that GitHub issue. So I don't think you need to subscribe to it.

Harvard Dataverse has still harvested just 37 records from Recherche Data Gouv, and it has stopped harvesting from all of the repositories it used to harvest from so that we can address indexing issues that were affecting how well it harvests.

Eventually we'll get back to looking into specific cases where Harvard Dataverse isn't harvesting well or at all from certain repositories, like Recherche Data Gouv, and we'll ping you at @DS-INRA if we wind up creating a new GitHub issue about it.
