Harvest metadata from SRDA #243

jggautier · 2024-01-26T19:59:07Z

After Dataverse v6.1 is applied to Harvard Dataverse, we'll need to re-harvest SRDA's metadata into the collection at https://dataverse.harvard.edu/dataverse/srda_harvested

OAI URL:
https://srda.sinica.edu.tw/oai_pmh/oai

Set:
None

This is following the work discussed in the GitHub issue at IQSS/dataverse#7624

landreev · 2024-01-28T15:46:18Z

I saw that it failed to harvest any records in prod. on Friday.
I believe we need to remove all the SRDA records harvested from Datacite first, before trying to re-harvest their metadata into the same collection. I.e., we will need to delete that client, then re-create it from scratch.
I unscheduled it for now, let's handle this after 6.1 is deployed (in the next couple of days, hopefully).

landreev · 2024-01-29T13:26:08Z

BTW, the 6.1 release patch that I'm deploying in prod. is already running on demo.
So you can see the redirects working for the records harvested from https://srda.sinica.edu.tw/ there:
https://demo.dataverse.org/dataverse/srda

landreev · 2024-01-30T19:07:33Z

Also, note that the old records harvested via Datacite in https://dataverse.harvard.edu/dataverse/srda_harvested are now redirecting properly.
Let's coordinate about re-harvesting their records directly, later this week maybe?

jggautier · 2024-01-30T20:09:46Z

Nice! Yeah anytime this week would be fine I think. If you don't mind, could you delete the client, harvest directly from them later this week, and let me know if that went okay? Or I could do it when you think it's best this week. I'll have blocks of time this Thursday and Friday afternoon that would work well.

landreev · 2024-02-05T14:54:45Z

A quick update: it took a good portion of the day on Fri. to delete all the datacite-harvested records.
An attempt to re-harvest directly did not succeed, unfortunately. I did re-run that direct harvest on demo, just as a quick check, and it worked like a charm again there. I have no explanation of what's going on. Another production mystery that will need to be resolved.

landreev · 2024-02-06T14:22:25Z

https://dataverse.harvard.edu/dataverse/srda_harvested

I left the client scheduled, weekly, Sun. 10pm. Let's keep an eye on it and check what happens over the weekend.

jggautier · 2024-02-06T16:19:03Z

Our contact at SRDA emailed us that he noticed that those 2926 records were harvested. That's one short of the 2,927 records in https://srda.sinica.edu.tw/oai_pmh/oai?verb=ListRecords&metadataPrefix=oai_dc (which interestingly has no pagination, so it takes a while to load the metadata of all 2,927 records).

I let him know that you told Harvard Dataverse to try again this Sunday and that we'll see what happens then 🤞

landreev · 2024-02-06T17:09:13Z

2928 actually, 2 failed. I can give them the 2 identifiers - it looks like it may be an intermittent problem on their end.
I would maybe also ask them why they are formatting their identifiers as <dc:identifier>10.6141/TW-SRDA-AA000001-1</dc:identifier> instead of <dc:identifier>https://doi.org/10.6141/TW-SRDA-AA000001-1</dc:identifier> or <dc:identifier>doi:10.6141/TW-SRDA-AA000001-1</dc:identifier>. That was the part that was causing most of the problems with their archive so far.
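
For illustration only, here is a minimal sketch of how a harvester could accept all three identifier forms mentioned above and map them to a single one. The function name `normalize_doi` and the DOI pattern are assumptions made for this example; this is not the workaround that was actually applied in Dataverse.

```python
import re
from typing import Optional

# Hypothetical helper for this example; not Dataverse code.
# Loose DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
_DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def normalize_doi(identifier: str) -> Optional[str]:
    """Return 'doi:<DOI>' for a bare, 'doi:'-prefixed, or doi.org URL form.

    Returns None when the value does not look like a DOI at all.
    """
    value = identifier.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if value.lower().startswith(prefix):
            value = value[len(prefix):]
            break
    return f"doi:{value}" if _DOI_PATTERN.match(value) else None
```

With this, the bare value `10.6141/TW-SRDA-AA000001-1` and the two prefixed variants all normalize to the same string, while a value that is not a DOI yields `None`.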

landreev · 2024-02-06T17:21:34Z

Those may all be moot-er points. On a closer look, I am very apprehensive about having their archive harvested regularly on a schedule. Their server does not appear to understand the incremental harvesting parameter (from=). Sending ListIdentifiers with from=2024-02-06 returns the same list of 2928 records. (Our Harvester doesn't use ListRecords; it does ListIdentifiers, then GetRecord individually). The <datestamp> entries don't seem to represent the modification times of the records - it is simply the current date/time for all the records.
I am not ok with re-harvesting ~3000 records from scratch every week. It would be preferable if they could fix this.
No pagination is another thing, yes; but less of a big deal compared to this.
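
To make the request pattern described above concrete, here is a rough sketch of the two kinds of OAI-PMH requests involved: one ListIdentifiers call (optionally with `from=`), followed by one GetRecord call per identifier. The helper names are made up for this example and this is not the Dataverse harvester code; only the verbs, the parameter names, and the XML namespace come from the OAI-PMH protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def list_identifiers_url(base_url, metadata_prefix, from_date=None):
    """Build a ListIdentifiers request; from_date enables incremental harvesting."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"

def get_record_url(base_url, identifier, metadata_prefix):
    """Build the GetRecord request for one identifier returned by ListIdentifiers."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    return f"{base_url}?{urlencode(params)}"

def parse_headers(xml_text):
    """Return (identifier, datestamp) pairs from a ListIdentifiers response body."""
    root = ET.fromstring(xml_text)
    return [
        (
            header.findtext("oai:identifier", namespaces=OAI_NS),
            header.findtext("oai:datestamp", namespaces=OAI_NS),
        )
        for header in root.iterfind(".//oai:header", OAI_NS)
    ]
```

If the server honors `from=`, the list returned by `parse_headers` for a recent date should be much shorter than the full list; a response with the same ~2928 entries is the symptom described above.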

jggautier · 2024-02-07T14:04:03Z

Okay, in the email thread I asked about the identifier format and about the datestamp.

landreev · 2024-02-07T20:33:28Z

OK, great, saw your message.
If they ask followup questions about the identifier format - that part is not a problem for us anymore; Dataverse was confused by it, but I worked around it. It's just something they may want to consider - since the identifiers in question are registered and resolvable DOIs, why not advertise them as such?

The "from=" parameter on the other hand is a fairly important thing.

jggautier · 2024-02-20T13:34:35Z

@landreev, our contact, Chucky, wrote back that "regarding the date stamp issue, we have already instructed our IT personnel to update the date to reflect the latest data version release date in https://srda.sinica.edu.tw/oai_pmh/oai?verb=ListRecords&metadataPrefix=oai_dc," and that they'd make corrections to the identifier format (dc:identifier).

I'm not sure if it matters that they're referring to a ListRecords request and not the ListIdentifiers request that you said Dataverse uses. Should I ask them to clarify?

landreev · 2024-02-26T02:12:50Z

I scheduled it again for Sun. night, it's running now, we'll check on it tomorrow and see what it does.
It's usually the same code underneath that drives the behavior of ListRecords and ListIdentifiers, when it comes to dates. I didn't think it was necessary to ask, unless we see something weird.
I'm a little worried about how they phrased it - "update the date to reflect the latest data version". With "the date" possibly implying that it's the same date for all their records; which would again mean having to reharvest all their records every time. But let's see what/how many records we get tonight.
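
As a sketch of the kind of sanity check this concern suggests, the two functions below look at the `<datestamp>` values from a response and flag the two symptoms discussed in this thread: records older than the requested `from` date, and every record carrying the same date. The function names are hypothetical and the check assumes datestamps that start with `YYYY-MM-DD`, as OAI-PMH datestamps do.

```python
def violates_from(datestamps, from_date):
    """Return datestamps that predate from_date and so should not have been returned."""
    cutoff = from_date[:10]
    # ISO dates compare correctly as strings, so no date parsing is needed here.
    return [stamp for stamp in datestamps if stamp[:10] < cutoff]

def all_share_one_date(datestamps):
    """True when more than one record exists and all of them carry the same date.

    That pattern hints that the datestamp is a generation time rather than a
    per-record modification time.
    """
    return len(datestamps) > 1 and len({stamp[:10] for stamp in datestamps}) == 1
```

Neither check proves that a server is broken, since a bulk update could legitimately touch every record on one day, but a repeated positive across harvests would match the behavior described here.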

landreev · 2024-02-26T14:33:18Z

It did re-harvest all their records last night, i.e., still not incremental. I'm assuming we can just harvest their stuff like this once in a while manually for the time being, until they get the from= parameter to work on their end.
One small improvement is that there were no broken records this time (there were 2 that failed to import the last time around).

jggautier · 2024-02-26T14:44:41Z

Nice! Okay. I got the sense that they hadn't updated the dates, yet, and Chucky will email again when they have. I can reach out in a month, on Mar 26, to ask if they've made the change, too. If they have by then we can set the schedule to weekly. If they haven't yet, we can just run another manual harvest then.

How's that sound?

cmbz · 2024-03-12T17:49:52Z

2024/03/12

We are now receiving records from this repository (running the harvesting jobs "manually" instead of on a schedule)
SRDA will need to eventually fix their system so that we can harvest incrementally and schedule weekly harvest
Keeping this issue open until they are able to fix their repository and so that we will remember to reharvest periodically

jggautier · 2024-03-26T15:39:07Z

I see different dates in the <datestamp> of each record in https://srda.sinica.edu.tw/oai_pmh/oai?verb=ListRecords&metadataPrefix=oai_dc, but I'm not sure if that means that the datestamp issue has been resolved so that Harvard Dataverse can harvest incrementally and we can schedule weekly harvest.

I've emailed Chucky to ask if it's been resolved and, if not, to email us when it has been so that I don't forget. See the email thread in RT.

jggautier · 2024-03-27T17:51:40Z

@landreev, Chucky's replied and I'm concerned that I've caused some confusion, maybe because I don't understand enough about how incremental harvesting works.

I've CC'ed you on the email thread. Could you reply when you get the chance?

cmbz · 2024-06-25T14:23:21Z

@jggautier @landreev bumping this to find out status

jggautier · 2024-06-25T14:35:30Z

What a coincidence! I was just writing the comment at IQSS/dataverse-pm#171 (comment) about following up with all of the repository contacts who've emailed me, although in SRDA's case I see that there are some issues unique to them to follow up on.

jggautier · 2024-06-25T14:55:28Z

Looks like there are 2970 records in SRDA's OAI-PMH feed, and Harvard Dataverse has 2929 of them (https://dataverse.harvard.edu/dataverse/srda_harvested). I think we turned off scheduled harvesting - so Harvard Dataverse isn't trying to update harvested records each week - while indexing-related issues are being worked on, so it's been a while since Harvard Dataverse has tried to update the records it has from SRDA.

In late March our contact Chucky emailed:

In addition, I noticed that your colleague raised an issue regarding the ineffectiveness of the incremental harvesting parameter (from=) in the GitHub issue at #243. I have already asked our IT team to look into this and resolve the issue.

I think the "from" parameter is working now for SRDA's OAI-PMH feed, at least from what I can tell when I try to use it, such as https://srda.sinica.edu.tw/oai_pmh/oai?verb=ListRecords&from=2024-02-25T00:00:00Z&metadataPrefix=oai_dc.
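
For anyone who wants to repeat this kind of spot check with a different date, here is a small sketch that builds the same style of URL from a timezone-aware datetime. The function name is made up for this example; the `YYYY-MM-DDThh:mm:ssZ` timestamp form is the seconds-level granularity defined by OAI-PMH, and whether a given server supports it (rather than day-level only) is something to verify against its Identify response.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

def list_records_url(base_url, since, metadata_prefix="oai_dc"):
    """Build a ListRecords URL with a UTC 'from' timestamp."""
    if since.tzinfo is None:
        raise ValueError("since must be timezone-aware")
    stamp = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    params = {"verb": "ListRecords", "from": stamp, "metadataPrefix": metadata_prefix}
    # safe=":" keeps the colons in the timestamp readable instead of %3A.
    return f"{base_url}?{urlencode(params, safe=':')}"

url = list_records_url(
    "https://srda.sinica.edu.tw/oai_pmh/oai",
    datetime(2024, 2, 25, tzinfo=timezone.utc),
)
```

Here `url` comes out identical to the link above, so swapping in a more recent date gives a quick way to see whether the response shrinks accordingly.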

In addition to letting them know that we continue to work on indexing-related issues that are affecting harvesting, I can ask them to confirm that they've resolved this "from" parameter issue.

@cmbz and @landreev, what do you think?

cmbz · 2024-06-25T15:52:14Z

@jggautier sounds good to me.

jggautier · 2024-06-26T14:39:59Z

@landreev, Chucky from SRDA confirmed in the email that the "from" parameter is working now. 🎉

landreev · 2024-06-28T21:50:06Z

OK, I'll try to harvest from them and see if it is now incremental.
But yes, scheduled regular harvesting is still off, pending fixing reindexing for harvested content, unfortunately. This is a blocker right now for most issues that track harvesting content from specific remote archives.

landreev · 2024-09-24T19:05:43Z

As I was working on a harvesting-related dev. issue, I've been running test harvests from various places. I have discovered in the process that harvesting from SRDA is in fact working really well now. Their new OAI-PMH server is now properly serving just the records updated or added since the last harvest, so we are getting their records in reasonable increments.

Our harvested SRDA collection is up-to-date at the moment: https://dataverse.harvard.edu/dataverse/srda_harvested
So, I'm wondering if we should close the issue as completed - ?

To be clear, I don't know if the currently-running 6.3 would be able to re-harvest their entire collection (or, specifically, to re-index it in real time) if we wiped it clean and tried to do that from scratch. But in this incremental mode it appears to be working properly. Also, this particular collection is much easier on the server indexing-wise, since we are harvesting in oai_dc (no files, and less metadata to index).

jggautier · 2024-09-25T17:57:53Z

Hey @landreev. Would you mind if I tell Harvard Dataverse to harvest from SRDA each week? The schedule for the harvesting client is set to "none" right now.

I'd consider this GitHub issue completed if we knew that we could update records from SRDA on a regular basis.

landreev · 2024-09-26T14:44:03Z

@jggautier Yes, sure.

landreev · 2024-09-26T14:48:44Z

@jggautier BTW, a minor thing I noticed about SRDA content: it looks like there is an occasional delay between their new documents being released, and the DOIs getting registered for them. In practical terms this means that after we harvest from them, the redirects are not immediately working for these newest records (appearing at the top of the collection) for the first couple of days.
This appears to be a temporary issue that reliably "fixes itself".

jggautier · 2024-09-26T14:53:35Z

Ah okay, think it's worth asking if they're aware of this and how it might affect discovery?

Also I told HDV to harvest from SRDA every Sunday at 12am. So I'm all for considering this GitHub issue completed and closing it.

jggautier · 2024-09-27T15:18:28Z

Just following up that I emailed Chucky from SRDA with the good news about the improved harvesting and asked about that delay you noticed. The emails are at https://help.hmdc.harvard.edu/Ticket/Display.html?id=287243#txn-8459551.

landreev added the Feature: Harvesting, Curation for Harvard Collection, and Size: 3 labels Feb 6, 2024

landreev self-assigned this Feb 6, 2024

cmbz added this to IQSS Dataverse Project Feb 6, 2024

cmbz moved this to SPRINT READY in IQSS Dataverse Project Feb 6, 2024

cmbz mentioned this issue Feb 6, 2024

GREI 3: HDV Task - Improve OAI-PMH Harvesting IQSS/dataverse-pm#171

Open

57 tasks

cmbz added the Status: Needs Input label Mar 27, 2024

scolapasta moved this from SPRINT READY to On Hold ⌛ in IQSS Dataverse Project Aug 16, 2024
