Harvest metadata from SRDA #243
I saw that it failed to harvest any records in prod. on Friday.
BTW, the 6.1 release patch that I'm deploying in prod. is already running on demo.
Also, note that the old records harvested via DataCite in https://dataverse.harvard.edu/dataverse/srda_harvested are now redirecting properly.
Nice! Yeah, anytime this week would be fine, I think. If you don't mind, could you delete the client, harvest directly from them later this week, and let me know if that went okay? Or I could do it when you think it's best this week. I'll have blocks of time this Thursday and Friday afternoon that would work well.
A quick update: it took a good portion of the day on Fri. to delete all the DataCite-harvested records.
For https://dataverse.harvard.edu/dataverse/srda_harvested, I left the client scheduled, weekly, Sun. 10pm. Let's keep an eye on it and check what happens over the weekend.
Our contact at SRDA emailed us that he noticed that those 2,926 records were harvested. That's one short of the 2,927 records in their OAI-PMH feed. I let him know that you told Harvard Dataverse to try again this Sunday and that we'll see what happens then 🤞
2,928 actually; 2 failed. I can give them the 2 identifiers - it looks like it may be an intermittent problem on their end.
Those may all be moot-er points. On a closer look, I am very apprehensive about having their archive harvested regularly on a schedule. Their server does not appear to understand the incremental harvesting parameter (from=).
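For context on what that missing parameter does: an OAI-PMH client requests only the records changed since a given datestamp by adding `from=` to a ListIdentifiers or ListRecords request. A minimal sketch of the two request shapes (illustrative only, not Dataverse's actual harvester code; the helper name is made up):

```python
from urllib.parse import urlencode

# Illustrative sketch, not Dataverse's harvester: an OAI-PMH client asks for
# incremental updates by adding a "from=" datestamp. A server that ignores
# "from=" answers both request shapes with the full record set every time.
BASE_URL = "https://srda.sinica.edu.tw/oai_pmh/oai"

def list_identifiers_url(base_url, metadata_prefix, from_ts=None):
    """Build a ListIdentifiers request; from_ts switches on incremental mode."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if from_ts is not None:
        params["from"] = from_ts  # UTC datestamp, e.g. 2024-02-25T00:00:00Z
    return base_url + "?" + urlencode(params)

full = list_identifiers_url(BASE_URL, "oai_dc")
incremental = list_identifiers_url(BASE_URL, "oai_dc", "2024-02-25T00:00:00Z")
```

If the server honors `from=`, the second URL should return only recently changed records; if it ignores it, both return everything, which forces a full re-harvest on every scheduled run.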
Okay, in the email thread I asked about the identifier format and about the datestamp. |
OK, great, saw your message. The "from=" parameter, on the other hand, is a fairly important thing.
@landreev, our contact, Chucky, wrote back that "regarding the date stamp issue, we have already instructed our IT personnel to update the date to reflect the latest data version release date in https://srda.sinica.edu.tw/oai_pmh/oai?verb=ListRecords&metadataPrefix=oai_dc," and that they'd make corrections to the identifier format (dc:identifier). I'm not sure if it matters that they're referring to a ListRecords request and not the ListIdentifiers request that you said Dataverse uses. Should I ask them to clarify?
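It likely matters less than it seems: both ListRecords and ListIdentifiers expose the same per-record header `<datestamp>`, which is the value that `from=` filtering keys on. A sketch of pulling those datestamps out of a response (the XML snippet below is a made-up example, not real SRDA data):

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed ListIdentifiers response for illustration only.
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:srda.sinica.edu.tw:example-1</identifier>
      <datestamp>2024-02-20T08:00:00Z</datestamp>
    </header>
    <header>
      <identifier>oai:srda.sinica.edu.tw:example-2</identifier>
      <datestamp>2024-03-01T08:00:00Z</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def extract_datestamps(xml_text):
    """Return each record header's <datestamp>, the field from= filters on."""
    root = ET.fromstring(xml_text)
    return [d.text for d in root.findall(".//oai:header/oai:datestamp", NS)]
```

If SRDA updates the datestamps correctly, both verbs should reflect the change, since they share the same headers.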
I scheduled it again for Sun. night; it's running now. We'll check on it tomorrow and see what it does.
It did re-harvest all their records last night, i.e., still not incremental. I'm assuming we can just harvest their stuff like this once in a while manually for the time being, until they get the from= parameter to work on their end.
Nice! Okay. I got the sense that they hadn't updated the dates yet, and Chucky will email again when they have. I can reach out in a month, on Mar 26, to ask if they've made the change, too. If they have by then, we can set the schedule to weekly. If they haven't yet, we can just run another manual harvest then. How's that sound?
2024/03/12
I see different dates in the feed. I've emailed Chucky to ask if it's been resolved and, if not, to email us when it has been so that I don't forget. See the email thread in RT.
@landreev, Chucky's replied and I'm concerned that I've caused some confusion, maybe because I don't understand enough about how incremental harvesting works. I've CC'ed you on the email thread. Could you reply when you get the chance?
@jggautier @landreev bumping this to find out status |
What a coincidence! I was just writing the comment at IQSS/dataverse-pm#171 (comment) about following up with all of the repository contacts who've emailed me, although in SRDA's case I see that there are some issues unique to them to follow up on. |
Looks like there are 2970 records in SRDA's OAI-PMH feed, and Harvard Dataverse has 2929 of them (https://dataverse.harvard.edu/dataverse/srda_harvested). I think we turned off scheduled harvesting - so Harvard Dataverse isn't trying to update harvested records each week - while indexing-related issues are being worked on, so it's been a while since Harvard Dataverse has tried to update the records it has from SRDA. In late March our contact Chucky emailed:
I think the "from" parameter is working now for SRDA's OAI-PMH feed, at least from what I can tell when I try to use it, such as https://srda.sinica.edu.tw/oai_pmh/oai?verb=ListRecords&from=2024-02-25T00:00:00Z&metadataPrefix=oai_dc. In addition to letting them know that we continue to work on indexing-related issues that are affecting harvesting, I can ask them to confirm that they've resolved this "from" parameter issue.
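One informal way to sanity-check this (a heuristic assumption of mine, not an official OAI-PMH compliance test) is to compare the identifiers returned with and without `from=`: the incremental set should be a subset of the full set, and normally much smaller unless everything really did change since the cutoff.

```python
def honors_from_parameter(full_ids, incremental_ids):
    """Heuristic check, not an official compliance test: a from=-filtered
    response should be a proper subset of the unfiltered one. A server that
    ignores from= returns the same set both times, which fails this check."""
    full_ids = set(full_ids)
    incremental_ids = set(incremental_ids)
    return incremental_ids <= full_ids and len(incremental_ids) < len(full_ids)

# Example: 3 records total, only 1 changed since the cutoff datestamp.
assert honors_from_parameter({"rec1", "rec2", "rec3"}, {"rec3"})
```

The record IDs above are placeholders; in practice the two sets would come from harvesting the feed twice, once with and once without the `from=` datestamp.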
@jggautier sounds good to me.
OK, I'll try to harvest from them and see if it is now incremental.
As I was working on a harvesting-related dev. issue, I've been running test harvests from various places. I have discovered in the process that harvesting from SRDA is in fact working really well now. Their new OAI-PMH server is now properly serving just the records updated or added since the last harvest, so we are getting their records in reasonable increments. To be clear, I don't know if the currently-running 6.3 would be able to re-harvest their entire collection (or, specifically, to re-index it in real time) if we wiped it clean and tried to do that from scratch. But in this incremental mode it appears to be working properly. Also, this particular collection is much easier on the server indexing-wise, since we are harvesting in
Hey @landreev. Would you mind if I tell Harvard Dataverse to harvest from SRDA each week? The schedule for the harvesting client is set to "none" right now. I'd consider this GitHub issue completed if we knew that we could update records from SRDA on a regular basis.
@jggautier Yes, sure.
@jggautier BTW, a minor thing I noticed about SRDA content: it looks like there is an occasional delay between their new documents being released and the DOIs getting registered for them. In practical terms this means that after we harvest from them, the redirects are not immediately working for these newest records (appearing at the top of the collection) for the first couple of days.
Ah okay, I think it's worth asking if they're aware of this and how it might affect discovery? Also, I told HDV to harvest from SRDA every Sunday at 12am. So I'm all for considering this GitHub issue completed and closing it.
Just following up that I emailed Chucky from SRDA with the good news about the improved harvesting and asked about that delay you noticed. The emails are at https://help.hmdc.harvard.edu/Ticket/Display.html?id=287243#txn-8459551. |
After Dataverse v6.1 is applied to Harvard Dataverse, we'll need to re-harvest SRDA's metadata into the collection at https://dataverse.harvard.edu/dataverse/srda_harvested
OAI URL:
https://srda.sinica.edu.tw/oai_pmh/oai
Set:
None
This is following the work discussed in the GitHub issue at IQSS/dataverse#7624