Ability to pull OAI-PMH metadata (and view records) #664
Comments
An update: the harvester (https://github.com/vphill/pyoaiharvester) was not working because https://digitalcollections.lib.utk.edu/ still has HTTP Auth enabled (you need to enter the username and password before you can see the site). Modifying the harvester's code so the credentials can be passed in via the URL, like http://username:[email protected]/, gets past that issue. However, we quickly ran into a different issue: there seems to be a problem with the SSL certificates. Details in this report. Rob said he will address the cert issue. After that is fixed we can try the harvester tool again. |
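For anyone who wants to test the HTTP Auth part separately from pyoaiharvester, here is a minimal sketch using only the Python standard library. It sends the credentials as an `Authorization` header instead of embedding them in the URL, so it is not the change made to the harvester, just an equivalent way to get past the login prompt. The function name `build_oai_request` and the `username`/`password` values are placeholders, and the `ListRecords`/`oai_dc` parameters are just one example request.

```python
import base64
import urllib.parse
import urllib.request


def build_oai_request(base_url, username, password, **params):
    """Return a urllib Request for an OAI-PMH call with a Basic auth header."""
    query = urllib.parse.urlencode(params)
    request = urllib.request.Request(f"{base_url}?{query}")
    # HTTP Basic auth is base64("username:password") in the Authorization header.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder credentials; the real ones are not part of this thread.
req = build_oai_request(
    "https://digitalcollections.lib.utk.edu/catalog/oai",
    "username",
    "password",
    verb="ListRecords",
    metadataPrefix="oai_dc",
)
# urllib.request.urlopen(req) would send the actual request.
```

Nothing is sent until `urlopen` is called, so the request can be inspected first. This does not help with the SSL certificate problem, which still has to be fixed on the server.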
To summarize the findings here: There are two things at play here that makes the
|
Thanks to @kirkkwang, I was able to successfully pull records in oai_dc and mods format. I'll comment back once I inspect these files more. |
This may need to be a separate ticket, but I'm finding some odd records in the OAI. For instance:
AND
I'm including @laritakr here for informational purposes. None of these types of resources (OBJ, Transcript, OCR, etc.) should be present in OAI. We just want the main record. Getting rid of all of these extra records would also make pulling OAI a lot faster. If they are not removed, I would need to find a way to exclude all of these extra attachments for DPLA ingests etc., as this information is not needed. |
Here's another odd record:
|
HOCR also is not something we want a record for:
|
Restricting the types of works that show in your OAI feed should be a new ticket, as it is separate from the requirements of this ticket. The extra records are due to the way the child works are created to allow additional metadata for file sets in your repo. We will need to identify which specific information to exclude and override the standard OAI behavior. |
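Until the feed itself is changed, a harvesting client could drop the attachment records on its own side. The sketch below is only an illustration: it assumes the attachment records can be recognised by a `dc:title` equal to one of the labels mentioned in this thread (OBJ, Transcript, OCR, HOCR, PRESERVE, MODS), which may not match how the real feed exposes them. The sample XML, the `example:` identifiers, and the `main_records` function are all made up for the example.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Labels taken from this thread. Matching them against dc:title is an
# assumption; the real feed may carry them in a different field.
ATTACHMENT_LABELS = {"OBJ", "TRANSCRIPT", "OCR", "HOCR", "PRESERVE", "MODS"}


def main_records(list_records_xml):
    """Return the <record> elements whose titles are not attachment labels."""
    root = ET.fromstring(list_records_xml)
    kept = []
    for record in root.iterfind(".//oai:record", NS):
        titles = {
            (title.text or "").strip().upper()
            for title in record.iterfind(".//dc:title", NS)
        }
        if titles & ATTACHMENT_LABELS:
            continue  # skip attachment-style records
        kept.append(record)
    return kept


# Hypothetical ListRecords response with one main record and one attachment.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical photograph</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>example:2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>OCR</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

kept_ids = [
    record.findtext("oai:header/oai:identifier", namespaces=NS)
    for record in main_records(SAMPLE)
]
```

This only filters after the fact, so it would not make the harvest itself any faster; that still needs the server-side change.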
@kirkkwang - I was able to pull both MODS and DC. Given that all of the sets have to be pulled each time, right now the time needed to get OAI is a bit restrictive, but this will be addressed when the ability to pull separate collections is added (#680). I approve the work completed in this ticket. |
Great! Thank you @mlhale7 |
@mlhale7 and @kirkkwang I'm getting a "We're sorry, something went wrong" when trying to pull the mods for a record. |
@kirkkwang @josh-morgan117 I'm not able to pull OAI-PMH using the same process I was before. I'm getting the error:
Is it possible that the cert is having another issue? |
Story
ref. #665
I am unable to pull metadata using OAI-PMH with my regular tools (https://github.com/vphill/pyoaiharvester) and I do not see individual records when navigating the feed (https://digitalcollections.lib.utk.edu/catalog/oai) in the browser. There appears to be an identifier issue that is causing no records to be retrievable.
Acceptance Criteria
Screenshots / Video
When I use pyoaiharvester, I get a ZeroDivisionError. Here's a screenshot showing the command and error:
When I click on "oai_dc" in the browser to retrieve an individual record, I get the error "idDoesNotExist." Here's a screenshot:
Finally, looking in the browser at https://digitalcollections.lib.utk.edu/catalog/oai, I am seeing records for attachments that I would not expect (PRESERVE, MODS, etc.). We just want a single record to appear for each digital asset.
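To reproduce the "idDoesNotExist" check without a browser, the same `GetRecord` request can be built by hand and the response inspected for OAI-PMH error codes. This is a rough sketch using the standard library only. The helper names, the identifier `oai:example:1`, and the sample error response are hypothetical; `GetRecord`, `identifier`, `metadataPrefix`, and the `idDoesNotExist` code come from the OAI-PMH protocol.

```python
import urllib.parse
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def get_record_url(base_url, identifier, metadata_prefix="oai_dc"):
    """Build a GetRecord URL for one identifier taken from the feed."""
    query = urllib.parse.urlencode(
        {"verb": "GetRecord", "identifier": identifier, "metadataPrefix": metadata_prefix}
    )
    return f"{base_url}?{query}"


def oai_error_codes(response_xml):
    """Return the codes of any <error> elements in an OAI-PMH response."""
    root = ET.fromstring(response_xml)
    return [error.get("code") for error in root.iterfind("oai:error", OAI_NS)]


# Hypothetical identifier; a real one would be copied from the feed.
url = get_record_url(
    "https://digitalcollections.lib.utk.edu/catalog/oai", "oai:example:1"
)

# Made-up response body of the kind a failing GetRecord call returns.
SAMPLE_ERROR = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <error code="idDoesNotExist">No matching identifier</error>
</OAI-PMH>"""

codes = oai_error_codes(SAMPLE_ERROR)
```

If an identifier listed by the feed comes back with `idDoesNotExist` here, the identifiers the feed advertises and the ones it can look up do not match, which fits the identifier issue described in the story.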
Testing Instructions and Sample Files
Notes
A conjecture - potentially this issue was introduced when we changed the URL to "digitalcollections.lib.utk.edu"?