BookCModel and PDFs #119

Open
kromabiles opened this issue Jul 14, 2020 · 5 comments

@kromabiles

kromabiles commented Jul 14, 2020

Hello Diego,

As you know, our IR is currently exploring ways to use the IMI to ingest/create BookCModel objects from PDFs. Since the IMI is the main tool we rely on for ingesting content into Islandora, could we explore and test some options for a simpler PDF-to-TIFF conversion capability? Some of the PDF objects we'd like to ingest as books have 50+ pages, which would need to be split up and converted from PDF to TIFF. My brain hurts.

More than happy to bounce off ideas and do some testing with you. :)

Best,
Katie

@DiegoPino
Contributor

@kromabiles great. Following up here. A few questions about this:

  • Are you good with IMI extracting the PDF into TIFFs? Or do you want IMI to reuse the same config the Book reader already uses?
  • How do we deal with page-level metadata? I see two options (both could be implemented):
    • You create the rows for each page yourself, and I find some clever UI way of letting IMI know it should only fill the OBJ from the parent column PDF (extracted as TIFF). This also means that if you add 10 rows instead of, e.g., the 100 pages the PDF contains, it would only ingest 10.
    • You add nothing. In that case IMI will create the most basic metadata for you, basically just the title and the page number.
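To make the two page-metadata options concrete, here is a minimal Python sketch. It is not IMI code: the function name `build_page_rows`, the row keys `title` and `page_number`, and the title format are all hypothetical, chosen only to illustrate the behaviour described above.

```python
def build_page_rows(parent_title, pdf_page_count, page_rows=None):
    """Return one metadata dict per page to ingest (illustrative only).

    parent_title   -- title of the parent book object
    pdf_page_count -- number of pages found in the PDF
    page_rows      -- optional list of user-supplied page rows (dicts)
    """
    if page_rows:
        # Option 1: the user supplied rows. Only as many pages as there are
        # rows get ingested, and never more than the PDF actually contains.
        rows = page_rows[:pdf_page_count]
        return [dict(row, page_number=i) for i, row in enumerate(rows, start=1)]

    # Option 2: nothing supplied. Generate the most basic metadata,
    # just a title and a page number, for every page in the PDF.
    return [
        {"title": f"{parent_title}, page {i}", "page_number": i}
        for i in range(1, pdf_page_count + 1)
    ]
```

With 10 supplied rows and a 100-page PDF this returns 10 entries; with no rows it returns one generated entry per page.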

The processing would need to happen during the ingest itself (batch), or it would be just too slow... we need to test.

What is the largest PDF you have there?

Secondly, I will also set up a digital object with the same PDF directly on play.archipelago.nyc so we can test performance and compare.

Thanks!

@kromabiles
Author

@DiegoPino Yes, extracting PDFs into TIFFs would be great. Our book collections don't have any page-level metadata; they are all structured as a single object description. :/

Our largest PDF is about 3GB and consists of 93 pages (yearbook).

Seeing Archipelago in action sounds exciting! :)

@DiegoPino
Contributor

Excellent. I will start planning. I will probably borrow the Book module settings, but I feel I should go TIFF first and then compress to JP2 if needed. I just tested a JP2 generated by Islandora (core) and it was 25 MB in size, while the same image as a TIFF was 10 MB, which was a little bit annoying!
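For reference, a minimal sketch of what a "TIFF first" extraction step could look like. This thread does not say which tool IMI or the Book module uses, so Ghostscript, the `tiff24nc` device, the 300 dpi default, and the helper name `build_gs_command` are assumptions for illustration only.

```python
from pathlib import Path


def build_gs_command(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript command that writes one TIFF per PDF page.

    Assumes the `gs` binary is on PATH. The output pattern uses %04d so
    pages are numbered page by page (name-0001.tif, name-0002.tif, ...).
    """
    out_pattern = Path(out_dir) / (Path(pdf_path).stem + "-%04d.tif")
    return [
        "gs",
        "-dNOPAUSE",          # do not pause between pages
        "-dBATCH",            # exit when the file has been processed
        "-dSAFER",            # restrict file access from within the PDF
        "-sDEVICE=tiff24nc",  # 24-bit RGB, uncompressed TIFF output
        f"-r{dpi}",           # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",
        str(pdf_path),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine with Ghostscript installed. Compressing the resulting TIFFs to JP2 would be a separate, later step.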

@DiegoPino
Contributor

@kromabiles sorry for the slowness, I have a solution! But it requires some testing and planning. Give me until the end of the week to enable it in our sandbox, and I will give you credentials there. I will also copy your templates and prepare a spreadsheet test case, but even better if you have a few PDFs in a zip and a demo spreadsheet around.

@kromabiles
Author

No worries! Thanks, Diego - files are too big to attach here, so I'll send them over to you via email.
