ALTO 3.x support? #179

Open
aploshay opened this issue Aug 20, 2019 · 3 comments
@aploshay

It looks like newspaper_works currently targets ALTO 2.0, but my Tesseract install generates ALTO 3.x. Does ALTO 3 work with all of the newspaper_works requirements, or are any changes required?

@ebenenglish
Collaborator

Hi @aploshay, can you provide a bit more context about your use case?

Are you trying to use NewspaperWorks to ingest newspaper content that has already been OCR'd via Tesseract?

@aploshay
Author

aploshay commented Sep 5, 2019

Hi @ebenenglish, sorry for the belated reply:

In the ESSI project, we're using NewspaperWorks components to generate word boundaries to support text search within a scanned document, and do want to support newspapers that have already been OCR'ed via tesseract. My local tesseract produces ALTO 3 output, which seems to work fine (at least for parsing word boundaries), but since it's a different major version I didn't know what might break or if there were reasons to hold back to 2.x.

@ebenenglish
Collaborator

Hi @aploshay, sorry for the delay in responding. We haven't done any testing with ALTO 3 output on our end, or any research about the major differences.

But it sounds like ALTO 3 is similar enough to ALTO 2 that our current conversion process is working OK, which is encouraging! Though now I see that ALTO 4.1 is the current version...

https://www.loc.gov/standards/alto/

Our ALTO 2 target is based on what the Library of Congress mandates in its digitization specs for the National Digital Newspaper Program (NDNP). Even recently uploaded batches from 2019 still seem to use ALTO 2 (ex: https://chroniclingamerica.loc.gov/data/batches/arhi_electabuzz_ver01/), and the NDNP Digitization Guidelines still mandate ALTO 2 (see page 9 here: https://www.loc.gov/ndnp/guidelines/NDNP_201921TechNotes.pdf).
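
For anyone who wants to check word-boundary parsing across ALTO versions locally, here is a minimal sketch in Python. It is not how NewspaperWorks does its conversion; it only illustrates one likely reason ALTO 3 output parses fine: the `String` elements and their `CONTENT`, `HPOS`, `VPOS`, `WIDTH`, and `HEIGHT` attributes are the same in ALTO 2 and 3, and mainly the XML namespace (`ns-v2#` vs `ns-v3#`) differs. The sample document and the function names `alto_namespace` and `word_boxes` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written ALTO 3 fragment (not real Tesseract output).
SAMPLE_ALTO_V3 = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="Daily" HPOS="100" VPOS="200" WIDTH="150" HEIGHT="40"/>
            <SP WIDTH="20"/>
            <String CONTENT="News" HPOS="270" VPOS="200" WIDTH="140" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on element tags."""
    return tag.rsplit("}", 1)[-1]


def alto_namespace(root):
    """Return the namespace URI of the root element ('' if there is none)."""
    tag = root.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def word_boxes(xml_text):
    """Collect word text and bounding boxes from every String element,
    matching on the local element name so the ALTO version is irrelevant."""
    root = ET.fromstring(xml_text)
    boxes = []
    for element in root.iter():
        if local_name(element.tag) != "String":
            continue
        boxes.append({
            "text": element.get("CONTENT", ""),
            "x": float(element.get("HPOS", 0)),
            "y": float(element.get("VPOS", 0)),
            "w": float(element.get("WIDTH", 0)),
            "h": float(element.get("HEIGHT", 0)),
        })
    return boxes


if __name__ == "__main__":
    print(alto_namespace(ET.fromstring(SAMPLE_ALTO_V3)))
    for box in word_boxes(SAMPLE_ALTO_V3):
        print(box)
```

Swapping `ns-v3#` for `ns-v2#` in the sample should yield the same word boxes, which is consistent with what was reported above. This only covers word boundaries; it says nothing about other ALTO 3 additions that a stricter, schema-validating consumer might care about.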


4 participants