ALTO 3.x support? #179
Comments
Hi @aploshay, can you provide a bit more context about your use case? Are you trying to use NewspaperWorks to ingest newspaper content that has already been OCR'd via tesseract?
Hi @ebenenglish, sorry for the belated reply. In the ESSI project, we're using NewspaperWorks components to generate word boundaries to support text search within a scanned document, and we do want to support newspapers that have already been OCR'd via tesseract. My local tesseract produces ALTO 3 output, which seems to work fine (at least for parsing word boundaries), but since it's a different major version I didn't know what might break or whether there were reasons to hold back to 2.x.
Hi @aploshay, sorry for the delay in responding. We haven't done any testing with ALTO 3 output on our end, or any research into the major differences. But it sounds like ALTO 3 is similar enough to ALTO 2 that our current conversion process is working OK, which is encouraging! Though now I see that ALTO 4.1 is the current version: https://www.loc.gov/standards/alto/ Our target of ALTO 2 is based on what the Library of Congress mandates in its digitization specs for the NDNP program. Even recently uploaded batches from 2019 still seem to be using ALTO 2 (ex: https://chroniclingamerica.loc.gov/data/batches/arhi_electabuzz_ver01/), and the NDNP Digitization Guidelines still mandate ALTO 2 (see page 9 here: https://www.loc.gov/ndnp/guidelines/NDNP_201921TechNotes.pdf).
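For anyone who wants to check their own files, here is a rough Python sketch of why word-boundary parsing can carry over between ALTO 2 and ALTO 3. It is not the NewspaperWorks code (that is Ruby); `word_boxes`, `local_name`, and `SAMPLE` are made-up names for illustration. It assumes the word coordinates live in the `HPOS`, `VPOS`, `WIDTH`, and `HEIGHT` attributes of `String` elements, and it matches elements by local name so the version-specific namespace (`ns-v2#` vs. `ns-v3#`) doesn't matter. It ignores measurement units and anything else ALTO 3 adds.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def word_boxes(alto_xml):
    """Return one dict per ALTO <String> element, whatever the ALTO version."""
    root = ET.fromstring(alto_xml)
    words = []
    for el in root.iter():
        if local_name(el.tag) != "String":
            continue
        words.append({
            "word": el.get("CONTENT", ""),
            "x": float(el.get("HPOS", 0)),
            "y": float(el.get("VPOS", 0)),
            "w": float(el.get("WIDTH", 0)),
            "h": float(el.get("HEIGHT", 0)),
        })
    return words


# Hypothetical minimal document; {v} is filled in with the ALTO major version.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v{v}#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Hello" HPOS="10" VPOS="20" WIDTH="50" HEIGHT="12"/>
    <String CONTENT="world" HPOS="70" VPOS="20" WIDTH="48" HEIGHT="12"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

# The same page declared as ALTO 2 and as ALTO 3 yields identical boxes.
assert word_boxes(SAMPLE.format(v=2)) == word_boxes(SAMPLE.format(v=3))
```

This only shows that the basic `String` geometry can be read the same way in both versions; it says nothing about other newspaper_works requirements that might depend on version-specific elements.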
It looks like this is currently using ALTO 2.0, but my tesseract is generating ALTO 3.x -- does ALTO 3 work with all the newspaper_works requirements, or are any changes required?