Extension: Rename Images to fit GT-File by name #3

Open
M3ssman opened this issue Nov 6, 2023 · 6 comments
Comments


M3ssman commented Nov 6, 2023

Description

Currently, when image files are referenced in a group of mets.xml, they get downloaded and pushed to a directory.
This way, the naming similarity between image and GT data is lost. But this similarity is a key requirement for tools like Transkribus or LAREX to match images with GT data for further corrections or extensions.

Even worse, because our data consists of an overall sample of 40,000+ prints, it includes, for example, several images named "00000008.jpg", which could overwrite each other.
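
To illustrate the overwrite risk, here is a minimal sketch that groups image paths by their bare file name and reports every name used more than once. The function name `find_colliding_names` and the sample paths are hypothetical and not taken from the actual data or tooling.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_colliding_names(image_paths):
    """Return a dict of bare file names that occur under more than one path."""
    by_name = defaultdict(list)
    for path in image_paths:
        # Only the last path component survives when files land in one directory.
        by_name[PurePosixPath(path).name].append(path)
    return {name: paths for name, paths in by_name.items() if len(paths) > 1}


# Hypothetical paths, for illustration only.
sample_paths = [
    "print_a/00000008.jpg",
    "print_b/00000008.jpg",
    "print_a/00000009.jpg",
]
collisions = find_colliding_names(sample_paths)
```

Any name reported here would be written twice into the same target directory, so only the last copy would remain.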

Author

M3ssman commented Nov 6, 2023

Proposal: each image file is linked to a physical page.
This container has a CONTENTIDS attribute which holds the requested image URN. The image URN corresponds to the OCR-data URN, in which all ":" characters are replaced by "+" and a final underscore with the language ISO code is attached.
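
A rough sketch of how such a name could be derived, assuming the physical pages are `mets:div` elements carrying `CONTENTIDS` inside the `PHYSICAL` structMap and that the language code is supplied by the caller. The helper names, the sample METS fragment and the URN in it are made up for illustration; this is not how the existing tooling does it.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}


def urn_to_basename(urn, lang):
    """Replace every ':' with '+' and append '_' plus the language code."""
    return urn.replace(":", "+") + "_" + lang


def page_basenames(mets_xml, lang):
    """Map each FILEID linked to a physical page to its URN-based base name."""
    root = ET.fromstring(mets_xml)
    names = {}
    pages = ".//mets:structMap[@TYPE='PHYSICAL']//mets:div[@CONTENTIDS]"
    for div in root.iterfind(pages, NS):
        basename = urn_to_basename(div.get("CONTENTIDS"), lang)
        # All files linked to the same page share one base name.
        for fptr in div.iterfind("mets:fptr", NS):
            names[fptr.get("FILEID")] = basename
    return names


# Hypothetical METS fragment, for illustration only.
SAMPLE_METS = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" CONTENTIDS="urn:example:0001-p0008">
        <mets:fptr FILEID="IMG_0008"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""
names = page_basenames(SAMPLE_METS, "lat")
```

Because the base name comes from the page URN rather than from the original file name, two images both called "00000008.jpg" would no longer end up with the same name, provided their URNs differ.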

Author

M3ssman commented Nov 9, 2023

For a test sample, cf. https://github.com/M3ssman/gt-test/releases/tag/v2.1.2, where only 108 images were included for 128 GT pages (Latin), since 20 images collide.

Collaborator

tboenig commented Nov 23, 2023

The problem is solved with the update of OCR-D (Bagit). Please also use the new action workflow.

Regards tboenig

tboenig closed this as completed Nov 23, 2023
Author

M3ssman commented Dec 1, 2023

Re-open, some more review required.

tboenig reopened this Dec 5, 2023
Member

kba commented Dec 6, 2023

IIUC this behavior should be fixed since v2.59.0, the relevant PR is OCR-D/core#1137.

Is there a regression? Are you still experiencing files being overwritten when bagging?

Author

M3ssman commented Dec 11, 2023

@kba I'll take a closer look at the relevant modifications and try this out with our custom setup ASAP
