Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regression in 0.7.6: @imageFilename relpath is resolved incorrectly #355

Closed
bertsky opened this issue Dec 13, 2024 · 4 comments
Closed

regression in 0.7.6: @imageFilename relpath is resolved incorrectly #355

bertsky opened this issue Dec 13, 2024 · 4 comments

Comments

@bertsky
Copy link

bertsky commented Dec 13, 2024

IIRC this was already in a usable state before the recent changes: LAREX needs to resolve the relative path of @imageFilename in a PAGE-XML. It must be clear to the user how to provide for that.

I now get errors like this:

java.lang.IllegalArgumentException: No image exists at /data/abel_leibmedicus_1699/../jpg/abel_leibmedicus_1699_0007.jpg

That happens when loading a PAGE fileGrp from the METS here.

So obviously, LAREX concatenated the @imageFilename="../jpg/abel_leibmedicus_1699_0007.jpg" with the path of the METS root instead of the directory of the PAGE-XML file itself.

I cannot think of a reasonable scenario where that would make sense.

Either you resolve via METS completely – then you have to look up the image mets:file via physical structmap – or you resolve it via PAGE – prefixing @imageFilename path with the PAGE directory.

So this seems to be a regression of the recent changes integrated by @maxnth – can you please take a look?

@bertsky
Copy link
Author

bertsky commented Dec 13, 2024

Wow, something superfishy is going on here. If I rewrite my PAGE files to appease LAREX, removing the ../ prefix in @imageFilename, I get the same dysfunct UI with an even stranger error message:

java.lang.IllegalArgumentException: No image exists at /data/abel_leibmedicus_1699/abel_leibmedicus_1699_0009.jpg

So despite the fact that now @imageFilename="jpg/abel_leibmedicus_1699_0007.jpg", this time LAREX does not concatenate this to the METS root, but tries to be smart.

@maxnth
Copy link
Member

maxnth commented Dec 13, 2024

I'm not sure this ever worked in LAREX with the data provided above as the logic for retrieving image paths from the @imageFilename attribute is that the path found there is relative to the METS root directory, see:

/*
Correct absolute paths for images as they are not constrained to pageXML.parent
and described as relative to metsXml.root
*/
List<String> absolutePathList = new ArrayList<>();
assert pageImagePathList != null;
for(String imagePath : pageImagePathList) {
if(!metsRoot.endsWith(File.separator)) { metsRoot += File.separator; }
absolutePathList.add(metsRoot + imagePath.replaceAll("\"" , ""));
}
imageMap.put(fileKey,absolutePathList);
xmlMap.put(fileKey, xmlPath);

and not relative to the PAGE XML file location (as you already noted in the second comment).
I didn't change any of that in my recent commits and if I'm not missing something this has worked this way for the METS mode since a while (if not ever).

I think I sent @chaddy314 (please chime in if wrong) the data from the OCR-D/assets repo back then when he implemented it, where that behaviour (Page @imageFilename relative to METS root) is always the case (https://github.com/OCR-D/assets/blob/ca108faf0e95cc823a9e84cd0a1602282ae006b1/data/kant_aufklaerung_1784-page-region-line-word_glyph/data/OCR-D-GT-PAGE/PAGE_0017_PAGE.xml#L8-L9). Both for the image filenames as well as the alternative images.

When removing the ../ prefix it actually loads correctly for me. See the attached video.

Screen.Recording.2024-12-13.at.11.20.48.mov

I guess you replaced the ../ in batch which doesn't help for certain pages (like the one in your error message) as the filename naming is inconsistent in the data you provided. Most of them have the ../ prefix, but not all, see e.g. https://github.com/OCR-D/gt_structure_1_1/blob/d90228a226a0163eccc9f6b3ac8706884b48daa0/data/abel_leibmedicus_1699/GT-PAGE/abel_leibmedicus_1699_0009.xml

I can also reproduce this for the alternative images as a whole in the data you provided, as they're noted in a "weird" way anyways, e. g. as <AlternativeImage filename="abel_leibmedicus_1699_0007_B.jpg.tif"/>, so neither relative to the PAGE XML file, nor relative to the METS root, nor available in the METS file. Not sure how one would successfully retrieve them.

The most robust way moving forward would probably be to use the paths from the METS file as the paths in there should always be relative to the METS root (?).

@bertsky
Copy link
Author

bertsky commented Dec 13, 2024

Thanks for your fast and thorough response @maxnth!

Sorry, I got confused about LAREX' expectations and did not remember this correctly.

Indeed, this is not a regression at all, not even something (the old version of) LAREX can do much about.

Yes, you are absolutely right the problem is that the prefix is not even consistent. So when my LAREX still did not load/show anything after removing ../, I wrongly assumed no pages are working (where actually it just happened that some were still incorrect, which looked the same on the UI).

I could fix this in the data by a smarter shell script. Will provide that as a PR for the GT repo.

Yes, I dimly remember now that we agreed on the OCR-D convention of having the same path in the PAGE as in the METS.

@bertsky bertsky closed this as completed Dec 13, 2024
@bertsky
Copy link
Author

bertsky commented Dec 14, 2024

see OCR-D/gt_structure_all#3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants