Support non-flat directory structures (especially regarding ocr-d interoperability) #240

Open
maxnth opened this issue Nov 30, 2020 · 11 comments
Assignees
Labels
Status: Testing Needed Indicates that the implemented feature or bug fix need further manual testing or test coverage Type: Feature Indicates a feature request

Comments

@maxnth
Member

maxnth commented Nov 30, 2020

As already proposed in #160 it should be possible to work with non-flat directory structures, especially regarding interoperability with ocr-d.

The first and probably essential step would be to rewrite the way direct requests work. Instead of supplying a single bookpath, direct requests should expect e.g. a map of images with their image variants (+ possibly additional information).

Once this is implemented, writing a module that reads e.g. image variants and segmentation from an ocr-d METS file and passes them on to LAREX should become a lot easier than with our current implementation.
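
To make the proposal more concrete, here is a minimal sketch of what such a request payload could look like on the Java side. The class `DirectRequestSketch`, its methods and the example file names are hypothetical and not part of LAREX; only the idea of mapping each page to its image variants, instead of passing one bookpath, comes from the proposal above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not LAREX code: a page-to-variants map that could replace a single bookpath.
final class DirectRequestSketch {
    // Insertion order is kept so pages stay in the order the caller supplied them.
    private final Map<String, List<String>> imageMap = new LinkedHashMap<>();

    /** Registers one image variant (any path, flat or nested) for the given page. */
    void addVariant(String pageName, String imagePath) {
        imageMap.computeIfAbsent(pageName, k -> new ArrayList<>()).add(imagePath);
    }

    /** Returns the variants of a page, or an empty list if the page is unknown. */
    List<String> variantsOf(String pageName) {
        return Collections.unmodifiableList(
                imageMap.getOrDefault(pageName, Collections.emptyList()));
    }

    /** Returns the page names in insertion order. */
    List<String> pageNames() {
        return new ArrayList<>(imageMap.keySet());
    }

    public static void main(String[] args) {
        DirectRequestSketch request = new DirectRequestSketch();
        request.addVariant("0001", "OCR-D-IMG/0001.png");
        request.addVariant("0001", "OCR-D-IMG-BIN/0001.bin.png");
        request.addVariant("0002", "OCR-D-IMG/0002.png");
        System.out.println(request.pageNames());
    }
}
```

Because the paths are stored per page, nothing in this structure assumes that all files live in one flat directory.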

This would also allow the enhancement for OCR4all described here.

@maxnth maxnth added the Type: Enhancement Indicates an enhancement proposal for an existing feature label Nov 30, 2020
@chaddy314 chaddy314 mentioned this issue Jun 14, 2021
@maxnth
Member Author

maxnth commented Jul 14, 2021

The ability described above to read images / XML files directly from ocr-d METS files (as well as an overhaul of the data structure required for direct requests) was added in d675aab.

@maxnth maxnth closed this as completed Jul 14, 2021
@maxnth maxnth reopened this Jul 14, 2021
@maxnth maxnth added the Status: Testing Needed Indicates that the implemented feature or bug fix need further manual testing or test coverage label Jul 14, 2021
@bertsky

bertsky commented Jul 23, 2021

I have just started testing the new dev with all the new features (book import dialog with page selection and METS support, preserving PAGE reader/writer, metadata editor, book reload, fullscreen, image rotation etc). It's absolutely awesome! (Sorry I did not find the time for participating with reviews and tests earlier.)

However, there seem to be some issues with the METS support. I am still trying to fully grasp why my OCR-D workspaces won't open, but the following block in the library view seems odd:

    /*
    When images are loaded from each pagexml, instead of directly from mets( or legacy),
    the value of each imageMap.entry is a xmlPath instead of an imagePath. This value has to be changed to
    the imagePath read from the given xmlPath.
    */
    if(determineType(imagePathList.get(0))) {
        xmlPath = imagePathList.get(0);
        String parentFolder = new File(xmlPath).getParentFile().getParentFile().getAbsolutePath();
        imagePathList = MetsReader.getImagePathFromPage(xmlPath);
        /*
        Correct absolute paths for images as they are not constrained to pageXML.parent
        and described as relative to metsXml.root
        */
        List<String> correctedPathList = new ArrayList<>();
        for(String imagePath : imagePathList) {
            correctedPathList.add(parentFolder + File.separator + imagePath);
        }
        entry.setValue(correctedPathList);
        xmlmap.put(imageName.split("\\.")[0], xmlPath);
    } else {

I wonder:

  • why determine the type (PAGE vs image) via filename extension (and not MIME type), and why only by looking at the first file? (I would assume the MetsReader parses the structMap for physical pages, and then gets a single file from the chosen fileGrp per pageId. That file should be a PAGE if possible, but could also be an image – if no annotation exists for that pageId in that fileGrp yet. What makes matters worse is that by OCR-D spec there can be PAGE files as well as derived images in the same fileGrp – the latter for AlternativeImages of that PAGE. See OCR-D core for an implementation – just ignore the multi-fileGrp semantics.)
  • why combine the image file path from the PAGE path plus the /PcGts/Page/@imageFilename or //AlternativeImage/@filename, contrary to what the comment says (metsXml.root instead of pageXML.parent), and contrary to the OCR-D spec, which says image paths are relative to the METS and not relative to the PAGE location? (If this is intentional, you are not alone: PageViewer uses the same convention. But it does make life very difficult for OCR-Ders.)
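
The first point can be illustrated with a small sketch that classifies a file by the MIME type recorded for it in the METS rather than by its file name. The class `MimeKindSketch` and the method `kindOf` are hypothetical names, not LAREX or OCR-D API; the sketch assumes the MIME type has already been read from the METS file entry.

```java
import java.util.Locale;

// Hypothetical sketch: classify a METS file entry by its MIME type, not by its file name.
final class MimeKindSketch {
    enum Kind { PAGE, IMAGE, OTHER }

    // MIME type used for PAGE-XML files in OCR-D workspaces.
    static final String PAGE_MIME = "application/vnd.prima.page+xml";

    static Kind kindOf(String mimeType) {
        if (mimeType == null) {
            return Kind.OTHER;
        }
        String normalized = mimeType.trim().toLowerCase(Locale.ROOT);
        if (normalized.equals(PAGE_MIME)) {
            return Kind.PAGE;
        }
        if (normalized.startsWith("image/")) {
            return Kind.IMAGE;
        }
        return Kind.OTHER;
    }
}
```

Applied to every file instead of only the first one, a check like this would keep working when PAGE files and derived images share a fileGrp.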

LAREX used to have an inconsistency around the path resolution:

  • when reading pages, it would use the filesystem path (image suffix in the flat directory structure)
  • when writing pages, it would use the @imageFilename path (which could be different)

I have not even looked at the writing side for now. But I can imagine it is difficult to get consistency when you want to support both the old flat and the new METS bookpaths.

@bertsky

bertsky commented Jul 23, 2021

  • why combine the image file path from the PAGE path plus the /PcGts/Page/@imageFilename or //AlternativeImage/@filename, contrary to what the comment says (metsXml.root instead of pageXML.parent), and contrary to the OCR-D spec, which says image paths are relative to the METS and not relative to the PAGE location?

Looking more closely, the current implementation is neither the one nor the other (PAGE directory or METS directory), but a different beast: the PAGE directory's parent. Of course, the latter two often coincide, but as soon as the PAGE lives in the root level, or in a directory deeper than one level below the METS, then it won't work.
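
To show where the three candidates agree and where they diverge, here is a minimal sketch using `java.nio.file`. The class and method names are hypothetical; `relativeToPageParent` mirrors the `getParentFile().getParentFile()` step of the block quoted earlier, and `relativeToMets` follows the convention attributed to the OCR-D spec in this thread.

```java
import java.nio.file.Path;

// Hypothetical sketch comparing three base directories for an image reference found in a PAGE file.
final class ImagePathResolutionSketch {
    /** Relative to the directory that contains the METS file. */
    static Path relativeToMets(Path metsFile, String imageRef) {
        return metsFile.toAbsolutePath().getParent().resolve(imageRef).normalize();
    }

    /** Relative to the directory that contains the PAGE file. */
    static Path relativeToPageDir(Path pageFile, String imageRef) {
        return pageFile.toAbsolutePath().getParent().resolve(imageRef).normalize();
    }

    /** Relative to the parent of the PAGE file's directory (the behaviour described above). */
    static Path relativeToPageParent(Path pageFile, String imageRef) {
        return pageFile.toAbsolutePath().getParent().getParent().resolve(imageRef).normalize();
    }
}
```

For a PAGE file exactly one directory below the METS, `relativeToMets` and `relativeToPageParent` return the same path. For a PAGE file next to the METS, or two or more levels below it, they differ.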

@bertsky

bertsky commented Jul 23, 2021

Another observation is that the order of the bookpath in the library view has changed. It used to be sorted alphabetically (which is best for many purposes I guess), but now looks random. Ideally, we could click on the column titles to have it sorted by name or date...

@maxnth
Member Author

maxnth commented Jul 26, 2021

Something seems to have gone wrong during one of the last pushes into dev. This worked before, and I can reproduce the problem in the current dev. @chaddy314 or I will try to fix this ASAP.
Thanks for the report and sorry for the inconvenience.

@chaddy314
Member

why determine the type (PAGE vs image) via filename extension (and not MIME type), and why only by looking at the first file? (I would assume the MetsReader parses the structMap for physical pages, and then gets a single file from the chosen fileGrp per pageId. That file should be a PAGE if possible, but could also be an image – if no annotation exists for that pageId in that fileGrp yet. What makes matters worse is that by OCR-D spec there can be PAGE files as well as derived images in the same fileGrp – the latter for AlternativeImages of that PAGE. See OCR-D core for an implementation – just ignore the multi-fileGrp semantics.)

Internally, the MIMETYPE of each fileGrp is already being processed in MetsReader. Having the MIME type as an extra parameter for direct requests would indeed be a cleaner solution, especially if images can occur in a fileGrp of application/vnd.prima.page+xml files.

Looking more closely, the current implementation is neither the one nor the other (PAGE directory or METS directory), but a different beast: the PAGE directory's parent. Of course, the latter two often coincide, but as soon as the PAGE lives in the root level, or in a directory deeper than one level below the METS, then it won't work.

This seems to be an overlooked relic from the early stages of the implementation and will be fixed ASAP.

Another observation is that the order of the bookpath in the library view has changed. It used to be sorted alphabetically (which is best for many purposes I guess), but now looks random. Ideally, we could click on the column titles to have it sorted by name or date...

Alphabetical sorting will be pushed soon to the current pull request. Sorting by column is a really nice suggestion I'm considering implementing after this upcoming release.

Thanks for the detailed report!

@maxnth
Member Author

maxnth commented Jul 27, 2021

Sorting by column is a really nice suggestion I'm considering implementing after this upcoming release.

Using something like DataTables should make this pretty easy.

@maxnth
Member Author

maxnth commented Jul 30, 2021

@bertsky Could you send us the OCR-D workspace which failed to open for you so that we could test it locally? 😁

@bertsky

bertsky commented Jul 30, 2021

@maxnth unfortunately, I don't have it anymore. I'll recreate it and test out the new RC myself (but not before Aug 16, sry)

@bertsky

bertsky commented Oct 6, 2021

BTW the current implementation is still incompatible with fileGrps that use the OCR-D convention for storing derived images (i.e. inside the same grp, under the same page id, but merely an image mime type).

On these, I cannot get past the Open Book dialog. It just stays there, regardless of how often I press NEXT. And stderr shows:

org.springframework.web.servlet.handler.AbstractHandlerExceptionResolver.resolveException Resolved [org.springframework.web.bind.MissingServletRequestParameterException: Required String parameter 'fileMap' is not present]

(As explained above, the proper algorithm must iterate the fileGrp by structMap page ids and pick the maximum mimetype, i.e. PAGE or single image.)
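
A minimal sketch of that selection step, under stated assumptions: the files of the chosen fileGrp have already been parsed into simple entries carrying page id, MIME type and path, and the page order comes from the structMap. `PageFileSelectionSketch`, `FileEntry`, `rank` and `select` are hypothetical names, not LAREX or OCR-D core API. When a page has several images and no PAGE file, this sketch simply keeps the first image.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of per-page file selection; not LAREX or OCR-D core code.
final class PageFileSelectionSketch {
    static final String PAGE_MIME = "application/vnd.prima.page+xml";

    /** One file of the chosen fileGrp, already associated with a physical page id. */
    static final class FileEntry {
        final String pageId;
        final String mimeType;
        final String path;

        FileEntry(String pageId, String mimeType, String path) {
            this.pageId = pageId;
            this.mimeType = mimeType;
            this.path = path;
        }
    }

    /** PAGE ranks above images; anything else is ignored (rank 0). */
    static int rank(String mimeType) {
        if (PAGE_MIME.equals(mimeType)) {
            return 2;
        }
        if (mimeType != null && mimeType.startsWith("image/")) {
            return 1;
        }
        return 0;
    }

    /**
     * For every page id, in the given order, keeps the highest-ranked file.
     * Pages without a PAGE or image file in the group are left out of the result.
     */
    static Map<String, FileEntry> select(List<String> pageOrder, List<FileEntry> files) {
        Map<String, FileEntry> best = new LinkedHashMap<>();
        for (String pageId : pageOrder) {
            for (FileEntry file : files) {
                if (!file.pageId.equals(pageId) || rank(file.mimeType) == 0) {
                    continue;
                }
                FileEntry current = best.get(pageId);
                if (current == null || rank(file.mimeType) > rank(current.mimeType)) {
                    best.put(pageId, file);
                }
            }
        }
        return best;
    }
}
```

With this shape, a derived image stored next to its PAGE file under the same page id never displaces the PAGE file, and a page that only has an image still gets an entry.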

@maxnth maxnth added Type: Feature Indicates a feature request and removed Type: Enhancement Indicates an enhancement proposal for an existing feature labels Mar 14, 2022
@bertsky

bertsky commented Jan 31, 2024

Note: spaces and dots in the directory of the book will also break the METS reader.
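
Whether this is the actual cause is not established in this thread, but one pattern that is sensitive to dots is `imageName.split("\\.")[0]` from the block quoted earlier, which cuts a name at its first dot. A minimal sketch of a more defensive alternative, with the hypothetical names `BaseNameSketch` and `baseName`, strips only the last extension of the file name and leaves the directory part untouched:

```java
import java.nio.file.Paths;

// Hypothetical sketch: derive a page name from a path without being confused by extra dots or spaces.
final class BaseNameSketch {
    /** Returns the file name without its last extension; parent directories are ignored. */
    static String baseName(String path) {
        String fileName = Paths.get(path).getFileName().toString();
        int lastDot = fileName.lastIndexOf('.');
        // lastDot > 0 keeps names such as ".hidden" intact.
        return lastDot > 0 ? fileName.substring(0, lastDot) : fileName;
    }
}
```

For a file named `0001.bin.png` this yields `0001.bin`, whereas splitting at the first dot yields `0001`.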
