Support non-flat directory structures (especially regarding ocr-d interoperability) #240

maxnth · 2020-11-30T10:58:41Z

As already proposed in #160 it should be possible to work with non-flat directory structures, especially regarding interoperability with ocr-d.

The first and probably essential step would be to rewrite the way direct requests work. Instead of supplying a single bookpath, direct requests should expect e.g. a map of images with image variants (+ possibly additional information).

Once this is implemented, writing a module which reads e.g. image variants and segmentation from an ocr-d METS file and passes them on to LAREX should become a lot easier than with our current implementation.
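A minimal sketch of what such a map could look like, purely for illustration. The shape (page name mapped to a list of image variant paths) is an assumption, the class `DirectRequestSketch` and all paths are hypothetical, and the name `fileMap` is borrowed from the request parameter that shows up in the error message further down in this thread; none of this is the actual LAREX data structure.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a direct-request payload: page name -> image variants. */
final class DirectRequestSketch {

    /** Builds an insertion-ordered map from page name to the paths of its image variants. */
    static Map<String, List<String>> exampleFileMap() {
        Map<String, List<String>> fileMap = new LinkedHashMap<>();
        // Illustrative paths only: variants of one page need not share a directory,
        // which is what a single flat bookpath cannot express.
        fileMap.put("0001", List.of(
                "/data/book/OCR-D-IMG/0001.png",
                "/data/book/OCR-D-IMG-BIN/0001.bin.png"));
        fileMap.put("0002", List.of("/data/book/OCR-D-IMG/0002.png"));
        return fileMap;
    }

    public static void main(String[] args) {
        exampleFileMap().forEach((page, variants) ->
                System.out.println(page + " -> " + variants));
    }
}
```

A reader module (for METS or for the legacy flat layout) would only have to fill such a map; the viewer would no longer need to know how the files are laid out on disk.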

This would also allow the enhancement for OCR4all described here.

maxnth · 2021-07-14T09:59:32Z

The ability described above to read images / XML files directly from ocr-d METS files (as well as an overhaul of the data structure required for direct requests) was added in d675aab

bertsky · 2021-07-23T22:48:01Z

I have just started testing the new dev with all the new features (book import dialog with page selection and METS support, preserving PAGE reader/writer, metadata editor, book reload, fullscreen, image rotation etc). It's absolutely awesome! (Sorry I did not find the time to participate in reviews and tests earlier.)

However, there seem to be some issues with the METS support. I am still trying to fully grasp why my OCR-D workspaces won't open, but the following block in the library view seems odd:

LAREX/src/main/java/de/uniwue/web/controller/ViewerController.java

Lines 249 to 268 in 8ccd341

    
    /*
        When images are loaded from each pagexml, instead of directly from mets( or legacy),
        the value of each imageMap.entry is a xmlPath instead of an imagePath. This value has to be changed to
        the imagePath read from the given xmlPath.
     */
    if(determineType(imagePathList.get(0))) {
        xmlPath = imagePathList.get(0);
        String parentFolder = new File(xmlPath).getParentFile().getParentFile().getAbsolutePath();
        imagePathList = MetsReader.getImagePathFromPage(xmlPath);
        /*
            Correct absolute paths for images as they are not constrained to pageXML.parent
            and described as relative to metsXml.root
         */
        List<String> correctedPathList = new ArrayList<>();
        for(String imagePath : imagePathList) {
            correctedPathList.add(parentFolder + File.separator + imagePath);
        }
        entry.setValue(correctedPathList);
        xmlmap.put(imageName.split("\\.")[0], xmlPath);
    } else {

I wonder:

- why determine the type (PAGE vs image) via filename extension (and not MIME type), and why only by looking at the first file? (I would assume the MetsReader parses the structMap for physical pages, and then gets a single file from the chosen fileGrp per pageId. That file should be a PAGE if possible, but could also be an image – if no annotation exists for that pageId in that fileGrp yet. What makes matters worse is that by OCR-D spec there can be PAGE files as well as derived images in the same fileGrp – the latter for AlternativeImages of that PAGE. See OCR-D core for an implementation – just ignore the multi-fileGrp semantics.)
- why combine the image file path from the PAGE path plus the /PcGts/Page/@imageFilename or //AlternativeImage/@filename, contrary to what the comment says (metsXml.root instead of pageXML.parent), and contrary to the OCR-D spec, which says image paths are relative to the METS and not relative to the PAGE location? (If this is intentional, you are not alone: PageViewer uses the same convention. But it does make life very difficult for OCR-Ders.)

LAREX used to have an inconsistency around the path resolution:

- when reading pages, it would use the filesystem path (image suffix in the flat directory structure)
- when writing pages, it would use the @imageFilename path (which could be different)

I have not even looked at the writing side for now. But I can imagine it is difficult to get consistency when you want to support both the old flat and the new METS bookpaths.

bertsky · 2021-07-23T23:03:22Z

> why combine the image file path from the PAGE path plus the /PcGts/Page/@imageFilename or //AlternativeImage/@filename, contrary to what the comment says (metsXml.root instead of pageXML.parent), and contrary to the OCR-D spec, which says image paths are relative to the METS and not relative to the PAGE location?

Looking more closely, the current implementation is neither the one nor the other (PAGE directory or METS directory), but a different beast: the PAGE directory's parent. Of course, the latter two often coincide, but as soon as the PAGE lives in the root level, or in a directory deeper than one level below the METS, then it won't work.
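To make the difference concrete, here is a small sketch that contrasts the two resolution strategies. It is not LAREX code: the class `ImagePathResolutionSketch`, both method names and the example paths are hypothetical, and it only handles local filesystem paths.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch contrasting two ways of resolving an image reference. */
final class ImagePathResolutionSketch {

    /** Resolves the image reference against the directory that contains the METS file. */
    static Path resolveAgainstMets(Path metsFile, String imageFilename) {
        Path metsDir = metsFile.toAbsolutePath().getParent();
        return metsDir.resolve(imageFilename).normalize();
    }

    /** Mirrors the behaviour criticised above: resolve against the PAGE directory's parent. */
    static Path resolveAgainstPageGrandparent(Path pageFile, String imageFilename) {
        Path pageDir = pageFile.toAbsolutePath().getParent();
        Path base = pageDir.getParent();
        if (base == null) {
            base = pageDir; // the PAGE directory is already the filesystem root
        }
        return base.resolve(imageFilename).normalize();
    }

    public static void main(String[] args) {
        Path mets = Paths.get("/ws/mets.xml");
        String image = "OCR-D-IMG/0001.png";
        // Relative to the METS directory, independent of where the PAGE file lives.
        System.out.println(resolveAgainstMets(mets, image));
        // PAGE one level below the METS: both strategies coincide.
        System.out.println(resolveAgainstPageGrandparent(Paths.get("/ws/OCR-D-SEG/0001.xml"), image));
        // PAGE two levels below the METS: the grandparent strategy points elsewhere.
        System.out.println(resolveAgainstPageGrandparent(Paths.get("/ws/OCR-D-SEG/deep/0001.xml"), image));
        // PAGE next to the METS: the grandparent strategy leaves the workspace.
        System.out.println(resolveAgainstPageGrandparent(Paths.get("/ws/0001.xml"), image));
    }
}
```

Passing the METS location down to wherever the PAGE file is read would avoid deriving the base directory from the PAGE path at all.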

bertsky · 2021-07-23T23:15:09Z

Another observation is that the order of the bookpath in the library view has changed. It used to be sorted alphabetically (which is best for many purposes I guess), but now looks random. Ideally, we could click on the column titles to have it sorted by name or date...

maxnth · 2021-07-26T20:06:57Z

Something seems to have gone wrong during one of the last pushes into dev; this worked before, and I can reproduce the problem in the current dev. @chaddy314 or I will try to fix this ASAP.
Thanks for the report and sorry for the inconvenience.

chaddy314 · 2021-07-26T21:19:30Z

> why determine the type (PAGE vs image) via filename extension (and not MIME type), and why only by looking at the first file? (I would assume the MetsReader parses the structMap for physical pages, and then gets a single file from the chosen fileGrp per pageId. That file should be a PAGE if possible, but could also be an image – if no annotation exists for that pageId in that fileGrp yet. What makes matters worse is that by OCR-D spec there can be PAGE files as well as derived images in the same fileGrp – the latter for AlternativeImages of that PAGE. See OCR-D core for an implementation – just ignore the multi-fileGrp semantics.)

Internally, the MIMETYPE of each fileGrp is already being processed in MetsReader. Having the MIME type as an extra parameter for the direct request would indeed be a cleaner solution, especially if the possibility exists that images could be in application/vnd.prima.page+xml.

> Looking more closely, the current implementation is neither the one nor the other (PAGE directory or METS directory), but a different beast: the PAGE directory's parent. Of course, the latter two often coincide, but as soon as the PAGE lives in the root level, or in a directory deeper than one level below the METS, then it won't work.

This seems to be an overlooked relic from the early stages of the implementation and will be fixed ASAP.

> Another observation is that the order of the bookpath in the library view has changed. It used to be sorted alphabetically (which is best for many purposes I guess), but now looks random. Ideally, we could click on the column titles to have it sorted by name or date...

Alphabetical sorting will be pushed to the current pull request soon. Sorting by column is a really nice suggestion that I'm considering implementing after this upcoming release.

Thanks for the detailed report!

maxnth · 2021-07-27T08:55:58Z

> Sorting by column is a really nice suggestion I'm considering implementing after this upcoming release.

Using something like DataTables should make this pretty easy.

maxnth · 2021-07-30T07:05:47Z

@bertsky Could you send us the OCR-D workspace which failed to open for you so that we could test it locally? 😁

bertsky · 2021-07-30T12:51:05Z

@maxnth unfortunately, I don't have it anymore. I'll recreate it and test out the new RC myself (but not before Aug 16, sry)

bertsky · 2021-10-06T13:23:06Z

BTW the current implementation is still incompatible with fileGrps that use the OCR-D convention for storing derived images (i.e. inside the same grp, under the same page id, but merely an image mime type).

On these, I cannot get past the Open Book dialog. It just stays there, regardless of how often I press NEXT. And stderr shows:

org.springframework.web.servlet.handler.AbstractHandlerExceptionResolver.resolveException Resolved [org.springframework.web.bind.MissingServletRequestParameterException: Required String parameter 'fileMap' is not present]

(As explained above, the proper algorithm must iterate the fileGrp by structMap page ids and pick the maximum mimetype, i.e. PAGE or single image.)
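A rough sketch of that selection step, under stated assumptions: `MetsFile` is a hypothetical, simplified stand-in for one file entry of the chosen fileGrp (it is not a LAREX or OCR-D core type), the page ids are assumed to have been read from the physical structMap already, and when a page has several images but no PAGE file the first image is taken, which is a guess at what "single image" should mean here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: pick one file per physical page, preferring PAGE over images. */
final class PageFileSelectionSketch {

    static final String PAGE_MIME = "application/vnd.prima.page+xml";

    /** Simplified, hypothetical view of one file entry in the chosen fileGrp. */
    static final class MetsFile {
        final String pageId;
        final String mimeType;
        final String href;

        MetsFile(String pageId, String mimeType, String href) {
            this.pageId = pageId;
            this.mimeType = mimeType;
            this.href = href;
        }
    }

    /** Higher rank wins: PAGE beats any image; every other type is ignored. */
    static int rank(String mimeType) {
        if (PAGE_MIME.equals(mimeType)) return 2;
        if (mimeType != null && mimeType.startsWith("image/")) return 1;
        return 0;
    }

    /** Walks the page ids in structMap order and keeps the best-ranked file per page. */
    static Map<String, MetsFile> selectPerPage(List<String> pageIds, List<MetsFile> files) {
        Map<String, MetsFile> selected = new LinkedHashMap<>();
        for (String pageId : pageIds) {
            MetsFile best = null;
            for (MetsFile file : files) {
                if (!pageId.equals(file.pageId)) continue;
                int currentBest = (best == null) ? 0 : rank(best.mimeType);
                // Strict '>' keeps the first file among equally ranked candidates.
                if (rank(file.mimeType) > currentBest) best = file;
            }
            if (best != null) selected.put(pageId, best); // pages without usable files are skipped
        }
        return selected;
    }

    public static void main(String[] args) {
        List<MetsFile> files = new ArrayList<>();
        files.add(new MetsFile("p0001", "image/png", "OCR-D-SEG/p0001.IMG-BIN.png")); // derived image
        files.add(new MetsFile("p0001", PAGE_MIME, "OCR-D-SEG/p0001.xml"));
        files.add(new MetsFile("p0002", "image/png", "OCR-D-SEG/p0002.png"));         // no PAGE yet
        selectPerPage(List.of("p0001", "p0002", "p0003"), files)
                .forEach((page, file) -> System.out.println(page + " -> " + file.href));
    }
}
```

Because the decision is made per page and by MIME type, a fileGrp that mixes PAGE files and derived images no longer depends on the filename extension of its first entry.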

bertsky · 2024-01-31T18:34:52Z

Note: spaces and dots in the directory of the book will also break the METS reader.

maxnth added the Type: Enhancement label Nov 30, 2020

maxnth assigned chaddy314 and maxnth Nov 30, 2020

chaddy314 added a commit that referenced this issue Jan 18, 2021

implements nonflat directory direct request according to #240

688684f

chaddy314 mentioned this issue Jun 14, 2021

Ocrd library #255

Merged

chaddy314 added a commit that referenced this issue Jun 30, 2021

implements nonflat directory direct request according to #240

d3b53ba

maxnth closed this as completed Jul 14, 2021

maxnth reopened this Jul 14, 2021

maxnth added the Status: Testing Needed label Jul 14, 2021

bertsky mentioned this issue Oct 6, 2021

fixed missing textline orientation #289

Merged

maxnth added Type: Feature and removed Type: Enhancement labels Mar 14, 2022

maxnth mentioned this issue Feb 26, 2023

MetsReader and ImageLoader: support remote URLs #329

Closed
