existing segmentation does not load #280

bertsky · 2021-08-30T13:10:29Z

I believe I found a regression in the current version (compared to 0.6-RC1 27dc5bc): the page's existing PAGE-XML does not load. A brief warning appears (saying that segments could not be loaded), then LAREX autosegments. There is no related error in the logs / stdout.

(I did see an error message with the following stack-trace the other day, but it does not seem related, time-wise:)

29-Aug-2021 05:58:45.789 INFO [http-nio-8080-exec-1] org.apache.coyote.http11.Http11Processor.service Error parsing HTTP request header
 Note: further occurrences of HTTP request parsing errors will be logged at DEBUG level.
        java.lang.IllegalArgumentException: Invalid character found in method name [0x030x000x00/*0xe00x000x000x000x000x00Cookie: ]. HTTP method names must be tokens
                at org.apache.coyote.http11.Http11InputBuffer.parseRequestLine(Http11InputBuffer.java:419)
                at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:269)
                at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65)
                at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:893)
                at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1723)
                at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
                at java.lang.Thread.run(Thread.java:748)

chaddy314 · 2021-08-30T14:20:23Z

Does this happen in mets or legacy mode?
If it happens in mets mode: in which fileGrp (with mimeType) does it happen?

bertsky · 2021-08-30T14:27:22Z

In legacy mode (library shows flat).

Sry, forgot to attach example data – here it is.
179-heizkostenabrechnung_01.06.2018-31.05.2019-page01.zip

chaddy314 · 2021-08-30T14:56:35Z

Because of the . in the filename, LAREX treats everything after it as a sub-extension and cuts that part off when determining the filename of its PAGE-XML.

A quick workaround is to use a different character in dates (or to rename the old PAGE-XML to match the format LAREX expects).
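To illustrate the truncation described above, here is a minimal sketch. It is not taken from the LAREX code base: the class and method names are hypothetical, and the `.png` image extension is an assumption (the thread only shows the `.zip` name).

```java
// Hypothetical illustration of cutting a filename at its first dot.
// Not LAREX's actual implementation.
class FirstDotDemo {
    /** Returns the part of the name before the first dot (or the whole name). */
    static String cutAtFirstDot(String fileName) {
        int dot = fileName.indexOf('.');
        return dot < 0 ? fileName : fileName.substring(0, dot);
    }

    public static void main(String[] args) {
        // Assumed image name; only the base name appears in the thread.
        String image = "179-heizkostenabrechnung_01.06.2018-31.05.2019-page01.png";
        // Prints "179-heizkostenabrechnung_01.xml", which no longer matches
        // a PAGE-XML named after the full base name.
        System.out.println(cutAtFirstDot(image) + ".xml");
    }
}
```

With dates written as 01-06-2018 instead, the first dot is the one before the extension, so the expected PAGE-XML name comes out intact.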

bertsky · 2021-09-02T12:05:26Z

Ah, many thanks – did not notice that crucial difference to all the other files (which were fine). Indeed, the workaround is trivial.

bertsky · 2021-09-02T12:18:23Z

Still, it would be great if LAREX was smarter on suffix detection. I often have files imported from PDF or multi-page TIFF, which distinguish pages via a NAME.PAGE.tif scheme...

bertsky · 2021-09-02T12:22:23Z

> Still, it would be great if LAREX was smarter on suffix detection. I often have files imported from PDF or multi-page TIFF, which distinguish pages via a NAME.PAGE.tif scheme...

The old version already covered this case IIRC.
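One possible shape for "smarter" suffix detection is to strip only recognised extensions from the end of the name and leave all other dots alone. The sketch below is an assumption-laden illustration, not LAREX's API: `SuffixStripper`, `baseName`, and the extension list are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; the suffix list and names are assumptions.
class SuffixStripper {
    // Assumed set of trailing tokens that count as image extensions.
    static final Set<String> KNOWN =
            new HashSet<>(Arrays.asList("tif", "tiff", "png", "jpg", "jpeg"));

    /** Strips known extensions from the end only; other dots stay in the name. */
    static String baseName(String fileName) {
        String name = fileName;
        while (true) {
            int dot = name.lastIndexOf('.');
            if (dot <= 0) {
                return name; // no dot left, or a leading dot only
            }
            String suffix = name.substring(dot + 1).toLowerCase(Locale.ROOT);
            if (!KNOWN.contains(suffix)) {
                return name; // trailing token is not an extension: stop here
            }
            name = name.substring(0, dot);
        }
    }

    public static void main(String[] args) {
        System.out.println(baseName("NAME.0003.tif")); // prints "NAME.0003"
    }
}
```

Under this scheme the page number in NAME.PAGE.tif stays part of the base name, and dots inside dates are left untouched.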

maxnth · 2021-09-02T12:57:49Z

> Still, it would be great if LAREX was smarter on suffix detection.

I fully agree, it's more restrictive and "complex" than it needs to be, we'll definitely look into this.

bertsky · 2021-09-02T12:58:31Z

Oh, and to make matters worse: commas are not allowed either. Even in METS mode – the open dialog looks good but does not succeed, because the directory name gets split at each comma and only the last part survives, which yields (for a path Nachrichten_aus_der_Bruder-Gemeine,_1819,_No._01 and fileGrp TEXT):

java.io.FileNotFoundException: /usr/local/tomcat/_No._01/TEXT/TEXT_0001.xml (No such file or directory)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at java.io.FileInputStream.<init>(FileInputStream.java:93)
        at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
        at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:623)
        at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:148)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:806)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:177)
        at de.uniwue.web.io.MetsReader.parseXML(MetsReader.java:95)
        at de.uniwue.web.io.MetsReader.getImagePathFromPage(MetsReader.java:102)
        at de.uniwue.web.controller.ViewerController.direct(ViewerController.java:119)
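The `_No._01` directory in the exception is exactly what remains when the path is split at commas and only the last element is kept. The sketch below merely reproduces that symptom; where the split actually happens in LAREX is not established in this thread, and the class and method names are hypothetical.

```java
// Hypothetical reproduction of the symptom, not LAREX code.
class CommaSplitDemo {
    /** Splits a value at commas and keeps only the last element. */
    static String lastCommaPart(String value) {
        String[] parts = value.split(",");
        return parts[parts.length - 1];
    }

    public static void main(String[] args) {
        String bookDir = "Nachrichten_aus_der_Bruder-Gemeine,_1819,_No._01";
        // Prints "_No._01", the directory seen in the FileNotFoundException.
        System.out.println(lastCommaPart(bookDir));
    }
}
```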

bertsky · 2021-11-10T09:26:13Z

Another effect of the additional dots/commas in the filenames besides the segmentation not loading (now fixed?) or the open dialog not concluding (commas in bookdir?) is that only the last page among each subset will show up (e.g. only *.0003.tif if you actually have *.0000.tif up to *.0003.tif).
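One plausible mechanism for the "only the last page shows up" effect is that pages are indexed by a truncated base name, so files that differ only after the first dot collide. This is an assumption used for illustration; `PageCollisionDemo` and the file names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a name collision, not LAREX code.
class PageCollisionDemo {
    /** Indexes files by the part of their name before the first dot. */
    static Map<String, String> indexByTruncatedName(String... files) {
        Map<String, String> pages = new LinkedHashMap<>();
        for (String file : files) {
            int dot = file.indexOf('.');
            String key = dot < 0 ? file : file.substring(0, dot);
            pages.put(key, file); // a later file with the same key replaces the earlier one
        }
        return pages;
    }

    public static void main(String[] args) {
        Map<String, String> pages = indexByTruncatedName(
                "scan.0000.tif", "scan.0001.tif", "scan.0002.tif", "scan.0003.tif");
        System.out.println(pages); // prints "{scan=scan.0003.tif}"
    }
}
```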

bertsky · 2023-02-08T21:02:52Z

@maxnth low priority really? Goobi and Kitodo for example produce paths like AlbuRounC_1666480371_04150_tif/jpegs/00000001.tif.small.jpg all the time. These nested suffixes still break here.

chaddy314 self-assigned this Aug 30, 2021

maxnth added the Type: Bug (Indicates an unexpected problem or unintended behavior.) label Aug 30, 2021

This was referenced Sep 18, 2021

Hotfix/ocr4all interface #287

Closed

small fixes for OCR4all interface #288

Merged

maxnth added the Status: On Hold (Indicates that work on the issue was put on hold) and Priority: Low labels Mar 14, 2022

maxnth added the Priority: High and Priority: Critical labels and removed the Priority: Low and Priority: High labels Jun 2, 2023
