existing segmentation does not load #280

Open
bertsky opened this issue Aug 30, 2021 · 10 comments
Labels
Priority: Critical · Status: On Hold · Type: Bug

Comments

@bertsky

bertsky commented Aug 30, 2021

I believe I found a regression in the current version (if compared to 0.6-RC1 27dc5bc): the page's existing PAGE-XML does not load, a brief warning appears (saying that segments could not be loaded), then LAREX autosegments. There is no related error in the logs / stdout.

(I did see an error message with the following stack-trace the other day, but it does not seem related, time-wise:)

29-Aug-2021 05:58:45.789 INFO [http-nio-8080-exec-1] org.apache.coyote.http11.Http11Processor.service Error parsing HTTP request header
 Note: further occurrences of HTTP request parsing errors will be logged at DEBUG level.
        java.lang.IllegalArgumentException: Invalid character found in method name [0x030x000x00/*0xe00x000x000x000x000x00Cookie: ]. HTTP method names must be tokens
                at org.apache.coyote.http11.Http11InputBuffer.parseRequestLine(Http11InputBuffer.java:419)
                at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:269)
                at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65)
                at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:893)
                at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1723)
                at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
                at java.lang.Thread.run(Thread.java:748)

@chaddy314 chaddy314 self-assigned this Aug 30, 2021
@maxnth maxnth added the Type: Bug label Aug 30, 2021
@chaddy314
Member

Does this happen in mets or legacy mode?
If it happens in mets mode: in which fileGrp (with mimeType) does it happen?

@bertsky
Author

bertsky commented Aug 30, 2021

In legacy mode (library shows flat).

Sorry, I forgot to attach example data. Here it is:
179-heizkostenabrechnung_01.06.2018-31.05.2019-page01.zip

@chaddy314
Member

chaddy314 commented Aug 30, 2021

Because of the dots in the filename (here, in the dates), LAREX treats everything after the first dot as a SubExtension and cuts it off when determining the filename of the corresponding PAGE-XML. The name it then looks for does not match the existing PAGE-XML file.

A quick workaround would be to use a different character in dates (or to rename the old PAGE-XML to match LAREX's expected format).
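
The truncation described above can be illustrated with a minimal sketch. This is not LAREX's actual code: the class and method names are hypothetical, and the `.png` extension on the example image is assumed (the attachment's real image type is not stated in this thread). It only contrasts cutting at the first dot with stripping the final extension.

```java
final class PageXmlNameSketch {

    // Mimics the behaviour described in the comment above:
    // everything after the FIRST dot is treated as extension material.
    static String cutAtFirstDot(String fileName) {
        int i = fileName.indexOf('.');
        return i < 0 ? fileName : fileName.substring(0, i);
    }

    // Alternative: strip only the final extension, keeping dots inside the name.
    static String cutAtLastDot(String fileName) {
        int i = fileName.lastIndexOf('.');
        return i <= 0 ? fileName : fileName.substring(0, i);
    }

    public static void main(String[] args) {
        // Image extension is an assumption for illustration.
        String image = "179-heizkostenabrechnung_01.06.2018-31.05.2019-page01.png";
        System.out.println(cutAtFirstDot(image) + ".xml");
        System.out.println(cutAtLastDot(image) + ".xml");
    }
}
```

With the first-dot rule, the derived PAGE-XML name stops at the first date component, which would explain why the existing segmentation is not found.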

@bertsky
Author

bertsky commented Sep 2, 2021

Ah, many thanks. I had not noticed that crucial difference from all the other files (which were fine). Indeed, the workaround is trivial.

@bertsky
Author

bertsky commented Sep 2, 2021

Still, it would be great if LAREX were smarter about suffix detection. I often have files imported from PDF or multi-page TIFF, which distinguish pages via a NAME.PAGE.tif scheme...

@bertsky
Author

bertsky commented Sep 2, 2021

> Still, it would be great if LAREX were smarter about suffix detection. I often have files imported from PDF or multi-page TIFF, which distinguish pages via a NAME.PAGE.tif scheme...

The old version already covered this case IIRC.

@maxnth
Member

maxnth commented Sep 2, 2021

> Still, it would be great if LAREX was smarter on suffix detection.

I fully agree: it's more restrictive and "complex" than it needs to be. We'll definitely look into this.

@bertsky
Author

bertsky commented Sep 2, 2021

Oh, and to make matters worse: commas are not allowed either, even in METS mode. The open dialog looks good but does not succeed, because the directory name gets split at every comma and only the last part survives. For the path Nachrichten_aus_der_Bruder-Gemeine,_1819,_No._01 and fileGrp TEXT, this yields:

java.io.FileNotFoundException: /usr/local/tomcat/_No._01/TEXT/TEXT_0001.xml (No such file or directory)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at java.io.FileInputStream.<init>(FileInputStream.java:93)
        at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
        at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:623)
        at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:148)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:806)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:177)
        at de.uniwue.web.io.MetsReader.parseXML(MetsReader.java:95)
        at de.uniwue.web.io.MetsReader.getImagePathFromPage(MetsReader.java:102)
        at de.uniwue.web.controller.ViewerController.direct(ViewerController.java:119)
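
The path in the exception is consistent with a naive split on commas. The following sketch reproduces that effect and shows one possible way around it, assuming (this is a guess, not confirmed in this thread) that several values are passed as one comma-joined string somewhere between the open dialog and the reader. Each value is percent-encoded before joining, so a comma inside a value can no longer act as a separator. The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CommaSafeListSketch {

    // Percent-encode every value, then join with commas.
    static String join(List<String> values) {
        List<String> encoded = new ArrayList<>();
        try {
            for (String v : values) {
                encoded.add(URLEncoder.encode(v, "UTF-8")); // "," becomes "%2C"
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
        return String.join(",", encoded);
    }

    // Split on the separator commas, then decode each value.
    static List<String> split(String joined) {
        List<String> decoded = new ArrayList<>();
        try {
            for (String part : joined.split(",")) {
                decoded.add(URLDecoder.decode(part, "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
        return decoded;
    }

    public static void main(String[] args) {
        String bookDir = "Nachrichten_aus_der_Bruder-Gemeine,_1819,_No._01";

        // Naive handling: only the last fragment survives, as in the stack trace.
        String[] naive = bookDir.split(",");
        System.out.println(naive[naive.length - 1]);

        // Encoded round trip keeps the directory name intact.
        System.out.println(split(join(Arrays.asList(bookDir, "TEXT"))));
    }
}
```

Passing values as separate request parameters or as a JSON array would avoid the separator problem altogether; the encoding approach is shown only because it changes the least.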

@bertsky
Author

bertsky commented Nov 10, 2021

Another effect of additional dots/commas in filenames, besides the segmentation not loading (now fixed?) and the open dialog not completing (commas in bookdir?), is that only the last page of each subset shows up (e.g. only *.0003.tif if you actually have *.0000.tif up to *.0003.tif).
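
One plausible explanation for this symptom, offered as an assumption rather than a diagnosis of LAREX's code, is that pages are indexed by their truncated name: if the key is cut at the first dot, all files of a subset share one key and each later file replaces the earlier one. The sketch below uses hypothetical names and made-up example files.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PageKeyCollisionSketch {

    // Index files by a key derived from the file name.
    // cutAtFirstDot = true mimics the truncation discussed in this issue.
    static Map<String, String> index(List<String> files, boolean cutAtFirstDot) {
        Map<String, String> pages = new LinkedHashMap<>();
        for (String f : files) {
            int i = cutAtFirstDot ? f.indexOf('.') : f.lastIndexOf('.');
            String key = i < 0 ? f : f.substring(0, i);
            pages.put(key, f); // a later file silently replaces an earlier one with the same key
        }
        return pages;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
                "scan.0000.tif", "scan.0001.tif", "scan.0002.tif", "scan.0003.tif");
        System.out.println(index(files, true));  // one entry, pointing at the last file
        System.out.println(index(files, false)); // one entry per page
    }
}
```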

@maxnth maxnth added the Status: On Hold and Priority: Low labels Mar 14, 2022
@bertsky
Author

bertsky commented Feb 8, 2023

@maxnth low priority really? Goobi and Kitodo for example produce paths like AlbuRounC_1666480371_04150_tif/jpegs/00000001.tif.small.jpg all the time. These nested suffixes still break here.
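
For nested suffixes like these, one conceivable approach is to strip trailing suffixes only while they belong to a known set, so that dots inside the actual name (dates, page counters) are left alone. This is a sketch under stated assumptions, not LAREX's implementation: the class name, the method name and the contents of the suffix set are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class NestedSuffixSketch {

    // Hypothetical allowlist of image extensions and derivative markers.
    static final Set<String> KNOWN_SUFFIXES = new HashSet<>(
            Arrays.asList("tif", "tiff", "png", "jpg", "jpeg", "small"));

    // Repeatedly remove the last dot-separated suffix while it is a known one.
    static String baseName(String fileName) {
        String name = fileName;
        while (true) {
            int i = name.lastIndexOf('.');
            if (i <= 0) {
                return name; // no suffix left (or a leading dot only)
            }
            String suffix = name.substring(i + 1).toLowerCase(Locale.ROOT);
            if (!KNOWN_SUFFIXES.contains(suffix)) {
                return name; // unknown suffix: treat it as part of the name
            }
            name = name.substring(0, i);
        }
    }

    public static void main(String[] args) {
        System.out.println(baseName("00000001.tif.small.jpg"));
        System.out.println(baseName("NAME.0003.tif"));
    }
}
```

The trade-off is that the allowlist has to be maintained, but unlike a first-dot cut it never discards parts of the name it does not recognise.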


3 participants