TimeMachine: xml2sql Java.lang.NullPointerException #225

FelixDrinkall opened this issue Aug 16, 2023 · 2 comments

@FelixDrinkall

I am getting the following error:

INFO [main] (Log4jLogger.java:28) - Pagelinks 1527700000
INFO [main] (Log4jLogger.java:28) - Processing the text table
Exception in thread "xml2sql" java.lang.NullPointerException
    at de.tudarmstadt.ukp.wikipedia.wikimachine.dump.sql.SQLEscape.escape(SQLEscape.java:37)
    at de.tudarmstadt.ukp.wikipedia.timemachine.dump.xml.TextWriter.writeRevision(TextWriter.java:55)
    at de.tudarmstadt.ukp.wikipedia.mwdumper.importer.PageFilter.writeRevision(PageFilter.java:67)
    at de.tudarmstadt.ukp.wikipedia.wikimachine.dump.xml.AbstractXmlDumpReader.closeRevision(AbstractXmlDumpReader.java:548)
    at de.tudarmstadt.ukp.wikipedia.wikimachine.dump.xml.AbstractXmlDumpReader.endElement(AbstractXmlDumpReader.java:338)
    at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:610)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1718)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2883)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:534)
    at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
    at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
    at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1216)
    at java.xml/com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:635)
    at java.xml/com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:324)
    at java.xml/javax.xml.parsers.SAXParser.parse(SAXParser.java:197)
    at de.tudarmstadt.ukp.wikipedia.wikimachine.dump.xml.AbstractXmlDumpReader.readDump(AbstractXmlDumpReader.java:205)
    at de.tudarmstadt.ukp.wikipedia.timemachine.dump.xml.XMLDumpTableInputStreamThread.run(XMLDumpTableInputStreamThread.java:90)
INFO [main] (Log4jLogger.java:28) - Write end dead

java.base/java.io.PipedInputStream.read(PipedInputStream.java:310)
java.base/java.io.PipedInputStream.read(PipedInputStream.java:377)
java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:252)
java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:271)
de.tudarmstadt.ukp.wikipedia.timemachine.dump.xml.XMLDumpTableInputStream.read(XMLDumpTableInputStream.java:83)
java.base/java.io.DataInputStream.readInt(DataInputStream.java:392)
de.tudarmstadt.ukp.wikipedia.wikimachine.util.UTFDataInputStream.readUTFAsArray(UTFDataInputStream.java:73)
de.tudarmstadt.ukp.wikipedia.wikimachine.dump.xml.TextParser.next(TextParser.java:69)
de.tudarmstadt.ukp.wikipedia.wikimachine.domain.DumpVersionProcessor.processText(DumpVersionProcessor.java:153)
de.tudarmstadt.ukp.wikipedia.timemachine.domain.TimeMachineGenerator.processInputDumps(TimeMachineGenerator.java:133)
de.tudarmstadt.ukp.wikipedia.timemachine.domain.TimeMachineGenerator.start(TimeMachineGenerator.java:109)
de.tudarmstadt.ukp.wikipedia.timemachine.domain.JWPLTimeMachine.main(JWPLTimeMachine.java:83)

The command is:
java -Djdk.xml.totalEntitySizeLimit=2147483647 -Xmx512m -cp ".:./log4j.properties:./*" de.tudarmstadt.ukp.wikipedia.timemachine.domain.JWPLTimeMachine config.xml

The config.xml file is:

This a configuration for the JWPL TimeMachine
english
Contents
Disambiguation_pages
20060101000000
20060102000000
1
/nethome/felixd/wiki_timemachine/wiki_raw/enwiki-latest-pages-meta-history1.xml-p1p844.bz2
/nethome/felixd/wiki_timemachine/wiki_raw/enwiki-latest-categorylinks.sql.gz
/nethome/felixd/wiki_timemachine/wiki_raw/enwiki-latest-pagelinks.sql.gz
/nethome/felixd/wiki_timemachine/wiki_formatted
false

The command seems to write the first wiki entry and then fail. In my output directory, I have a PageMapLine.txt with the following entry:
11286 Fruitarianism 11286 NULL NULL

And I have a Page.txt file:
11286 11286 Fruitarianism [[Image:Fruit.jpg|frame|right|A selection...............

What is going on? Should I edit the SQLEscape file in the jar?
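
For illustration only: the trace shows the exception being thrown inside SQLEscape.escape when it is called from TextWriter.writeRevision, which would be consistent with a revision whose text is null, although nothing in the log confirms that. Under that assumption, a minimal sketch of a null-safe guard placed in front of an escaper could look like the following. The class `NullSafeEscape` and the stand-in `demoEscape` are hypothetical names and are not part of JWPL; the real SQLEscape implementation may behave differently.

```java
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of JWPL: guards an escaper against null input.
public final class NullSafeEscape {
    private NullSafeEscape() {}

    // Returns an empty string for null text (assumption: a revision without
    // text can be written as empty); otherwise delegates to the escaper.
    public static String escape(String text, UnaryOperator<String> escaper) {
        if (text == null) {
            return "";
        }
        return escaper.apply(text);
    }

    // Stand-in escaper for the demo only; this is NOT JWPL's SQLEscape.
    // It backslash-escapes backslashes, single quotes, and newlines.
    public static String demoEscape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '\'': sb.append("\\'"); break;
                case '\n': sb.append("\\n"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A null revision text no longer throws.
        System.out.println(escape(null, NullSafeEscape::demoEscape).isEmpty());
        // Non-null text is escaped as usual.
        System.out.println(escape("it's", NullSafeEscape::demoEscape));
    }
}
```

A guard like this would belong in the caller (or in a patched build of the library), not in a hand-edited class file inside the jar; whether an empty string is the right substitute for a missing revision text is itself an assumption.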

@zesch
Member

zesch commented Aug 17, 2023

Please note that this library is no longer really maintained and has not been tested with newer dumps for at least 5 years.
That said, it won't work with -Xmx512m; you will need at least 4g, if not more, and a lot of additional hard disk space.
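
As a rough way to confirm how much heap the JVM was actually given, the limit can be read at runtime. This is a minimal sketch, not something JWPL provides; the class name `HeapCheck` is hypothetical, and the 4 GiB threshold simply mirrors the figure mentioned above. Note that `Runtime.maxMemory()` can report a value somewhat below the `-Xmx` setting depending on the garbage collector, so an exact comparison against 4 GiB may be too strict.

```java
// Hypothetical preflight check, not part of JWPL.
public final class HeapCheck {
    private HeapCheck() {}

    // 4 GiB in bytes, mirroring the "at least 4g" advice.
    public static final long FOUR_GIB = 4L * 1024 * 1024 * 1024;

    // True if the JVM's maximum heap is at least the requested size.
    public static boolean hasAtLeast(long requiredBytes) {
        return Runtime.getRuntime().maxMemory() >= requiredBytes;
    }

    public static void main(String[] args) {
        long maxMiB = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap (MiB): " + maxMiB);
        if (!hasAtLeast(FOUR_GIB)) {
            System.out.println("Heap is below 4 GiB; consider a larger -Xmx value.");
        }
    }
}
```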

@reckart
Member

reckart commented Oct 20, 2023

@FelixDrinkall JWPL is just getting an update - you might check if the current main branch works for your use-case. Note though that package names and groupId/artifactIds have changed... the next version will be 2.0.0 with breaking changes.
