XMLStreamException exception when parsing revision dumps #243

Open
Tpt opened this issue Sep 24, 2016 · 3 comments

@Tpt
Collaborator

Tpt commented Sep 24, 2016

When parsing some big revision dumps this error is raised:

sept. 23, 2016 7:05:46 PM org.wikidata.wdtk.dumpfiles.MwRevisionDumpFileProcessor processDumpFileContents
SEVERE: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[291117,14066]
Message: JAXP00010004: The accumulated size of entities is "50 000 001" that exceeded the "50 000 000" limit set by "FEATURE_SECURE_PROCESSING".

This is probably because the accumulated size of the entities in the XML file exceeds the limit set by XML secure processing. We should probably set the XMLConstants.FEATURE_SECURE_PROCESSING option to false, since Wikimedia dumps can be trusted.
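A minimal sketch of what that proposal could look like with the StAX API, which is what raises XMLStreamException. This is untested against real dumps: the class name RelaxedStaxFactory and its methods are hypothetical, not part of WDTK, and whether a given StAX implementation accepts XMLConstants.FEATURE_SECURE_PROCESSING as a factory property (and lifts the entity limit in response) is an assumption, so the code ignores the rejection instead of failing.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of WDTK.
final class RelaxedStaxFactory {

    // Creates a StAX factory and tries to switch secure processing off.
    static XMLInputFactory create() {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try {
            // Assumption: the implementation recognises this property name.
            factory.setProperty(XMLConstants.FEATURE_SECURE_PROCESSING, Boolean.FALSE);
        } catch (IllegalArgumentException e) {
            // Property not supported here; the factory keeps its defaults.
        }
        return factory;
    }

    // Returns the local name of the first element, or null if there is none.
    static String rootElementName(String xml) {
        try {
            XMLStreamReader reader = create().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getLocalName();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootElementName("<mediawiki><page/></mediawiki>"));
    }
}
```

Only trusted input such as the Wikimedia dumps should go through a factory configured this way, since the limit exists to guard against malicious XML.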

@jimkont

jimkont commented Oct 21, 2016

Maybe our solution works here as well:
dbpedia/extraction-framework#487 (comment)

@mkroetzsch
Member

@jimkont Thanks for pointing this out.

@Tpt If you have time to try this, it would be very much appreciated.

@Tpt
Collaborator Author

Tpt commented Nov 6, 2016

It works great with the parameter -Djdk.xml.entityExpansionLimit=2147480000 when running JARs that use WDTK.

Maybe we should find a way to integrate this into WDTK so that we don't have to set this parameter every time.
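One possible way to do that, sketched under assumptions: set the same system property programmatically at startup, unless the user has already supplied it with -D. The class XmlEntityLimits and its method are hypothetical names, not existing WDTK code, and the call presumably has to run before the first XML factory or reader is created for the value to be picked up.

```java
// Hypothetical helper, not part of WDTK.
final class XmlEntityLimits {

    // Property name and value taken from the -D parameter quoted above.
    static final String ENTITY_EXPANSION_LIMIT = "jdk.xml.entityExpansionLimit";
    static final String RAISED_LIMIT = "2147480000";

    // Sets the raised limit unless a value is already configured,
    // and returns the value that is in effect afterwards.
    static String ensureEntityExpansionLimit() {
        String existing = System.getProperty(ENTITY_EXPANSION_LIMIT);
        if (existing != null) {
            // Respect an explicit -Djdk.xml.entityExpansionLimit=... setting.
            return existing;
        }
        System.setProperty(ENTITY_EXPANSION_LIMIT, RAISED_LIMIT);
        return RAISED_LIMIT;
    }

    public static void main(String[] args) {
        // Call this early, before any dump file is opened.
        System.out.println(ENTITY_EXPANSION_LIMIT + "=" + ensureEntityExpansionLimit());
    }
}
```

Leaving a user-supplied value untouched keeps the command-line parameter usable as an override.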

Tpt added a commit to askplatypus/platypus-kb-lucene that referenced this issue Jan 17, 2017