
The accumulated size of entities is "50,000,001" that exceeded the "50,000,000" limit set by "FEATURE_SECURE_PROCESSING" #487

Open · EllysahatNMI opened this issue Oct 20, 2016 · 20 comments

@EllysahatNMI

Hi! I followed the step-by-step instructions but encountered this error after running
../clean-install-run
I used enwiki-20161001-pages-articles.xml.bz2 and I know it's a huge file, but how do I get around this error?
I tried putting -DentityExpansionLimit=2147480000 in clean-install-run like this:
mvn -DentityExpansionLimit=2147480000 ...
but I still get the same error. Please help me.
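
For context: the "accumulated size of entities" message corresponds to the JAXP total-entity-size limit (50,000,000 by default), which is controlled by the jdk.xml.totalEntitySizeLimit system property; entityExpansionLimit caps the number of entity expansions, not their accumulated size, so raising it does not help here. A minimal sketch of setting the relevant property on a directly launched JVM (the jar and config names below are placeholders, not the framework's actual artifacts):

    # Raise (or disable with 0) the accumulated-entity-size limit for the XML parser.
    # "extraction.jar" and "config.properties" are placeholders for illustration only.
    java -Djdk.xml.totalEntitySizeLimit=2147480000 -jar extraction.jar config.properties

When the JVM is launched indirectly through Maven, as the run scripts here do, the property additionally has to reach that forked JVM, which is what the rest of this thread works out.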

@jimkont (Member) commented Oct 20, 2016

I also noticed this problem at some point, but I thought it was a problem with the dump.

I see a few references around like
elastic/stream2es#65
Wikidata/Wikidata-Toolkit#243
Wikidata/Wikidata-Toolkit#244

Can you test the following? Edit the run script, add the argument described in this comment, and see if it works:
elastic/stream2es#65 (comment)

@EllysahatNMI (Author) commented Oct 20, 2016

I already tried putting jdk.xml.totalEntitySizeLimit and totalEntitySizeLimit as indicated in elastic/stream2es#65 (comment), but I still get the same error.
It stops when the import has reached around 600,000+ pages.

I edited the clean-install-run script like this:

mvn -Djdk.xml.totalEntitySizeLimit=2147480000 $BATCH -f ../pom.xml clean && . ../install-run "$@"

I'm out of ideas. Please help.

@jimkont (Member) commented Oct 20, 2016

Can you edit the run script? clean-install-run calls install-run, which in turn calls run.

@EllysahatNMI (Author)

Hi. Sorry, I'm a little confused about where I should put it in the run script.

This is how I did it:
mvn $MAVEN_DEBUG $BATCH scala:run "-Dlauncher=$LAUNCHER" "-DaddArgs=$ADD_ARGS" "-Djdk.xml.totalEntitySizeLimit=2147480000"

Is it correct?

@jimkont (Member) commented Oct 20, 2016 via email

@EllysahatNMI (Author)

I tried doing as you said, but the same error appears.

I edited https://github.com/dbpedia/extraction-framework/blob/master/run#L45 like this:
mvn $MAVEN_DEBUG -DtotalEntitySizeLimit=2147480000 -Djdk.xml.totalEntitySizeLimit=2147480000 $BATCH scala:run "-Dlauncher=$LAUNCHER" "-DaddArgs=$ADD_ARGS"

@jimkont (Member) commented Oct 20, 2016

Sorry again, let's make a final check.

Try putting it here: https://github.com/dbpedia/extraction-framework/blob/master/dump/pom.xml#L64-L71
This is a JVM argument, so we should pass it either from the pom.xml or via MAVEN_OPTS; I'm not sure whether passing it directly to Maven would work.
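
For reference, a minimal sketch of the MAVEN_OPTS route (assuming the stock run scripts; whether the property actually reaches the extraction process depends on whether scala:run forks a separate JVM):

    # Set the JAXP limit for the Maven JVM before invoking the build/run scripts.
    export MAVEN_OPTS="-Djdk.xml.totalEntitySizeLimit=2147480000"
    ../clean-install-run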

@EllysahatNMI (Author)

Still the same error when I put this:
-DtotalEntitySizeLimit=2147480000
-Djdk.xml.totalEntitySizeLimit=2147480000

@EllysahatNMI (Author) commented Oct 21, 2016

I put it at these lines:
https://github.com/dbpedia/extraction-framework/blob/master/dump/pom.xml#L39-L41

Here's the snippet:

<jvmArgs>
    <jvmArg>-server</jvmArg>
    <jvmArg>-DtotalEntitySizeLimit=2147480000</jvmArg>
    <jvmArg>-Djdk.xml.totalEntitySizeLimit=2147480000</jvmArg>
</jvmArgs>

And it worked! Thank you so much! :)

@jimkont (Member) commented Oct 21, 2016

Great! Can you do a final check to see which of the two arguments is the one that's needed?
You can either tell us here or make a PR directly.

Thanks!

@chile12 (Contributor) commented Oct 30, 2016

Thanks guys!
This problem also popped up when extracting abstracts.

@clanstyles

I'm trying to use 'download.10000.properties' and then 'extraction.default.properties'.

I'm receiving the same error.

Message: JAXP00010004: The accumulated size of entities is "50,000,001" that exceeded the "50,000,000" limit set by "FEATURE_SECURE_PROCESSING".

I've tried the solutions above; I set extraction-framework/dump/pom.xml's jvmArgs:

<mainClass>org.dbpedia.extraction.dump.sql.Import</mainClass>
<jvmArgs>
    <jvmArg>-server</jvmArg>
    <jvmArg>-DtotalEntitySizeLimit=0</jvmArg>
    <jvmArg>-Djdk.xml.totalEntitySizeLimit=0</jvmArg>
</jvmArgs>

No luck fixing it. I then tried modifying the ../run script:

if [[ $SLACK != false && $SlackUrl == https://hooks.slack.com/services* ]] ;
then
  #mvn $MAVEN_DEBUG $BATCH scala:run "-Dlauncher=SlackForwarder"  "-DaddArgs=$SlackUrl|$SlackRegexMap|$LogDir|$1|$SLACK" &
  sleep 5
  PID="$(ps ax | grep java | grep extraction.scripts.SlackForwarder | tail -1 | sed -n -E 's/([0-9]+).*/\1/p' | xargs)"
  echo $PID
  mvn $MAVEN_DEBUG -B scala:run "-Djdk.xml.totalEntitySizeLimit=0" "-Dlauncher=$LAUNCHER" "-DaddArgs=$ADD_ARGS" &> "/proc/$PID/fd/0"
else
  mvn $MAVEN_DEBUG $BATCH scala:run "-Djdk.xml.totalEntitySizeLimit=0" "-Dlauncher=$LAUNCHER" "-DaddArgs=$ADD_ARGS"
fi

You'll notice I added "-Djdk.xml.totalEntitySizeLimit=0". According to the docs, 0 is supposed to set it to unlimited. I did try it with the limits you listed above; that didn't work either.
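
One way to check whether the property actually reached the import JVM (rather than only the Maven front end) is to inspect the live process with the standard JDK tools; a small sketch, with <pid> as a placeholder:

    # List running Java processes with their main class, then dump the chosen one's system properties.
    jps -l
    jinfo -sysprops <pid> | grep -i totalEntitySizeLimit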

@chile12 (Contributor) commented Nov 2, 2016

Interesting, I just ran an import over 130 languages without a hitch.
Here is the launcher setting I used:

                    <launcher>
                        <id>import</id>
                        <mainClass>org.dbpedia.extraction.dump.sql.Import</mainClass>
                        <jvmArgs>
                            <jvmArg>-server</jvmArg>
                            <jvmArg>-DtotalEntitySizeLimit=2147480000</jvmArg>
                            <jvmArg>-Djdk.xml.totalEntitySizeLimit=2147480000</jvmArg>
                        </jvmArgs>
                        <args>
                            <!-- base folder of downloaded dumps -->
                            <arg>/data/extraction-data/2016-04</arg>
                            <!-- location of SQL file containing MediaWiki table definitions -->
                            <arg>/home/extractor/mediawikiTables.sql</arg>
                            <!-- JDBC URL of MySQL server. Import creates a new database for 
                                each wiki. -->
                            <arg>jdbc:mysql://localhost/?characterEncoding=UTF-8&amp;user=root</arg>
                            <!-- require-download-complete -->
                            <arg>true</arg> 
                            <!-- file name: pages-articles.xml{,.bz2,.gz} -->
                            <arg>pages-articles.xml.bz2</arg>
                            <!-- number of parallel imports; this number depends on the number of processors in use
                                and the type of hard disc (hhd/ssd) and how many parallel file reads it can support -->
                            <arg>16</arg>
                            <!-- languages and article count ranges, comma-separated, e.g. "de,en" 
                                or "@mappings" etc. -->
                            <arg>@downloaded</arg>
                        </args>
                    </launcher>

@clanstyles

Was it Java 7 or 8? I was using Java 8, and I read that this limit was added in Java 8. I'm now testing with Java 7.

@chile12 (Contributor) commented Nov 3, 2016

I'm using Java 8 as well. Is this still causing problems for you?

@clanstyles commented Nov 4, 2016

@chile12 I'm still processing all of Wikipedia (2 days later), but it's working with no issues. I switched to Java 7.

@EllysahatNMI (Author) commented Nov 4, 2016

Hi guys. Sorry for the late update.

So I just confirmed that this argument is the correct one:
-Djdk.xml.totalEntitySizeLimit=2147480000

<launcher>
    <id>import</id>
    <mainClass>org.dbpedia.extraction.dump.sql.Import</mainClass>
    <jvmArgs>
        <jvmArg>-server</jvmArg>
        <jvmArg>-Djdk.xml.totalEntitySizeLimit=2147480000</jvmArg>
    </jvmArgs>
    <args>
        <!-- base folder of downloaded dumps -->
        <arg>/data/extraction-data/2016-04</arg>
        <!-- location of SQL file containing MediaWiki table definitions -->
        <arg>/home/extractor/mediawikiTables.sql</arg>
        <!-- JDBC URL of MySQL server. Import creates a new database for each wiki. -->
        <arg>jdbc:mysql://localhost/?characterEncoding=UTF-8&amp;user=root</arg>
        <!-- require-download-complete -->
        <arg>true</arg>
        <!-- file name: pages-articles.xml{,.bz2,.gz} -->
        <arg>pages-articles.xml.bz2</arg>
        <!-- number of parallel imports; this number depends on the number of processors in use
             and the type of hard disc (hhd/ssd) and how many parallel file reads it can support -->
        <arg>16</arg>
        <!-- languages and article count ranges, comma-separated, e.g. "de,en" or "@mappings" etc. -->
        <arg>@downloaded</arg>
    </args>
</launcher>

@roland-c

I ran into this problem with the Wikidata extractor with these settings:

<jvmArgs>
    <jvmArg>-server</jvmArg>
    <jvmArg>-DtotalEntitySizeLimit=2147480000</jvmArg>
    <jvmArg>-Djdk.xml.totalEntitySizeLimit=2147480000</jvmArg>
</jvmArgs>

It is possible to remove this limitation by setting the value to 0. The extractor runs fine now.
http://www.ibm.com/support/knowledgecenter/SSYKE2_7.0.0/com.ibm.java.aix.70.doc/diag/appendixes/cmdline/Djdkxmltotalentitysizelimit.html

@chile12 (Contributor) commented Jan 26, 2017

Thanks for the update, Roland. It would be best to integrate this into all the POM files.
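
A quick way to find the launcher configurations that would need the extra jvmArg across the repository is a simple grep (assuming a checkout of extraction-framework):

    # List every pom.xml that defines jvmArgs for a launcher, so the limit can be added to each.
    grep -rl "<jvmArgs>" --include=pom.xml .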

james-gould added a commit to james-gould/mediawiki-tools-mwdumper that referenced this issue Mar 24, 2017
ERROR: The accumulated size of entities is "50,000,001" that exceeded the "50,000,000" limit set by "FEATURE_SECURE_PROCESSING"

Fixes several parsing issues raised over the previous 8 years. 

[1] https://www.mediawiki.org/wiki/Manual_talk:MWDumper#Exception_in_thread_.22main.22_java.lang.ArrayIndexOutOfBoundsException:_2048

[2] dbpedia/extraction-framework#487 (comment)
@manzoorali29

Can anyone tell me the final solution? I'm still stuck on this and have tried all the solutions above, but no luck. Thanks.
