Running BIUTEE (Detailed)

Table of Contents

Defining Environment Variables
Running Scenarios
Full Running Option
JVM Parameters
Configuration File
Log File

Defining Environment Variables

Create an environment variable called DATA and set it to the path of the data directory in the BIUTEE Environment. On a UNIX system using the bash shell, you can do this with the command below (your-path is the path of the data directory in the BIUTEE Environment):

export DATA="your-path"

From now on, you should see this path when you enter the following command in that terminal:

echo $DATA

You may use these external tutorials for defining environment variables in Windows and in Linux.
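
As a quick sanity check of the variable defined above, the following minimal sketch verifies that a given value is non-empty and points to an existing directory. The function name `check_data_dir` is a hypothetical helper introduced here for illustration; it is not part of BIUTEE.

```shell
# check_data_dir DIR: succeed only if DIR is non-empty and an existing directory.
# (Hypothetical helper, not part of BIUTEE.)
check_data_dir() {
  dir="$1"
  if [ -z "$dir" ]; then
    echo "DATA is not set" >&2
    return 1
  fi
  if [ ! -d "$dir" ]; then
    echo "DATA does not point to a directory: $dir" >&2
    return 1
  fi
  echo "DATA is set to: $dir"
}

# Usage, in the terminal where DATA was exported:
#   check_data_dir "$DATA"
```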

Running Scenarios

BIUTEE can be run via two interfaces:

EOP Interface, accessing LAP and EDA. Currently this entire interface is provided via the class `eu.excitementproject.eop.biutee.rteflow.systems.excitement.BiuteeMain`.
Stand-alone Interface, accessing proprietary classes for preprocessing, training and testing.

BIUTEE can be run on these kinds of input:

RTE Pairs - used in the RTE 1-5 main task. It is formatted as an XML file, consisting of a sequence of text-hypothesis pairs. This is the most common kind of input; if you are not sure what to use, use this kind.
RTE Sum - used in RTE 6-7. It is formatted as a folder, with topics, where each topic has documents and hypotheses. To train and test on this input, it must first be indexed (as described later in the steps table).

In order to run BIUTEE, you can choose between two running options:

Quick Running Option - a single script for the entire system's execution. Recommended for first-time users.
Full Running Option - a series of command line executions for running the system at a more fine-grained level.

Full Running Option

The following table describes how to run BIUTEE via command line, in different scenarios. The steps are presented in the order in which they should be run.

Note that you must follow only one specific scenario. For example, if you wish to run via the EOP interface and use RTE Pairs input, follow only the EOP+Pairs rows (and the ALL rows, which apply to all scenarios). According to this, you should be running steps: 1, 3, 4, 5, 10, 11.

For further details regarding running EOP in general via command line, see here.
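
The scenario-to-steps selection described above can be sketched as a small lookup, derived by reading the Scenario column of the table below and adding the ALL rows (1, 3, 4, 10). The function name `steps_for` and the scenario labels `eop-pairs`, `standalone-pairs` and `standalone-sum` are hypothetical names chosen for this illustration.

```shell
# steps_for SCENARIO: print the table's step numbers for one scenario,
# including the ALL rows (1, 3, 4, 10), in the order they should be run.
# The scenario labels are hypothetical; they are not BIUTEE options.
steps_for() {
  case "$1" in
    eop-pairs)        echo "1 3 4 5 10 11" ;;
    standalone-pairs) echo "1 3 4 6 7 10 12 13" ;;
    standalone-sum)   echo "1 2 3 4 8 9 10 14 15" ;;
    *) echo "unknown scenario: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   steps_for eop-pairs
```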

#	Scenario	Step	Command	Notes
1	ALL	Configure general system parameters	Edit configuration file `biutee.xml`	Parameters like number of threads and knowledge resources. More details here.
2	Stand-alone + Sum	Perform indexing	Refer to http://cs.biu.ac.il/~nlp/downloads/biutee	To perform this step, please download BIUTEE's previous version and follow the required steps in the user guide (Section 1.7 and Section 3).
3	ALL	Run EasyFirst parser server	Windows: `runeasyfirst.bat` Linux: `runeasyfirst.sh`	Must be run in a separate command line window from BIUTEE. The server must be running at least while BIUTEE's LAP/preprocessing is running, but may be left running at other times. The same EasyFirst run may be used for multiple runs of BIUTEE. You may want to shut it down when it is not required, to conserve system resources. Do so by pressing `Ctrl-C`.
4	ALL	Configure training parameters	Edit configuration file	Mostly set the dataset to be the devset. More details here.
5	EOP + Pairs	Preprocess training data + Train	`mvn -f $EOP/biutee/pom.xml exec:java -Dexec.mainClass=eu.excitementproject.eop.biutee.rteflow.systems.excitement.BiuteeMain -Dexec.args="biutee.xml lap_train,train"`	In order to just preprocess, instead of `lap_train,train` provide only `lap_train`. Similarly, in order to just train, provide only `train`. [1] The preprocessing output is a Java-serialized file, with a name and path determined by the configuration parameter `rte_pairs_preprocess/serialization_filename`. [2] The training output is several Java-serialized files named `labeled_samplesX.ser` and `serialized_resultsX.ser`, and some XML files named `model_search_X.xml` and `model_predictions_X.xml`.
6	Stand-alone + Pairs	Preprocess training data	`mvn -f $EOP/biutee/pom.xml exec:java -Dexec.mainClass=eu.excitementproject.eop.biutee.rteflow.systems.rtepairs.RTEPairsPreProcessor -Dexec.args="biutee.xml train"`	[1]
7	Stand-alone + Pairs	Train	`mvn -f $EOP/biutee/pom.xml exec:java -Dexec.mainClass=eu.excitementproject.eop.biutee.rteflow.systems.rtepairs.RTEPairsETETrainer -Dexec.args="biutee.xml"`	[2]
8	Stand-alone + Sum	Preprocess training data	`mvn -f $EOP/biutee/pom.xml exec:java -Dexec.mainClass=eu.excitementproject.eop.biutee.rteflow.systems.rtesum.preprocess.RTESumPreProcessor -Dexec.args="biutee.xml"`	[3] The preprocessing output is a Java-serialized file, with a name and path determined by the configuration parameter `rte_sum_preprocess/serialization_filename`.
9	Stand-alone + Sum	Train	`mvn -f $EOP/biutee/pom.xml exec:java -Dexec.mainClass=eu.excitementproject.eop.biutee.rteflow.systems.rtesum.RTESumETETrainer -Dexec.args="biutee.xml"`	[2]
10	ALL	Configure testing parameters	Edit configuration file	Mostly set the dataset to be the testset. More details here.
11	EOP + Pairs	Preprocess testing data + Test	`mvn -f $EOP/biutee/pom.xml exec:java -Dexec.mainClass=eu.excitementproject.eop.biutee.rteflow.systems.excitement.BiuteeMain -Dexec.args="biutee.xml lap_test,test"`	In order to just preprocess, instead of `lap_test,test` provide only `lap_test`. Similarly, in order to just test, provide only `test`. The preprocessing output is a series of XMI files in the folder `$BIUTEE/workdir/lap_output`. Each XMI is a dump of the UIMA-CAS of one text-hypothesis pair. [4] The test output is written in the log file `logfile.log`.
12	Stand-alone + Pairs	Preprocess testing data	`mvn -f $EOP/biutee/pom.xml exec:java -Dexec.mainClass=eu.excitementproject.eop.biutee.rteflow.systems.rtepairs.RTEPairsPreProcessor -Dexec.args="biutee.xml test"`	[1]
13	Stand-alone + Pairs	Test	`mvn -f $EOP/biutee/pom.xml exec:java -Dexec.mainClass=eu.excitementproject.eop.biutee.rteflow.systems.rtepairs.RTEPairsETETester -Dexec.args="biutee.xml"`	[4]
14	Stand-alone + Sum	Preprocess testing data	`mvn -f $EOP/biutee/pom.xml exec:java -Dexec.mainClass=eu.excitementproject.eop.biutee.rteflow.systems.rtesum.preprocess.RTESumPreProcessor -Dexec.args="biutee.xml"`	[3]
15	Stand-alone + Sum	Test	`mvn -f $EOP/biutee/pom.xml exec:java -Dexec.mainClass=eu.excitementproject.eop.biutee.rteflow.systems.rtesum.RteSumETETester -Dexec.args="biutee.xml"`	[4]

NOTES:

All commands must be run from $BIUTEE/workdir. Change to that directory with the cd command, for example: cd C:\Biutee\workdir.
For the mvn commands to work, the Maven executable must be in your system path. If it is not, add it, or provide the full path to it in the commands.
In order to run via Eclipse IDE, perform the specified steps by running each class denoted by -Dexec.mainClass=, with program arguments denoted by -Dexec.args= (without the enclosing quotation marks), and working directory $BIUTEE/workdir.
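
Since every command in the table has the same shape and differs only in the main class and the arguments, the pattern can be sketched as a small helper that prints the command line instead of running it. The function name `biutee_mvn_cmd` and the variable `BIUTEE_MAIN` are hypothetical names introduced for this illustration; the class name and arguments are taken from the table.

```shell
# biutee_mvn_cmd EOP_DIR MAIN_CLASS ARGS: print (do not run) the mvn command
# line used in the table, for the given main class and program arguments.
# (Hypothetical helper, not part of BIUTEE.)
biutee_mvn_cmd() {
  printf 'mvn -f %s/biutee/pom.xml exec:java -Dexec.mainClass=%s -Dexec.args="%s"\n' "$1" "$2" "$3"
}

BIUTEE_MAIN=eu.excitementproject.eop.biutee.rteflow.systems.excitement.BiuteeMain

# Print the EOP + Pairs commands (steps 5 and 11). The literal string $EOP is
# passed in single quotes so the printed lines match the table.
biutee_mvn_cmd '$EOP' "$BIUTEE_MAIN" "biutee.xml lap_train,train"
biutee_mvn_cmd '$EOP' "$BIUTEE_MAIN" "biutee.xml lap_test,test"
```

Printing rather than executing keeps the sketch safe to try; the printed lines can then be run from $BIUTEE/workdir once the EasyFirst server is up.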

JVM Parameters

To improve JVM efficiency, it is recommended to run it with these JVM parameters:

-server, for using Java server VM.
-Xmx2g, for allocating 2GB of memory. Other values can be used, according to the available memory and the number of threads used. When preprocessing, at least 1.5GB must be allocated. When training and testing, at least 4GB must be allocated, plus an additional 1GB for each additional thread. For example, when using 3 threads, allocate at least 6GB.
-XX:+UseParallelGC, -XX:+UseParallelOldGC and -XX:ParallelGCThreads=α, for using parallel garbage collection with α threads. α can be set to the number of threads specified in the configuration file.

In order to specify JVM parameters, set them, separated by spaces, as the value of the environment variable MAVEN_OPTS. More details in one of the notes here.
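
The memory rule above (4GB for a single thread plus 1GB per additional thread, for training and testing) can be sketched as a helper that builds a MAVEN_OPTS value from a thread count. The function name `biutee_maven_opts` is a hypothetical name introduced here; the flags are the ones listed above. Recent JVM versions may no longer accept -XX:+UseParallelOldGC, in which case it can be dropped.

```shell
# biutee_maven_opts THREADS: print JVM options for training/testing, using
# 4GB for the first thread plus 1GB for each additional thread.
# (Hypothetical helper, not part of BIUTEE.)
biutee_maven_opts() {
  threads="$1"
  case "$threads" in
    # Reject empty, non-numeric, zero and zero-prefixed values.
    ''|*[!0-9]*|0*) echo "threads must be a positive integer" >&2; return 1 ;;
  esac
  heap_gb=$((4 + threads - 1))
  echo "-server -Xmx${heap_gb}g -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=${threads}"
}

# Usage, before running the mvn commands for training or testing:
#   export MAVEN_OPTS="$(biutee_maven_opts 3)"
```

With 3 threads this yields -Xmx6g, matching the example above. Preprocessing is single-threaded and needs less memory, so this sketch only covers training and testing.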

Configuration File

A key element in the BIUTEE environment is the configuration file, found at $BIUTEE/workdir/biutee.xml.

Most values in the configuration file can stay exactly as provided. The table below details some of the values you may wish (or need) to change.

Section	Property	Value
`rte_pairs_preprocess`	`training_data`	Path to a pairs dataset XML, for training data.
`rte_pairs_preprocess`	`training_data_annotated`	true/false - indicates whether the training dataset is annotated (has gold-standard annotations). Must be true for training.
`rte_pairs_preprocess`	`training_serialization_filename`	Path to a file where preprocessing output will be written to, for training data.
`rte_pairs_preprocess`	`test_data`	Path to a pairs dataset XML, for test data.
`rte_pairs_preprocess`	`test_data_annotated`	true/false - indicates whether the test dataset is annotated (has gold-standard annotations). If the dataset is annotated, the system will output the test accuracy at the end of the test.
`rte_pairs_preprocess`	`test_serialization_filename`	Path to a file where preprocessing output will be written to, for test data.
`rte_sum_preprocess`	`dataset`	Path to a sum dataset folder. Note that this parameter is used for both training and test.
`rte_sum_preprocess`	`serialization_filename`	Path to the file where preprocessing output will be written to. Note that this parameter is used for both training and test.
`rte_pairs_train_and_test`	`serialized_training_data`	Path to the file where preprocessing output (of the training data) was written to.
`rte_pairs_train_and_test`	`serialized_test_data`	Path to the file where preprocessing output (of the test data) was written to.
`rte_sum_train_and_test`	`training_data`	An indication of the sum training data, as 3 values connected with `#`: (1) dataset name, `RTE6` or `RTE7`; (2) type, `DEV` or `TEST`; (3) relative path to the dataset folder. For example: `RTE6#DEV#RTE6_DEVSET`
`rte_sum_train_and_test`	`serialized_training_data`	Path to the file where preprocessing output (of the training data) was written to.
`rte_sum_train_and_test`	`test_data`	An indication of the sum test data, as 3 values connected with `#`: (1) dataset name, `RTE6` or `RTE7`; (2) type, `DEV` or `TEST`; (3) relative path to the dataset folder. For example: `RTE6#TEST#RTE6_TESTSET`
`rte_sum_train_and_test`	`serialized_test_data`	Path to the file where preprocessing output (of the test data) was written to.
`rte_pairs_train_and_test`, `rte_sum_train_and_test`	`threads`	Number of threads to be used during training and testing. Preprocessing is always single-threaded. The JVM parameter `-Xmx` must be set according to the number of threads to allow a heap that is large enough. If this is not set as required, your system may run very slowly, and might crash. Usually, 4GB suffices for a single thread, plus 1GB for each additional thread.
`rte_pairs_train_and_test`, `rte_sum_train_and_test`	`gap_hybrid_mode`	true or false, indicating whether hybrid mode is active or not (default = false). Hybrid mode is a mode in which on-the-fly transformations are not performed. Instead, the system uses only reliable transformations, but does not always reach a complete proof. The gap between the partial proof result and the hypothesis is also counted in the confidence calculation. See also the next parameter: collapse-mode.
`rte_pairs_train_and_test`, `rte_sum_train_and_test`	`collapse-mode`	true or false, indicating whether all the text is treated as one parse-tree, which includes all its sentences as subtrees (default = false). The common practice is to set the same value to gap_hybrid_mode and collapse-mode (i.e., when gap_hybrid_mode is true, then collapse-mode should be true as well. When gap_hybrid_mode is false, then collapse-mode should be false as well).
`rte_pairs_train_and_test`, `rte_sum_train_and_test`	`classifier-optimization`	An optional parameter. The value can be one of the following strings: "ignore_dataset_and_optimize_accuracy" or "ignore_dataset_and_optimize_f1". This parameter controls whether the learning algorithm optimizes for accuracy or for F1. If the parameter is not set then the optimization follows the nature of the dataset. For RTE-pairs datasets the default is to optimize for accuracy, while for RTE-sum the default is to optimize for F1.
`transformations`	`knowledge_resources`	A comma-separated list of knowledge resources, out of these values: WORDNET, WIKIPEDIA, GEO, CATVAR, BAP, LIN_DEPENDENCY_ORIGINAL, LIN_PROXIMITY_ORIGINAL, LIN_DEPENDENCY_REUTERS, VERB_OCEAN, ORIG_DIRT, REVERB, BINARY_LIN, FRAMENET, SYNTACTIC, REDIS_LIN_PROXIMITY, REDIS_LIN_DEPENDENCY, REDIS_BAP, REDIS_DIRT, REDIS_REVERB. These are all values from the enum `eu.excitementproject.eop.transformations.builtin_knowledge.KnowledgeResource`.
`transformations`	`multiword_resources`	A comma-separated list of lexical knowledge resources, out of these values: WORDNET, WIKIPEDIA, CATVAR, BAP, LIN_DEPENDENCY_ORIGINAL, LIN_PROXIMITY_ORIGINAL, LIN_DEPENDENCY_REUTERS, VERB_OCEAN, REDIS_LIN_PROXIMITY, REDIS_LIN_DEPENDENCY, REDIS_BAP. Values are from the same enum with `true` in their last parameter, except for GEO (which must not be used here). For these resources, the system will handle multi-word expressions.
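
The `#`-connected values used for `training_data` and `test_data` in `rte_sum_train_and_test` can be checked before editing the configuration file. The following is a minimal sketch, assuming only the format described in the table (dataset name, type, relative path); the function name `parse_sum_dataset` and its output format are hypothetical and not part of BIUTEE.

```shell
# parse_sum_dataset VALUE: split "NAME#TYPE#PATH" and validate NAME and TYPE
# against the values listed in the table. (Hypothetical helper.)
parse_sum_dataset() {
  value="$1"
  case "$value" in
    *"#"*"#"*) ;;
    *) echo "expected NAME#TYPE#PATH, got: $value" >&2; return 1 ;;
  esac
  name="${value%%"#"*}"   # text before the first '#'
  rest="${value#*"#"}"    # text after the first '#'
  kind="${rest%%"#"*}"    # text between the first and second '#'
  path="${rest#*"#"}"     # text after the second '#'
  case "$name" in RTE6|RTE7) ;; *) echo "bad dataset name: $name" >&2; return 1 ;; esac
  case "$kind" in DEV|TEST) ;; *) echo "bad type: $kind" >&2; return 1 ;; esac
  [ -n "$path" ] || { echo "missing dataset path" >&2; return 1; }
  echo "name=$name type=$kind path=$path"
}

# Usage:
#   parse_sum_dataset 'RTE6#DEV#RTE6_DEVSET'
```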

Log File

The system uses the log4j platform for logging. A log4j properties file is automatically created under $BIUTEE/workdir/log4j.properties with recommended values. If a file under that name already exists, the system uses it instead of creating a new one. There is no need to change any of the definitions in the file, but you may do so if you wish to change the logging behavior. The log4j Manual may help with this.

Under the recommended values, a new log file is created for every run of the system in $BIUTEE/workdir/logfile.log. If this file already exists from a previous run, it is renamed to logfile.log_<date>_<time>.log.
