Using Reach from a FAT JAR
Reach can be compiled, along with all its required dependencies, into a single JVM JAR file, known as a 'FAT' JAR. To produce a default FAT JAR for Reach:
sbt assembly
By default, the Reach FAT JAR runs in 'batch' mode: processing papers from an input directory and producing output into an output directory. Both directories, and many other configuration options, may be specified in the application.conf configuration file. For more information on 'batch' mode see the
The FAT JAR may also be configured to run a different main program. For instance, to produce a FAT JAR which runs the Reach Shell, compile it like this:
sbt -DmainClass=org.clulab.reach.ReachShell assembly
To produce a FAT JAR which runs a small web service to process files, compile the FAT JAR like this:
sbt -DmainClass=org.clulab.reach.export.server.FileProcessorWebUI assembly
Two output types are supported:
- "arizona"
  - a column-based format (tab-delimited .tsv file), where each row represents a distinct reaction observed in text - see the description
- "json"
  - document annotations and mentions serialized to json
  - See an example
Create a directory for the documents to be read by Reach:
mkdir -p path/to/my/input/directory
Move the papers you wish Reach to process to this directory, but please ensure that they are in one of our supported formats. Details on these formats, including instructions on how to retrieve papers formatted as .nxml from open access, can be found here.
NOTE: This section assumes you've already cloned the REACH repository locally (git clone https://github.com/clulab/reach.git).
Before running things, there are a few properties that may need to be updated in the project's config file. You can find the application.conf file at reach/main/src/main/resources/application.conf.
- papersDir
  - set this property to whatever you're using for path/to/my/input/directory
- outDir
  - set this property to whatever you're using for path/to/my/output/directory
  - if this directory doesn't already exist, it will be created at runtime
- outputTypes
  - the output formats to target for export of results. We recommend "fries" or "arizona"
  - "fries" will produce a series of .json files for each paper
  - "arizona" will produce a column-based output file for each paper in the format described in this document
- threadLimit
  - use this to specify the number of papers to attempt to process in parallel
  - note that as you increase parallelization, you will also need to allocate more memory (RAM) in the project's .sbtopts file
- withAssembly
  - setting this to true includes causal precedence information (i.e., reaction A causally precedes reaction B) in the output
- logging.logfile
  - specify the path where the log file should be written
- logging.loglevel
  - specify the logging level. Default: INFO
- ignoreSections
  - a list of paper sections that should be ignored when processing the input papers. To be ignored, these strings must exactly match the relevant fields in the nxml or tsv input files
- restart.useRestart
  - specify whether to log successfully processed input papers and/or whether to skip logged papers on subsequent processing runs. See the restart section below for more information.
- restart.logfile
  - specify the path where the restart log file should be written. By default, the restart log file is written to the output directory, that is, the directory specified by the outDir configuration variable (above).
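As a rough illustration of the properties listed above, an application.conf fragment might look like the sketch below. The property names come from the list; every value is a placeholder or an assumption. In particular, the reach.log file name, the example section names, and the list syntax for outputTypes are guesses. Compare against the application.conf shipped with the repository for the actual layout and defaults.

```hocon
# Illustrative values only; replace the placeholder paths with your own.
papersDir = "path/to/my/input/directory"
outDir = "path/to/my/output/directory"

# Export format(s); assumed here to be a list
outputTypes = ["fries"]

# Number of papers to attempt to process in parallel
threadLimit = 2

# Include causal precedence information in the output
withAssembly = true

# Example section names only, not a recommended list
ignoreSections = ["references", "methods"]

logging.loglevel = INFO
# Hypothetical log file name
logging.logfile = "path/to/my/output/directory/reach.log"

restart.useRestart = true
restart.logfile = "path/to/my/output/directory/restart.log"
```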
sbt "run-main org.clulab.reach.RunReachCLI"
If the restart capability is enabled by the restart.useRestart flag (true by default), ReachCLI will append the name of each successfully processed input file (one per line) to a log file (by default restart.log). The restart log file is located, by default, in the OUTPUT directory, as Reach might not have write permission on the input directory. When ReachCLI starts up it looks for and reads the restart log file to find which input files it can SKIP. The restart log file can be empty or missing, in which case ReachCLI will process (or reprocess if restarted) all input files. Input files which fail to process are not written to the restart log file. You can manually edit this text file to control which files are skipped during the run.
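Because the restart log is plain text with one file name per line, you can preview which input files a run would still process. The shell function below is a minimal sketch of that skip logic, not Reach's own implementation. The name pending_inputs is hypothetical, and the sketch assumes the log stores bare file names rather than full paths.

```shell
#!/bin/sh
# Hypothetical helper (not part of Reach): print the names of files in
# INPUT_DIR that are NOT listed in RESTART_LOG, i.e. the files a restarted
# run would still need to process.
# Assumption: the restart log holds one bare file name per line.
pending_inputs() {
  input_dir=$1
  restart_log=$2
  for path in "$input_dir"/*; do
    # Skip subdirectories and the unexpanded glob of an empty directory.
    [ -f "$path" ] || continue
    name=$(basename "$path")
    # A missing or empty log means no file is skipped.
    if [ -f "$restart_log" ] && grep -Fxq -- "$name" "$restart_log"; then
      continue
    fi
    printf '%s\n' "$name"
  done
}
```

For example, pending_inputs path/to/my/input/directory path/to/my/output/directory/restart.log lists the papers that are not yet recorded in the log. Deleting a line from the log makes the corresponding file appear in this list again.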