Using Reach from a FAT JAR

Tom Hicks edited this page Aug 16, 2017 · 26 revisions

Creating a FAT JAR

Reach can be compiled, along with all its required dependencies, into a single JVM JAR file, known as a 'FAT' JAR. To produce a default FAT JAR for Reach:

sbt assembly

By default, the Reach FAT JAR runs in 'batch' mode: processing papers from an input directory and producing output into an output directory. Both directories, and many other configuration options, may be specified in the application.conf configuration file. For more information on 'batch' mode see the

The FAT JAR may also be configured to run a different main program. For instance, to produce a FAT JAR which runs the Reach Shell, compile it like this:

sbt -DmainClass=org.clulab.reach.ReachShell assembly

To produce a FAT JAR which runs a small web service to process files, compile the FAT JAR like this:

sbt -DmainClass=org.clulab.reach.export.server.FileProcessorWebUI assembly

Two output types are supported:

  1. "arizona"
    • a column-based format (tab-delimited .tsv file), where each row represents a distinct reaction observed in text
    • see the description
  2. "json"

Running multiple papers

Step 1: Prepare your input and output directories

Create a directory for the documents to be read by Reach:

mkdir -p path/to/my/input/directory

Move the papers you want Reach to process into this directory, making sure that they are in one of our supported formats. Details on these formats, including instructions on how to retrieve papers formatted as .nxml from open access, can be found here.

Step 2: Configure application.conf

NOTE: This section assumes you've already cloned the REACH repository locally (git clone https://github.com/clulab/reach.git).

Before running things, there are a few properties that may need to be updated in the project's config file. You can find the application.conf file at reach/main/src/main/resources/application.conf

  • papersDir

    • set this property to whatever you're using for path/to/my/input/directory
  • outDir

    • set this property to whatever you're using for path/to/my/output/directory
    • if this directory doesn't already exist, it will be created at runtime
  • outputTypes

    • the output formats to target for export of results. We recommend "fries" or "arizona"
    • "fries" will produce a series of .json files for each paper
    • "arizona" will produce a column-based output file for each paper in the format described in this document
  • threadLimit

    • Use this to specify the number of papers to attempt to process in parallel
    • Note that as you increase parallelization, you will also need to allocate more memory (RAM) in the project's .sbtopts file

Additional properties

  • withAssembly

    • setting this to true will signal inclusion of causal precedence (i.e., reaction A causally precedes reaction B) information in the output
  • logging.logfile

    • specify the path where the log file should be written
  • logging.loglevel

    • specify the level for logging. Default: INFO level.
  • ignoreSections

    • a list of paper sections that should be ignored when processing the input papers. In order to be ignored, these strings must match the relevant fields in the nxml or tsv input files exactly
  • restart.useRestart

    • specify whether to log successfully processed input papers and/or whether to skip logged papers on subsequent processing runs. See restart section below for more information.
  • restart.logfile

    • specify the path where the restart log file should be written. By default, the restart log file is written to the output directory; that is, the directory specified by the outDir configuration variable (above).
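
For illustration, the properties listed above might appear in application.conf roughly as follows. This is only a sketch: the property names come from the lists above, but every value (paths, thread count, section names) is a placeholder, and the list syntax used for outputTypes and ignoreSections is an assumption based on their descriptions. Check the application.conf shipped with the repository for the actual layout and defaults.

```
# Illustrative values only; adjust to your own setup.
papersDir = "path/to/my/input/directory"
outDir = "path/to/my/output/directory"
outputTypes = ["fries"]            # or ["arizona"]; list form is assumed
threadLimit = 2                    # placeholder; more threads need more RAM
withAssembly = false               # true adds causal precedence information
logging.loglevel = INFO
logging.logfile = "path/to/reach.log"        # hypothetical path
ignoreSections = ["references"]    # hypothetical section name
restart.useRestart = true
restart.logfile = "path/to/my/output/directory/restart.log"
```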

Step 3: Run ReachCLI

sbt "run-main org.clulab.reach.RunReachCLI"

Restart Capability

If the restart capability is enabled by the restart.useRestart flag (true by default), ReachCLI will append the name of each successfully processed input file (one per line) to a log file (by default restart.log). The restart log file is located, by default, in the OUTPUT directory, as Reach might not have write permission on the input directory. When ReachCLI starts up it looks for and reads the restart log file to find which input files it can SKIP. The restart log file can be empty or missing, in which case ReachCLI will process (or reprocess if restarted) all input files. Input files which fail to process are not written to the restart log file. You can manually edit this text file to control which files are skipped during the run.
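
The restart behavior described above can be sketched as a small shell script. This is not Reach's actual implementation, only an illustration of the logic: `process_paper` is a hypothetical stand-in for Reach's per-paper processing (here it simply fails for any file whose name contains "bad"), and the file names are made up for the demo.

```shell
#!/bin/sh
# Sketch of the restart-log logic; NOT Reach's real code.

# Hypothetical stand-in for processing one paper:
# pretend that files with "bad" in their name fail.
process_paper() {
  case "$1" in
    *bad*) return 1 ;;
    *)     return 0 ;;
  esac
}

# run_batch PAPERS_DIR RESTART_LOG
run_batch() {
  papers_dir=$1
  restart_log=$2
  for paper in "$papers_dir"/*; do
    [ -f "$paper" ] || continue
    name=$(basename "$paper")
    # Skip papers already listed in the restart log (exact whole-line match).
    # A missing log file means nothing is skipped.
    if [ -f "$restart_log" ] && grep -Fxq -- "$name" "$restart_log"; then
      echo "skip $name"
      continue
    fi
    # Only successfully processed papers are appended, one name per line.
    if process_paper "$paper"; then
      echo "$name" >> "$restart_log"
      echo "done $name"
    else
      echo "fail $name"
    fi
  done
}

# Demo in a throwaway directory: run the same batch twice.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/in" "$demo_dir/out"
touch "$demo_dir/in/paper1.nxml" "$demo_dir/in/paper2.nxml" "$demo_dir/in/bad.nxml"
first_run=$(run_batch "$demo_dir/in" "$demo_dir/out/restart.log")
second_run=$(run_batch "$demo_dir/in" "$demo_dir/out/restart.log")
log_contents=$(cat "$demo_dir/out/restart.log")
rm -rf "$demo_dir"
```

In this demo the first run records paper1.nxml and paper2.nxml in the log; the second run skips both and retries only bad.nxml, which fails again and so is never logged. Deleting a line from the log by hand would cause that paper to be reprocessed on the next run.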