Inference

This repository contains all of the RD-Switchboard Inference Engine code that is executable on EC2 machines. It contains separate Java applications as well as the libraries used by these applications.

Requirements

The program requires Java 1.7 and Apache Maven 3.0.5. It also requires Neo4j 2.3.1.

The program has been tested on Ubuntu Linux 14.04 and should work on any other Linux distribution as well.

Project structure

The Inference Engine consists of a selection of Java modules united into one global Maven project, located in the main repository folder. Some of these modules are libraries used by other modules; the remaining modules are separate applications designed to perform different tasks. All applications require one or two Neo4j instances to be available and use them in exclusive mode, so no two applications can work at the same time. The Inference Engine is supposed to run them one by one in a batch process. This design makes it possible to exclude some tasks from the batch to save time and resources.

Project modules and data are separated into different folders to make navigation between them easy.

Build and manage

To build the whole project, run mvn package from the repository folder. Maven will download all required dependencies and build all existing modules. It will also generate a distribution in the Build/distribution/target/inference-${project.version} folder and produce gz and bz2 archives of this distribution. If the archives or the assembled folder are not required, they can be disabled in the assembly configuration located at Build/distribution/src/assembly/bin.xml.

To install the project into your local Maven repository, execute mvn install from the repository folder. After that you will be able to build any module separately by executing mvn package in the module folder. However, if one of the modules it depends on has changed, that module must be installed again.

You can also build a single module, together with the modules it depends on, without building the whole project, by executing the Maven command mvn install -pl :${module.name} -am from the repository folder. For example, to build only the ANDS import module, you can execute:

mvn install -pl :import_ands -am

To change the project version, execute mvn versions:set -DgenerateBackupPoms=false from the main repository folder and enter the new version.

Distribute

If the Distribute module has been compiled, Maven will create a global distribution with all modules and all dependencies, located in Build/distribution/target/inference-${project.version}. It will also create bz2 and gz archives of this distribution. We recommend uploading the whole archive to the server and unpacking it there. The bz2 version is usually a bit smaller but takes more time to unpack. You can use either of them or create your own archive by zipping the distribution folder.

Installation

You will need to install at least two Neo4j databases: aggregator and nexus. You can download Neo4j from the official Neo4j website.

Copy the archive to the server, unpack it, and create one copy of the unpacked folder per database:

tar -xzvf neo4j-community-2.3.1-unix.tar.gz
cp -r neo4j-community-2.3.1 neo4j-aggregator
cp -r neo4j-community-2.3.1 neo4j-nexus

Next, unpack the Inference archive by executing tar -xjvf inference-${project.version}.tar.bz2 for bz2 or tar -xzvf inference-${project.version}.tar.gz for gzip. Replace ${project.version} with the actual project version:

tar -xzvf inference-1.3.0.tar.gz
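Because the distribution ships in two archive formats, the tar flag has to match the file extension. A minimal sketch of picking the flag automatically is shown here; unpack_flags is a hypothetical helper and is not part of the distribution. It only prints the command, so it is safe to try.

```shell
# Pick the tar flags from the archive extension.
# unpack_flags is a hypothetical helper, not part of the distribution.
unpack_flags() {
    case "$1" in
        *.tar.gz|*.tgz)   echo "-xzvf" ;;   # gzip-compressed archive
        *.tar.bz2|*.tbz2) echo "-xjvf" ;;   # bzip2-compressed archive
        *) echo "unsupported archive: $1" >&2; return 1 ;;
    esac
}

# Print the command instead of running it.
archive="inference-1.3.0.tar.bz2"
echo "tar $(unpack_flags "$archive") $archive"
# prints: tar -xjvf inference-1.3.0.tar.bz2
```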

Configuration

The distribution has a properties folder where all properties files are located. Each module should have at least one configuration file. Please refer to each module's documentation to learn about the available configuration options and how to modify them. You can have more than one properties file per module, each with a different configuration. To use a specific one, pass the path to the configuration file as a parameter to the jar file.

Execution

All executable modules can be run by calling Java: java -jar ${module.name}-${module.version}.jar [${optional.path.to.properties.file}]. The output of the program can be redirected to a log file and the program itself can be run in the background, which allows you to monitor the process without interfering with its work. We recommend adding the nohup keyword before the java call, which ensures that the program finishes its work even if your connection to the server is terminated.

For example, to execute the ANDS import with a custom configuration file and without interruptions, you can use this command:

nohup java -jar import_ands-1.3.0.jar properties/import_ands/properties > logs/import_ands.txt 2>&1 &
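The command line above follows a fixed pattern: jar name, then an optional properties path. As a sketch only, the pattern could be assembled by a small helper; module_cmd is a hypothetical name, and the jar naming is assumed to follow the ${module.name}-${module.version}.jar convention used in the example.

```shell
# Build the launch command for one module.
# module_cmd is a hypothetical helper; arguments are:
#   $1 module name, $2 module version, $3 optional properties path
module_cmd() (
    cmd="java -jar ${1}-${2}.jar"
    # Append the properties path only when one was given.
    if [ -n "${3:-}" ]; then
        cmd="$cmd $3"
    fi
    echo "$cmd"
)

# Print the full background invocation rather than executing it.
echo "nohup $(module_cmd import_ands 1.3.0 properties/import_ands/properties) > logs/import_ands.txt 2>&1 &"
```

Called without the third argument, the helper prints the command that relies on the module's default configuration.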

We also recommend combining all programs into a batch process so that they are called one by one. A shell script is most suitable for that. A sample shell script is provided in the Scripts folder.
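The idea of such a batch can be sketched as follows. This is not the script from the Scripts folder, which may look different; RUNNER, VERSION, MODULES, and run_batch are names assumed for illustration. Modules run strictly one after another, because each needs exclusive access to Neo4j, and the batch stops at the first failure.

```shell
#!/bin/sh
# Sketch of a sequential batch runner (all names here are assumptions).
VERSION="${VERSION:-1.3.0}"
RUNNER="${RUNNER:-java -jar}"   # set RUNNER=echo for a dry run
MODULES="${MODULES:-import_institutions import_patterns import_ands link_nodes}"

run_batch() {
    mkdir -p logs
    for module in $MODULES; do
        echo "starting $module"
        # Run in the foreground so the next module starts only after
        # this one has released the Neo4j database.
        # $RUNNER is left unquoted on purpose so "java -jar" splits in two.
        if ! $RUNNER "${module}-${VERSION}.jar" > "logs/${module}.txt" 2>&1; then
            echo "$module failed, see logs/${module}.txt" >&2
            return 1
        fi
    done
    echo "batch finished"
}
```

Calling run_batch after setting RUNNER=echo writes the jar names into the log files instead of starting Java, which is a cheap way to check the order before a real run. Removing a module name from MODULES excludes that task from the batch.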

Execution Order

Some tasks must be executed in a specified order. We recommend running all import applications first, then the search application, the linking applications, and finally the export applications.

The suggested run order is:

  • import_institutions: To import predefined institution nodes
  • import_patterns: To import predefined search patterns
  • import_arc: To import ARC grants
  • import_nhmrc: To import NHMRC grants
  • import_ands: To import ANDS records
  • import_dryad: To import Dryad records
  • import_cern: To import CERN records
  • import_orcid: To import ORCID records
  • import_dara: To import DaRa records
  • import_openaire: To import DLI records
  • import_crossref: To search and import crossref records
  • google_search: To search ANDS grant titles in Google
  • link_nodes: To link all existing nodes by DOI, ORCID ID, etc
  • link_web_researchers: To link nodes with Web:Researcher nodes
  • copy_harmonyzed: To copy harmonized data into Nexus Neo4j
  • delete_nodes: To delete orphan nodes from Nexus Neo4j
  • test_connections: To test the number of existing connections in Nexus Neo4j
  • export_graph_json: To export final graphs
  • export_keys: To export existing keys

Please be aware that all these programs require exclusive access to the Neo4j database, therefore no two programs can run at the same time on the same database. The google_search program can only process 10000 requests per day due to Google CSE limitations. It is also the only program that accesses Neo4j in read-only mode: it makes no changes to the Neo4j database and stores all information it finds in a cache instead. Therefore google_search can use a copy of the Neo4j database and can start as soon as you have imported the data sources selected for the Google search. This allows you to build the rest of the database while the Google search is running, which saves a decent amount of time.
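To see why running google_search in parallel matters, the daily limit can be turned into a rough duration estimate. The sketch below assumes one request per title, which is an assumption and not something this document states; search_days is a hypothetical helper.

```shell
# Estimate how many days google_search needs at 10000 requests per day.
# Assumes one request per title (an assumption for illustration).
DAILY_LIMIT=10000

search_days() {
    # Integer ceiling division: (n + limit - 1) / limit
    echo $(( ($1 + DAILY_LIMIT - 1) / DAILY_LIMIT ))
}

search_days 25000   # prints 3
```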