This repository contains all Inference Engine code of the RD-Switchboard that is executable on EC2 machines. It contains separate Java applications as well as the libraries used by these applications.
The programs require Java 1.7 and Apache Maven 3.0.5. They also require Neo4j 2.3.1.
The software has been tested on Ubuntu Linux 14.04 and should work on any other Linux distribution as well.
The Inference Engine consists of a selection of Java modules united into one global Maven project located in the main repository folder. Some of these modules are libraries used by other modules; the remaining modules are separate applications designed to perform different tasks. All applications require one or two Neo4j instances to be available and use them in exclusive mode, so no two applications can work at the same time. The Inference Engine is supposed to run them one by one in a batch process; this design makes it possible to exclude some tasks from the batch to save time and resources.
Project modules and data are separated into different folders to make navigation between them easy.
- Build: Project Distribution module
- Data: Static data, including lists of institutions, patterns, services, ARC and NHMRC grants, and test nodes
- Importers: All modules, designed to import data into Neo4j from different Data Sources
- Importers/ANDS: ANDS Metadata Import Module
- Importers/ARC: ARC Grants Import Module
- Importers/Cern: CERN Metadata Import Module
- Importers/CrossRef: CrossRef Metadata Import Module
- Importers/DaRa: Da-Ra Metadata Import Module
- Importers/Dryad: Dryad Metadata Import Module
- Importers/NHMRC: NHMRC Metadata Import Module
- Importers/ORCID: ORCID Metadata Import Module
- Importers/OpenAIRE: OpenAIRE Metadata Import Module
- Importers/Web: Static Metadata Import Module (Institutions, Patterns and --Services--)
- Libraries: All Libraries, used by other Modules
- Libraries/CrossRef: CrossRef API Library
- Libraries/DDI: DDI Metadata Crosswalk (not used at this moment because of insufficient metadata)
- Libraries/DLI: DLI Metadata Crosswalk (used by OpenAIRE)
- Libraries/DaRa: DaRa Metadata Crosswalk (used by DaRa)
- Libraries/Fuzzy: Fuzzy Search Library
- Libraries/Google: Google CSE and Google Cache Libraries
- Libraries/Marc21: Marc21 Metadata Crosswalk (used by CERN)
- Libraries/Mets: Mets Metadata Crosswalk (used by Dryad)
- Libraries/ORCID: ORCID API Library
- Libraries/RIF_CS: RIF:CS Metadata Crosswalk (used by ANDS)
- Libraries/Scopus: Scopus API Library
- Libraries/graph_utils: Graph Representation Library used in RD-Switchboard project
- Libraries/neo4j_utils: Neo4j Database Helper Library used in RD-Switchboard project
- Linkers: Modules used to link nodes from different data sources
- Linkers/Google: Web Researcher Linker will link different data sources with Web:Researcher nodes found by Google CSE
- Linkers/Static: Node Linker will link different data sources by existing metadata (ORCID ID, Scopus ID, DOI, etc.)
- Obsolete: Code no longer used by the Engine, but which might be used in the future
- Search: Search Engines Modules
- Search/Google: Google Search Module to search texts in Google CSE and generate Google cache
- Scripts: Contains sample Inference scripts
- Utils: Different Helper Applications
- Utils/Neo4j: Neo4j Applications
- Utils/Neo4j/copy_harmonized: Copy Harmonized Module will generate the harmonized Neo4j instance (Nexus) from the aggregation Neo4j instance
- Utils/Neo4j/delete_nodes: Delete Orphan Nodes Module will delete orphan nodes (nodes which do not have any connections to other nodes)
- Utils/Neo4j/export_graph_json: Export Graph JSON Module will export finished graphs into S3
- Utils/Neo4j/export_keys: Export Keys Module will export all existing keys, used to synchronize RDS records with graphs
- Utils/Neo4j/replace_source: Replace Source Module will replace one source name with another
- Utils/Neo4j/test_connections: Test Connections Module will test existing connections and generate a report
To build the whole project, simply run `mvn package`
from the repository folder. Maven will download all required dependencies and build all existing modules. It will also generate a distribution in the Build/distribution/target/inference-${project.version} folder and produce gz and bz2 archives of this distribution. If the archives or the assembled folder are not required, they can be disabled in the assembly configuration located at Build/distribution/src/assembly/bin.xml
To install the project into your local Maven repository, execute `mvn install`
from the repository folder. After that you will be able to build any module separately by executing `mvn package` in the module folder, but if one of the modules it depends on has been changed, that module will need to be installed again.
You can also build a single module without installing the whole project by executing the Maven command `mvn install -pl :${module.name} -am`
from the repository folder. For example, to build only the ANDS import module, you can execute:
mvn install -pl import_ands -am
To change the project version, execute `mvn versions:set -DgenerateBackupPoms=false`
from the main repository folder and enter the new version when prompted.
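The new version can also be supplied directly on the command line, as in this minimal sketch (1.3.1 is only an example value):

```bash
# Set the new project version non-interactively; replace 1.3.1 with the
# version you actually want.
mvn versions:set -DnewVersion=1.3.1 -DgenerateBackupPoms=false
```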
If the Distribution module has been compiled, Maven will create a global distribution with all modules and all dependencies located in Build/distribution/target/inference-${project.version}. It will also create bz2 and gz archives of this distribution. We recommend uploading the whole archive to the server and unpacking it there. The bz2 version is usually a bit smaller but takes more time to unpack. You can use either of them or create your own archive by zipping the distribution folder.
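For example, the archive could be copied to the server with scp; the host name and target directory below are placeholders, not part of the project:

```bash
# Upload the gz archive to the target machine (adjust host and path to your setup).
scp Build/distribution/target/inference-1.3.0.tar.gz ubuntu@your-ec2-host:~/
```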
You will need to install at least two Neo4j databases: aggregator and nexus. You can download Neo4j from the official Neo4j web site.
Copy the archive to the server and unpack it (the Neo4j archive unpacks into a neo4j-community-2.3.1 folder):
tar -xzvf neo4j-community-2.3.1-unix.tar.gz
cp -r neo4j-community-2.3.1 neo4j-aggregator
cp -r neo4j-community-2.3.1 neo4j-nexus
Next, unpack the Inference archive by executing `tar -xjvf inference-${project.version}.tar.bz2`
for bz2, or `tar -xzvf inference-${project.version}.tar.gz`
for gzip. Replace ${project.version} with the actual project version:
tar -xzvf inference-1.3.0.tar.gz
The distribution will have a properties
folder where all properties files are located. Each module should have at least one configuration file. Please refer to each module's documentation to learn about the possible configuration options and how you can modify them. You can have more than one properties file for each module, each with a different configuration; to use one of them, add the path to the configuration file as a parameter when running the jar file.
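For instance, an alternative configuration can sit next to the default one; the `properties.sandbox` name below is purely hypothetical and only illustrates passing an explicit path:

```bash
# Default configuration (used by the sample command further below):
#   properties/import_ands/properties
# Hypothetical alternative configuration, selected explicitly:
java -jar import_ands-1.3.0.jar properties/import_ands/properties.sandbox
```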
All executable modules can be run by calling Java: `java -jar ${module.name}-${module.version}.jar [${optional.path.to.properties.file}]`
. The output of the program can be redirected to a log file and the program itself can be run as a daemon, which allows you to monitor the process without interfering with the program's work. We recommend adding `nohup`
before the Java call; this ensures that the program will finish its work even if your connection to the server is terminated.
For example, to execute the ANDS import with a custom configuration file and without interruptions, you can use this command:
nohup java -jar import_ands-1.3.0.jar properties/import_ands/properties > logs/import_ands.txt 2>&1 &
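While the job is running in the background, its progress can be followed from the log file named in the command above:

```bash
# Follow the importer's output as it is written to the log file.
tail -f logs/import_ands.txt
```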
We also recommend combining all programs into a batch process so they are called one by one. A shell script is most suitable for that. A sample of such a shell script is provided in the Scripts folder; a minimal sketch is also shown after the run order list below.
Some tasks must be executed in a specified order. We recommend running all import applications first, then the search application, the linking applications, and finally the export applications.
The suggested run order is:
- import_institutions : To import predefined institutions nodes
- import_patterns: To import predefined search patterns
- import_arc: To import ARC grants
- import_nhmrc: To import NHMRC grants
- import_ands: To import ANDS records
- import_dryad: To import Dryad records
- import_cern: To import CERN records
- import_orcid: To import ORCID records
- import_dara: To import DaRa records
- import_openaire: To import DLI records
- import_crossref: To search for and import CrossRef records
- google_search: To search ANDS grant titles in Google
- link_nodes: To link all existing nodes by DOI, ORCID ID, etc
- link_web_researchers: To link nodes with Web:Researcher nodes
- copy_harmonized: To copy harmonized data into the Nexus Neo4j
- delete_nodes: To delete orphan nodes from the Nexus Neo4j
- test_connections: To test the numbers of existing connections in the Nexus Neo4j
- export_graph_json: To export final graphs
- export_keys: To export existing keys
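The sketch below shows how such a batch script might look. It assumes the script runs from the unpacked distribution folder, that every module follows the `properties/<module>/properties` and `logs/<module>.txt` naming shown above, and that 1.3.0 is the current version; only some of the steps are listed, so extend it with the remaining modules in the order given above.

```bash
#!/bin/bash
# Minimal batch sketch: run the Inference modules one by one, stopping the
# batch if any step fails. Paths and the 1.3.0 version are assumptions based
# on the examples above.
set -e

run() {
    module="$1"
    echo "Running ${module}..."
    java -jar "${module}-1.3.0.jar" "properties/${module}/properties" \
        > "logs/${module}.txt" 2>&1
}

run import_institutions
run import_patterns
run import_arc
run import_nhmrc
run import_ands
# ... the remaining importers, google_search, then:
run link_nodes
run link_web_researchers
run copy_harmonized
run delete_nodes
run test_connections
run export_graph_json
run export_keys
```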
Please be aware that all these programs require exclusive access to the Neo4j database, therefore no two programs can run at the same time on the same database. The google_search software will only be able to process 10000 requests per day due to Google CSE limitations. google_search is the only program that accesses Neo4j in read-only mode and does not make any changes in the Neo4j database, storing all found information in a cache instead. Therefore google_search can use a copy of the Neo4j database and run as soon as you have imported the data sources selected for Google Search. This allows you to build the rest of the database while the Google Search is being executed, saving a decent amount of time.
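To run google_search in parallel in this way, one option is to point it at a copy of the aggregator database. The copy's name below is an example only, and the configuration option that tells google_search where the database lives should be taken from that module's documentation:

```bash
# Make a separate copy of the aggregator database for google_search
# (example name only) and run the search in the background.
cp -r neo4j-aggregator neo4j-aggregator-google
nohup java -jar google_search-1.3.0.jar properties/google_search/properties \
    > logs/google_search.txt 2>&1 &
```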