Pipeline 2 Documentation

VFB Pipeline 2 comprises five servers/services and six data pipelines:

  • Pipeline 2 servers:
    • VFB knowledge base (vfb-kb)
    • VFB triple store (vfb-triplestore)
    • SOLr + preconfigured VFB SOLr core (vfb-solr)
    • owlery (vfb-owlery)
    • VFB Neo4J production instance (vfb-prod)
  • Pipeline 2 data pipelines:
    • Transform KB1 to KB2 (vfb-kb2kb) [to be obsoleted]
    • Validate KB (vfb-validate)
    • Data collection (vfb-collect-data)
    • Triple store ingestion (vfb-updatetriplestore)
    • Data transformation and dumps for production instances (vfb-dumps)
    • VFB production instance ingestion (vfb-update-prod)

Servers and data pipelines are combined into six general sub-pipelines, which are configured as Jenkins jobs (currently located here). This documentation describes all six sub-pipelines in detail, including the role the individual servers and data pipelines play. All high-level documentation, including images, can be found in the vfb-pipeline-config repo. Note: there was once a pipeline server named vfb-integration-api, which has since been discarded in favour of vfb-dumps.

Pipeline Overview

Sub-pipeline: Deploy KB (pip_vfb-kb)

  • Summary: This pipeline loads the current KB from backup, applies a series of transformation steps, and validates the resulting version of the KB for VFB Schema compliance. The finalised KB is backed up and then spun up again from backup to clear caches. Components:
    • vfb-kb (deployment of the VFB knowledge base)
    • vfb-kb2kb (provisional data pipeline managing the migration from KB1 to KB2)
    • vfb-validate (validation pipeline to check if KB2 is in the correct basic shape for neo4j2owl)
  • Jenkins job
  • Dependents: pip-triplestore

Service: vfb-kb

Detailed notes on vfb-kb

  • There is nothing specifically important about vfb-kb, other than that it comes in two flavours. The pipeline spins up a KB instance which gets a small amount of pre-processing in vfb-collect-data (basically setting the labels correctly). This instance is spun up from backup only for the pipeline run, and thrown away after the pipeline is finished. It is important to note that this means the vfb-kb pipeline edition is not necessarily identical to the vfb-kb curation edition - the pipeline edition corresponds to the curation edition at the time of the last backup. So, if you want them to correspond exactly, you need to make sure the backup step is run right before vfb-kb is spun up.
  • Currently the Dockerfile with the neo4j2owl plugin (and APOC!) is on a branch! So be careful when you merge this in!

Data pipeline: vfb-kb2kb [provisional]

Detailed notes on vfb-kb2kb

  • Currently, in order to perform the KB2KB migration, a table is required to decide what type an entity is. This is currently on a branch in the pipeline repo, which needs to be taken into account when the pipeline is merged in. This can probably be merged, but should be done with the usual care (pull request, review that nothing important has changed accidentally - this branch was created years ago).
  • The script that performs the change is on an unmerged branch in the VirtualFlyBrain/VFB_neo4j repo.
  • The script should be obsoleted once the migration to KB2 is completed.

Data pipeline: vfb-validate

Detailed notes on vfb-validate

  • The actual validation process is implemented as a python script. It is not a complete validation, but it performs a number of checks, such as verifying that every node has at least one OWL base type and an IRI. Each test is a single function in the script, so they should be fairly easy to read over (a minimal sketch of one such check follows this list).
  • The tests produce a report that is printed as part of the Jenkins job - currently the pipeline does not fail if there is a validation error!
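The following is a minimal sketch of what a single check in this style could look like, assuming a Neo4j instance reachable over bolt. The connection details, the list of OWL base-type labels, and the property name iri are assumptions for illustration, not taken from the actual validation script.

```python
# Illustrative sketch: report nodes that lack an OWL base type label or an IRI.
# Bolt URI, credentials, base-type labels and the `iri` property are assumptions.
from neo4j import GraphDatabase

OWL_BASE_TYPES = ["Individual", "Class", "ObjectProperty", "AnnotationProperty", "DataProperty"]

def nodes_missing_base_type_or_iri(session):
    """Return (a sample of) nodes without an OWL base type label or an iri property."""
    query = (
        "MATCH (n) "
        "WHERE n.iri IS NULL OR NOT any(l IN labels(n) WHERE l IN $base_types) "
        "RETURN n.iri AS iri, labels(n) AS labels LIMIT 100"
    )
    return list(session.run(query, base_types=OWL_BASE_TYPES))

if __name__ == "__main__":
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "neo4j"))
    with driver.session() as session:
        offenders = nodes_missing_base_type_or_iri(session)
        # Mirror the report-style behaviour: print findings, do not fail the job.
        print(f"{len(offenders)} node(s) missing an OWL base type or IRI")
        for record in offenders:
            print(record["iri"], record["labels"])
    driver.close()
```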

Sub-pipeline: Deploy triplestore (pip_vfb-triplestore)

  • Summary: This pipeline deploys an empty triplestore, collects all VFB relevant data (including KB and ontologies), and pre-processes and loads the collected data into the triplestore. Components:
    • vfb-triplestore (deploying triplestore)
    • vfb-collect-data (data collection and preprocessing pipeline for all VFB data)
    • vfb-update-triplestore (loading collected data into the triplestore)
  • Jenkins job
  • Depends on: pip-kb
  • Dependents: pip-dumps

Service: vfb-triplestore

  • Image: yyz1989/rdf4j:latest (dockerhub)
  • Git: We do not maintain this, see ticket
  • Summary: The triplestore is currently an unspectacular default implementation of rdf4j-server. We make use of a simple in-memory store that is configured here. The container is maintained elsewhere (see docker-hub pages of image for details).

Detailed notes on vfb-triplestore

  • Triplestore access: the rdf4j-server exposes its standard REST/SPARQL endpoints (see the sketch after this list).
  • We should probably migrate away from this particular image of rdf4j towards our own VFB one, because there is a danger (though not likely) that the container gets removed or updated, causing problems for us.
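A minimal sketch of querying the triplestore over the standard RDF4J REST protocol. The server URL and the repository id ("vfb") are assumptions based on the descriptions in this document; adjust them to the actual deployment.

```python
# Count the triples in the repository via the RDF4J SPARQL endpoint.
import requests

RDF4J_REPO = "http://localhost:8080/rdf4j-server/repositories/vfb"  # assumed URL/repo id

query = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"
response = requests.get(
    RDF4J_REPO,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()
print(response.json()["results"]["bindings"])
```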

Data pipeline: vfb-collect-data

  • Image: virtualflybrain/vfb-pipeline-collectdata:latest (dockerhub)
  • Git: https://github.com/VirtualFlyBrain/vfb-pipeline-collectdata
  • Dockerfile
  • Summary: This container encapsulates a process that downloads a number of source ontologies, obtains the OWL version of the VFB KB, and applies a number of ROBOT-based pre-processing steps, in particular: extracting modules/slices of external ontologies, running consistency checks, and serialising as ttl for quicker ingest into the triplestore. It also contains the data embargo pipeline and has some provisions for SHACL validation.

neo4j2owl:exportOWL()

  • Exporting the KB into OWL2 is managed through a custom procedure (exportOWL()) implemented in the neo4j2owl plugin.
  • The plugin is documented in detail in the repo's README.

Detailed notes on vfb-collect-data

  • The process is encoded here. It performs the following steps:
    1. Exporting KB to OWL using the above neo4j2owl:exportOWL() procedure.
    2. Removing embargoed data. The technique applied here is based on using ROBOT query and encoding the embargo logic as SPARQL queries (combined with ROBOT remove).
    3. Downloading external ontologies.
    4. Ontologies in vfb_fullontologies.txt are imported in their entirety.
    5. Ontologies in vfb_slices.txt are sliced. The slice corresponds to a BOT (bottom) module whose seed signature is the combined signature of all ontologies in the fullontologies section together with the signature of the KB (see the sketch after this list).
      • Note: there is an annoying hack in there that should be fixed, simply by removing the if/else in this code block. First though, we need to understand why this process is so slow (ROBOT memory?).
    6. All ontologies are converted to turtle.
    7. The KB is checked using a SHACL validation engine.
    8. All ontologies ready to be imported into the triplestore are gzipped.

Data pipeline: vfb-update-triplestore

  • Image: virtualflybrain/vfb-pipeline-updatetriplestore:latest (dockerhub)
  • Dockerfile
  • Git: https://github.com/VirtualFlyBrain/vfb-pipeline-updatetriplestore
  • Summary: This container encapsulates a process that (1) sets up the triplestore's vfb database and (2) loads all of the ttl files generated by vfb-collect-data into vfb-triplestore. The image contains the configuration details of the triplestore, such as the choice of triplestore engine.

Detailed notes on vfb-update-triplestore:

  • The process does nothing more than load the ontologies and data collected in the previous step into the triple store (a minimal sketch follows below).
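A minimal sketch of the load step, assuming the standard RDF4J REST protocol, a repository called "vfb", and gzipped Turtle files produced by vfb-collect-data. The server URL and file paths are placeholders.

```python
# Load gzipped Turtle files into the RDF4J repository's statements endpoint.
import glob
import gzip
import requests

STATEMENTS_ENDPOINT = "http://localhost:8080/rdf4j-server/repositories/vfb/statements"  # assumed

for path in glob.glob("collected/*.ttl.gz"):
    with gzip.open(path, "rb") as f:
        data = f.read()
    # POST appends the statements to the repository; PUT would replace its contents.
    r = requests.post(
        STATEMENTS_ENDPOINT,
        data=data,
        headers={"Content-Type": "text/turtle"},
    )
    r.raise_for_status()
    print(f"Loaded {path}")
```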

Sub-pipeline: Data transformation and dumps for production instances (pip_vfb-pipeline-dumps)

  • Summary: This pipeline transforms the knowledge graph in the triplestore into various custom dumps used by downstream services such as the VFB Neo4J production instance, owlery and solr.
  • Jenkins pipeline
  • Depends on: pip-triplestore
  • Dependents: pip-owlery, pip-prod

Data pipeline: vfb-dumps

Detailed notes on vfb-dumps

  • The process performs the following steps (all encoded in the Makefile):
    1. Build dump for vfb-owlery (all logical axioms in triplestore)
    2. Build dump for vfb-prod (VFB production instance)
    3. Build dump for vfb-solr (special json file, created using python)
  • There is a new section in the config file called filters which should be pretty self-explanatory. The main thing to know is that the ['iri_prefix'] filter actually checks whether the listed string is contained somewhere in the IRI - so in our case, VFBc_ would have worked as well. The ['neo4j_node_label'] filter simply excludes every entity that also has a particular node label associated with it (see the sketch after this list).
  • The vfb-solr pipeline is a bit more involved and also relies on the general pipeline config file.
  • There is a new section in the dumps.Makefile (around line 66) that allows adding arbitrary SPARQL CONSTRUCT queries to the produced dumps. This can be useful, for example, to materialise ad hoc neo labels. To add a new dump:
    1. pick a name and add it to the correct DUMPS variable (DUMPS_SOLR, DUMPS_PDB, DUMPS_OWLERY)
    2. create a new SPARQL query in sparql/, naming it 'construct_name.sparql', e.g. sparql/construct_image_names.sparql. Note that non-SPARQL goals, like 'inferred_annotation', need to be added separately.
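The following is a small sketch of the filter semantics described above. The config shape, the example values, and the entity representation are assumptions for illustration; only the keys 'iri_prefix' and 'neo4j_node_label' come from this section, and the sketch assumes the IRI filter selects entities whose IRI contains one of the listed strings.

```python
# Hypothetical filter configuration and the containment/exclusion logic it implies.
filters = {
    "iri_prefix": ["VFB_"],              # matched by substring containment in the IRI
    "neo4j_node_label": ["Deprecated"],  # entities carrying this label are dropped (example label)
}

def keep_entity(iri: str, labels: set) -> bool:
    """Return True if the entity passes both filters (assumed include/exclude semantics)."""
    # 'iri_prefix' is really a containment check, so e.g. "VFBc_" would also match VFB IRIs.
    matches_prefix = any(p in iri for p in filters["iri_prefix"])
    has_excluded_label = any(l in labels for l in filters["neo4j_node_label"])
    return matches_prefix and not has_excluded_label

print(keep_entity("http://virtualflybrain.org/reports/VFB_00101567", {"Individual"}))  # True
print(keep_entity("http://purl.obolibrary.org/obo/FBbt_00003624", {"Class"}))          # False
```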

Sub-pipeline: Deploy Owlery (pip_vfb-owlery, Service)

  • Summary: This pipeline deploys the Owlery webservice which is used by VFB to answer ontology queries (no special config).
  • Depends on: vfb-dumps
  • Dependents: None (gepetto)

Service: vfb-owlery

Sub-pipeline: VFB prod (pip_vfb-prod)

  • Summary: This pipeline deploys the production instance of the VFB neo4j database and loads all the relevant data.
  • Depends on: pip-dumps
  • Jenkins pipeline
  • Dependents: None (gepetto)

Service: vfb-prod

Data pipeline: vfb-update-prod

  • Image: virtualflybrain/vfb-pipeline-update-prod:latest (dockerhub)
  • Git: https://github.com/VirtualFlyBrain/vfb-pipeline-update-prod
  • Dockerfile
  • Summary: The update-prod container currently takes an ontology (from the integration layer) and loads it into the Neo4J instance (vfb-prod) using the neo4j2owl plugin. Process:
    1. Loading the ontology using the neo4j2owl:owl2Import() procedure
    2. Setting a number of indices (see detailed notes below).

Detailed notes about vfb-update-prod

  • You can set additional pipeline post-processing steps, like indices, by editing this file. Note that this file can be used to set arbitrary post-processing cypher queries, not just indices (contrary to the file name). Essentially, all listed cypher queries are executed in order right after the PDB import is completed (see the sketch after this list).
  • The possible configuration settings for the neo4j2owl:owl2Import() procedure are described here. The configuration is stored here.
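The following is a minimal sketch of the post-processing idea described above: a list of Cypher statements (index creation or any other post-import query) executed in order after the PDB import. The statements, labels, properties, and connection details are hypothetical examples (assuming Neo4j 4.x index syntax), not the actual pipeline configuration.

```python
# Run a list of post-processing Cypher statements against vfb-prod after import.
from neo4j import GraphDatabase

POST_PROCESSING_CYPHER = [
    # Hypothetical index examples (Neo4j 4.x syntax):
    "CREATE INDEX entity_short_form IF NOT EXISTS FOR (n:Entity) ON (n.short_form)",
    "CREATE INDEX entity_iri IF NOT EXISTS FOR (n:Entity) ON (n.iri)",
    # Arbitrary post-processing is possible too, e.g. materialising extra labels:
    "MATCH (n:Class) WHERE n.deprecated = true SET n:Deprecated",
]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "neo4j"))
with driver.session() as session:
    for statement in POST_PROCESSING_CYPHER:
        session.run(statement).consume()
driver.close()
```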

Sub-pipeline: VFB SOLr (pip_vfb-solr, Service)

Deployment during development phase:

  1. The pipeline is currently deployed as a series of connected Jenkins jobs.
  2. Every sub-pipeline has a Jenkins job that can be restarted manually. Every sub-pipeline will trigger all of its dependents. So if the pip_vfb-dumps pipeline is started, it will automatically trigger the pip_vfb-prod and pip_vfb-owlery pipelines to redeploy as well.
  3. The whole pipeline can be restarted by simply triggering the pip_vfb-kb pipeline to be re-run. This will trigger all downstream sub-pipelines.
  4. The whole pipeline is re-run every night at 4am.