Load Data Models

javild edited this page Apr 13, 2016 · 12 revisions

Getting data models

There are two ways of getting the data models used to populate the CellBase database:

  1. Build the CellBase knowledgebase from scratch by following the tutorial Download Sources
  2. Download pre-built data models from http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/

Load data models

Note that before loading the data models into the database, the CellBase code must have been compiled with Maven with the database credentials injected, as explained in the README.md file.

Use the CellBase CLI to load the data models:

cellbase/build/bin$ ./cellbase.sh load
The following options are required: -i, --input -d, --data     --database 

Usage:   cellbase.sh load [options]

Options:
      -C, --config         STRING     CellBase configuration.json file. Have a look at 
                                      cellbase/cellbase-core/src/main/resources/configuration.json for an example 
    * -d, --data           STRING     Data model type to be loaded, i.e. genome, gene, ... 
    *     --database       STRING     Data model type to be loaded, i.e. genome, gene, ... 
          --field          STRING     Use this parameter when a custom update of the database documents is required. Indicate here the 
                                      full path to the document field that must be updated, e.g. annotation.populationFrequencies. This 
                                      parameter must be used together with a custom file provided at --input and the data to update 
                                      indicated at --data. 
      -h, --help                      Display this help and exit [false]
    * -i, --input          STRING     Input directory with the JSON data models to be loaded. Can also be used to specify a custom JSON 
                                      file to be loaded (look at the --field parameter). 
      -l, --loader         STRING     Database specific data loader to be used [org.opencb.cellbase.mongodb.loader.MongoDBCellBaseLoader]
      -L, --log-level      STRING     Set the logging level, accepted values are: debug, info, warn, error and fatal [info]
          --num-threads    INT        Number of threads used for loading data into the database [2]
      -v, --verbose        BOOLEAN    [Deprecated] Set the level of the logging [false]
      -D                              Dynamic parameters go here [{}]

Note that the help output above gives --database the same description as --data. As the example below shows, --database takes the name of the target database.

For example, the following command loads the human (GRCh37) variation data model from /mnt/data/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/ into the cellbase_hsapiens_grch37_v4 database and creates the indexes as indicated in the .js scripts within cellbase/cellbase-app/app/mongodb-scripts/. For data models created in section Build Data Models, point -i at the directory that contains them; to load a different data model, change the value of -d:

cellbase/build/bin$ ./cellbase.sh load -d variation --database cellbase_hsapiens_grch37_v4 -i /mnt/data/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/ -L debug -Dmongodb-index-folder=/home/cafetero/appl/dev/cellbase/cellbase-app/app/mongodb-scripts/

Note that the whole loading and indexing process may take around 24 hours to complete, depending on the available hardware.
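
Since each invocation of the load command takes a single -d value, loading several data models means repeating the command. The following is a minimal sketch of such a loop, not part of CellBase itself. The function name load_models and the variables CELLBASE_SH, DATABASE, INPUT_DIR and DRY_RUN are hypothetical, and the list of model names is an assumption (the help text only names genome and gene). The -Dmongodb-index-folder option is left out for brevity. By default the sketch only prints the commands it would run.

```shell
# Sketch only: load several data models in sequence with "cellbase.sh load".
# All variable and function names here are illustrative, not part of CellBase.
CELLBASE_SH="${CELLBASE_SH:-./cellbase.sh}"   # assumes the working directory is cellbase/build/bin
DATABASE="${DATABASE:-cellbase_hsapiens_grch37_v4}"
INPUT_DIR="${INPUT_DIR:-/mnt/data/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/}"
DRY_RUN="${DRY_RUN:-1}"                       # 1 = print the commands, anything else = run them

load_models() {
    for cb_model in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            # Show the command without touching the database.
            echo "$CELLBASE_SH load -d $cb_model --database $DATABASE -i $INPUT_DIR -L info"
        else
            # Stop at the first data model that fails to load.
            "$CELLBASE_SH" load -d "$cb_model" --database "$DATABASE" -i "$INPUT_DIR" -L info || return 1
        fi
    done
}

# Model names are assumptions; check which values -d accepts in your CellBase version.
load_models genome gene variation
```

Setting DRY_RUN=0 before calling load_models runs the commands instead of printing them. Loading one model at a time keeps a failure easy to locate, at the cost of a longer total run.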

Warning notices

When CellBase data is built from scratch, the variant annotation provided by default for the variation dataset is the ENSEMBL variation annotation. The CellBase pre-annotated variation collection is only available in the pre-built models provided at http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/. Likewise, population frequencies from the 1000 Genomes Project, UK10K, GoNL, ExAC, etc. are not included by default when building the models from scratch. The CellBase team obtains these data through additional collaborations, so they are only found in the pre-built variation data models at the same URL.

After all data have been loaded successfully, the corresponding database should look like this:

$ mongo mongodb-dev/cellbase_hsapiens_grch37_v4
MongoDB shell version: 3.0.9
connecting to: mongodb-dev/cellbase_hsapiens_grch37_v4
> show collections;
protein_protein_interaction
clinical
protein
conservation
gene
genome_info
variation_functional_score
genome_sequence
regulatory_region
protein_functional_prediction
variation
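
The listing above can also be checked automatically. The following sketch compares the collections reported by the mongo shell against the eleven names shown above and prints any that are missing. The function names missing_collections and check_database are hypothetical; the mongo options --quiet and --eval and the shell helper db.getCollectionNames() are standard in the mongo shell, but the sketch has not been run against a CellBase installation.

```shell
# Sketch only: report expected CellBase collections that are absent from a database.
# The expected names are taken from the "show collections" listing above.
EXPECTED_COLLECTIONS="protein_protein_interaction clinical protein conservation gene genome_info variation_functional_score genome_sequence regulatory_region protein_functional_prediction variation"

# Reads collection names (one per line) from stdin and prints each
# expected name that does not appear as a whole line in the input.
missing_collections() {
    mc_actual="$(cat)"
    for mc_name in $EXPECTED_COLLECTIONS; do
        printf '%s\n' "$mc_actual" | grep -qx -- "$mc_name" || echo "$mc_name"
    done
}

# Lists the collections of the database given as host/dbname via the
# mongo shell and pipes them through missing_collections.
check_database() {
    mongo --quiet "$1" --eval 'db.getCollectionNames().forEach(function (c) { print(c); })' \
        | missing_collections
}
```

For example, check_database mongodb-dev/cellbase_hsapiens_grch37_v4 prints nothing when every expected collection is present.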