Load Data Models
There are two ways to obtain the data models that populate the CellBase database:
- Build the CellBase knowledgebase from scratch by following the tutorial Download Sources.
- Download pre-built data models from http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/.
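For the second option, the download can be scripted. The following is a minimal sketch, assuming that wget is installed and that the server exposes a browsable directory listing (neither is stated on this page). The function name fetch_models, the DRY_RUN switch and the target directory are illustrative choices, not part of CellBase.

```shell
# Sketch only: mirror the pre-built GRCh37 data models into a local directory.
# The URL is the one given above; everything else is an assumption.
MODELS_URL="http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/"

# fetch_models TARGET_DIR
# With DRY_RUN=1 the wget command is printed instead of executed.
fetch_models() {
    target_dir="$1"
    # -r: recurse; -np: never ascend to the parent directory;
    # -nH: do not create a host-name directory;
    # --cut-dirs=5: drop the five leading path components of the URL;
    # -P: save everything under target_dir.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "wget -r -np -nH --cut-dirs=5 -P $target_dir $MODELS_URL"
    else
        wget -r -np -nH --cut-dirs=5 -P "$target_dir" "$MODELS_URL"
    fi
}
```

Setting DRY_RUN=1 and calling fetch_models /tmp/models prints the command for review. Without DRY_RUN, the same call starts the download.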
Before loading the data models into the database, the CellBase code must be compiled with Maven and configured with the database credentials, as explained in the README.md file.
Use the CellBase CLI to load the data models:
cellbase/build/bin$ ./cellbase.sh load
The following options are required: -i, --input -d, --data --database
Usage: cellbase.sh load [options]
Options:
-C, --config STRING CellBase configuration.json file. Have a look at
cellbase/cellbase-core/src/main/resources/configuration.json for an example
* -d, --data STRING Data model type to be loaded, i.e. genome, gene, ...
* --database STRING Data model type to be loaded, i.e. genome, gene, ...
--field STRING Use this parameter when an custom update of the database documents is required. Indicate herethe
full path to the document field that must be updated, e.g. annotation.populationFrequencies. This
parameter must be used togetherwith a custom file provided at --input and the data to update
indicated at --data.
-h, --help Display this help and exit [false]
* -i, --input STRING Input directory with the JSON data models to be loaded. Can also be used to specify acustom json
file to be loaded (look at the --field parameter).
-l, --loader STRING Database specific data loader to be used [org.opencb.cellbase.mongodb.loader.MongoDBCellBaseLoader]
-L, --log-level STRING Set the logging level, accepted values are: debug, info, warn, error and fatal [info]
--num-threads INT Number of threads used for loading data into the database [2]
-v, --verbose BOOLEAN [Deprecated] Set the level of the logging [false]
-D Dynamic parameters go here [{}]
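As an illustration of the --field option described above, the sketch below assembles a custom-update command from the listed options only. The input file name is a hypothetical placeholder, the expected layout of that file is not documented on this page, and the database name is borrowed from the example further down. The helper field_update_cmd is not part of CellBase.

```shell
# Sketch only: build a custom-update command from the options listed above.
# field_update_cmd INPUT_FILE FIELD
field_update_cmd() {
    input_file="$1"   # custom JSON file holding the values to load (hypothetical)
    field="$2"        # full path of the document field to update
    echo "./cellbase.sh load -d variation --database cellbase_hsapiens_grch37_v4 -i $input_file --field $field"
}

# Print the command so it can be reviewed before running it from cellbase/build/bin.
field_update_cmd /tmp/custom_frequencies.json annotation.populationFrequencies
```

The printed command combines --field with --input and --data, as the help text requires.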
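For example, to load human (GRCh37) data models from a directory such as /tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/ (created in section Build Data Models) into the cellbase_hsapiens_grch37_v4 database, creating the indexes defined in the .js scripts within cellbase/cellbase-app/app/mongodb-scripts/, run the command below. As written, it loads the variation data model (-d variation); adjust the -i and -Dmongodb-index-folder paths to your own installation: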
cellbase/build/bin$ ./cellbase.sh load -d variation --database cellbase_hsapiens_grch37_v4 -i /mnt/data/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/ -L debug -Dmongodb-index-folder=/home/cafetero/appl/dev/cellbase/cellbase-app/app/mongodb-scripts/
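To load several data models in turn, the same command can be generated once per model. This is a minimal sketch: genome, gene and variation are the only model names that appear on this page, so any other value passed to -d is an assumption to check against the CLI. The variables and the load_commands helper are illustrative placeholders.

```shell
# Sketch only: print one load command per data model.
# DATABASE, INPUT_DIR and INDEX_DIR are placeholders to adapt.
DATABASE="cellbase_hsapiens_grch37_v4"
INPUT_DIR="/tmp/data/cellbase/v4/homo_sapiens_grch37/mongodb/"
INDEX_DIR="cellbase/cellbase-app/app/mongodb-scripts/"

# load_commands MODEL...
load_commands() {
    for model in "$@"; do
        echo "./cellbase.sh load -d $model --database $DATABASE -i $INPUT_DIR -Dmongodb-index-folder=$INDEX_DIR"
    done
}

# Model names taken from this page; printed for review rather than executed.
load_commands genome gene variation
```

Once the printed commands look right, they can be run one after another from cellbase/build/bin.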
Note that the whole loading and indexing process may take around 24 hours to complete, depending on the available hardware.
Warning notices
When CellBase data is built from scratch, the variant annotation provided by default for the variation dataset is the ENSEMBL variation annotation. The CellBase pre-annotated variation collection is only available in the pre-built models provided at http://bioinfo.hpc.cam.ac.uk/downloads/cellbase/v4/homo_sapiens_grch37/mongodb/. Likewise, population frequencies from the 1000 Genomes Project, UK10K, GoNL, ExAC, etc. are not included by default when building the models from scratch. The CellBase team obtains these data through additional collaborations, so they are found only in the pre-built variation data models at the same URL.
After all data have been loaded successfully, the database should contain the following collections:
$ mongo mongodb-dev/cellbase_hsapiens_grch37_v4
MongoDB shell version: 3.0.9
connecting to: mongodb-dev/cellbase_hsapiens_grch37_v4
> show collections;
protein_protein_interaction
clinical
protein
conservation
gene
genome_info
variation_functional_score
genome_sequence
regulatory_region
protein_functional_prediction
variation
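The listing above can be turned into a quick check. The sketch below compares a list of collection names, one per line on standard input, against the collections shown above and prints any that are missing. The helper missing_collections is illustrative, and the mongo invocation in the comment assumes the legacy mongo shell used in the session above.

```shell
# Sketch only: report expected collections that are absent from a listing.
EXPECTED="protein_protein_interaction clinical protein conservation gene genome_info variation_functional_score genome_sequence regulatory_region protein_functional_prediction variation"

# missing_collections reads one collection name per line on stdin and
# prints every expected name that does not appear in that input.
missing_collections() {
    actual=$(cat)
    for name in $EXPECTED; do
        # -x matches whole lines only, so "gene" does not match "genome_info";
        # -F treats the name as a literal string.
        printf '%s\n' "$actual" | grep -qxF "$name" || echo "$name"
    done
}

# Possible usage (assumes the legacy mongo shell and the host shown above):
# mongo --quiet --eval 'db.getCollectionNames().forEach(function (c) { print(c); })' \
#     mongodb-dev/cellbase_hsapiens_grch37_v4 | missing_collections
```

No output means every collection in the listing is present.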