Build Overview

Kayla-Morrell edited this page Nov 8, 2019 · 21 revisions

NOTE: Documentation has moved to the README file for this repo

Code in this package builds the db0, OrgDb, PFAM, GO, KEGG, ChipDb, probe, cdf and TxDb packages. As of Bioconductor 3.5 we no longer build KEGG, ChipDb, probe or cdf packages. BSgenome, SNPlocs and XtraSNPlocs packages are built by Herve.

Script overview:

The build system is a set of bash and R scripts in several directories, driven by the top-level master.sh script, which calls the sub-scripts.
High-level steps:

  • clear up disk space
    • Downloading the resources and generating the annotations will take up over 100GB of disk space. Make sure this much space is available; otherwise the process will fail. In the future it may be useful to define a minimum amount of free space that the scripts require before they run.
  • data download, parse, build
    • Run the download scripts.
    • Remove old dbs from the db/ directory; save only metadata.sqlite (also under version control in case you delete it). Do not remove any files from db/ once you start parsing. Products sent there are either needed in a subsequent step or may be the final product.
    • Run the parse scripts:
      src_parse.sh calls data-specific getsrc.sh which calls srcdb.sql. This step generally parses the downloaded data and in some cases creates dbs to be used in the build step, e.g., ensembl.sqlite and in other cases produces the final db product, e.g., PFAM.sqlite. AFAICT these files are created after parse and should not be removed:
      • go -> gosrcsrc.sqlite
      • gene -> genesrc.sqlite
      • ucsc -> gpsrc.sqlite
      • plasmodb -> plasmoDBsrc.sqlite
      • ensembl -> ensembl.sqlite
      • pfam -> PFAM.sqlite
    • Run the build scripts:
      src_build.sh calls data-specific getdb.sh and temp_metadata.sql. Outputs from the build step are the chipsrc* and chipmapsrc* sqlite dbs.
    • Run copyLatest.sh to insert the db schema version in the GO, PFAM, KEGG and YEAST dbs.
    • Run map_counts/scripts/getdb.sh to check the quality of the intermediate sqlite dbs. This script counts tables in a subset of the chipsrc databases and numbers are recorded in the existing map_counts.sqlite. These data are compared to numbers from the last release and a warning is issued for discrepancies >10%. Remember map_counts.sqlite is under version control so we have a record of data loss / gain over the releases.
    • Commit all code changes to git; do not add data files.
  • db0 packages:
    • makeDbZeros.R creates db0 packages by calling out to AnnotationForge::sqlForge_wrapBaseDBPkgs.R.
    • build and check db0s
    • if everything is okay, install the db0 packages.
    • build, check, and install AnnotationForge against new db0s
    • IPI are added to chipsrc*.sqlite pfam and prosite tables in uniprot/processDataForBuild.R
    • spot check:
      chipmapsrc_*.sqlite for mouse should have 8 tables:

      > dbListTables(con)
      [1] "EGList"             "accession"          "accession_unigene"
      [4] "image_acc_from_uni" "metadata"           "refseq"
      [7] "sqlite_stat1"       "unigene"

      chipsrc_*.sqlite for mouse should have 32 tables:

      > dbListTables(con2)
       [1] "accessions"            "chrlengths"            "chromosome_locations"
       [4] "chromosomes"           "cytogenetic_locations" "ec"
       [7] "ensembl"               "ensembl2ncbi"          "ensembl_prot"
      [10] "ensembl_trans"         "gene_info"             "gene_synonyms"
      [13] "genes"                 "go_bp"                 "go_bp_all"
      [16] "go_cc"                 "go_cc_all"             "go_mf"
      [19] "go_mf_all"             "kegg"                  "map_counts"
      [22] "map_metadata"          "metadata"              "mgi"
      [25] "ncbi2ensembl"          "pfam"                  "prosite"
      [28] "pubmed"                "refseq"                "sqlite_stat1"
      [31] "unigene"               "uniprot"
  • OrgDb, PFAM.db, GO.db packages:
    • Modify AnnotationForge/inst/extdata/Gentlemanlab/ANNDBPKG-INDEX.TXT for the version and potentially the path to the sqlite file.
    • Install the modified AnnotationForge. This must be done before building the OrgDb, PFAM and GO db packages because the new versions are in the template in AnnotationForge that was just modified.
    • Run the portion of makeTerminalDBPkgs.R that generates the OrgDbs, PFAM.db and GO.db
    • Install the most current GO.db before building and checking OrgDbs
    • Open an R session and load a newly created OrgDb object to ensure that all resources are up-to-date. Specifically, check the GO and ENSEMBL download dates in the metadata of the object. These should be more recent than the last release.
  • reactome.db
    • This package is contributed by Willem Ligtenberg. We download files from www.reactome.org and add a few tables to Willem's db. Full details in the reactome/README. NOTE: As of Bioconductor 3.5, Willem has been building the full package. He was sent Marc's scripts and he now runs these in addition to his own. There is no longer any need for us to do anything to the reactome.db package.
  • testing
    • Build and check AnnotationDbi and AnnotationForge against the new OrgDbs, PFAM.db, GO.db and reactome.db
  • TxDb packages:
    • Modify GenomicFeatures/inst/script/makeTxDb.R as appropriate
    • Run the portion of makeTerminalDBPkgs.R that generates the TxDbs
  • Spot Check:
    • Load a few of the packages in an R session and check the dates to be sure that the appropriately dated packages are being used.
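
The disk-space requirement in the first step above could be enforced with a small pre-flight check before master.sh is started. The following is only a sketch: the function names free_gb and require_free_gb are hypothetical and not part of the existing scripts, and the 100 GB threshold is taken from the estimate given above rather than from a measured minimum.

```shell
#!/bin/sh
# Hypothetical pre-flight disk-space check (not part of the build scripts).

# Print the free space, in whole GB, on the filesystem holding directory $1.
free_gb() {
    # -P forces the POSIX one-line-per-filesystem layout; -k reports KB blocks.
    # Column 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Return 0 if directory $1 has at least $2 GB free, 1 otherwise.
require_free_gb() {
    dir=$1
    need=$2
    have=$(free_gb "$dir")
    # An empty value means df could not report on the directory.
    [ -n "$have" ] || return 1
    if [ "$have" -lt "$need" ]; then
        echo "Only ${have} GB free under ${dir}; need at least ${need} GB" >&2
        return 1
    fi
    echo "OK: ${have} GB free under ${dir}"
}

# Example: check the current directory against the 100 GB estimate.
require_free_gb . 100 || echo "Free up disk space before running master.sh" >&2
```

Keeping the check in a function that returns a status, rather than exiting, lets a calling script decide whether to stop or only warn.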

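The quality check described above flags counts that differ by more than 10% from the last release. The comparison itself can be illustrated as follows. This is a sketch of the arithmetic only, not the logic of map_counts/scripts/getdb.sh: the function name check_count, its arguments, and the example numbers are all hypothetical, and the real script reads its counts from map_counts.sqlite.

```shell
#!/bin/sh
# Hypothetical illustration of the ">10% discrepancy" rule.

# check_count NAME PREVIOUS CURRENT [THRESHOLD_PERCENT]
# Prints one line starting with "ok" or "WARNING".
check_count() {
    awk -v name="$1" -v prev="$2" -v cur="$3" -v limit="${4:-10}" 'BEGIN {
        # A percentage change is undefined when the previous count is zero.
        if (prev == 0) {
            status = (cur == 0) ? "ok" : "WARNING"
            printf "%s %s: 0 -> %d (no previous count)\n", status, name, cur
            exit
        }
        change = (cur - prev) * 100 / prev        # signed percent change
        magnitude = (change < 0) ? -change : change
        status = (magnitude > limit) ? "WARNING" : "ok"
        printf "%s %s: %d -> %d (%+.1f%%)\n", status, name, prev, cur, change
    }'
}

# Made-up example counts for two tables.
check_count go_bp 1000 1050   # 5% gain, within the threshold
check_count pfam  1000 850    # 15% loss, flagged
```

Both gains and losses are flagged because either can indicate a problem with an upstream download or parse step.
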
Version bumps:

At this time, the project does not have universal guidelines for the versioning of annotation packages. However, we want to develop a systematic approach to version bumps for the packages we generate. Starting with BioC 3.6, we will bump the 'y' portion of the 'x.y.z' version for the annotation packages generated from these scripts, e.g., db0, OrgDb, TxDb, GO.db, PFAM.db, etc.
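
As a small illustration of the 'y' bump, the helper below increments the middle component of an 'x.y.z' version string. It is a sketch, not part of the build scripts: the name bump_y is hypothetical, and resetting 'z' to 0 is an assumption, since the policy above does not say what happens to 'z'.

```shell
#!/bin/sh
# Hypothetical helper: bump the 'y' part of an x.y.z version string.
# Assumption: 'z' is reset to 0 after the bump.
bump_y() {
    echo "$1" | awk -F. '
        NF == 3 { printf "%d.%d.0\n", $1, $2 + 1; ok = 1 }
        END     { exit !ok }   # non-zero status if the input was not x.y.z
    '
}

bump_y 3.5.0   # prints 3.6.0
```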

Propagated but not rebuilt:

  • all ChipDb, probe and cdf packages
  • KEGG.db
  • inparanoid dbs (8 of them here)

FIXMEs:

  • uniprot, unigene, and inparanoid are static and AFAIK there are no replacements
  • tair: current parse script is broken so we are still using data from April 2015; can we find these data from another source?
  • kegg is static: investigate using KEGGREST instead
  • Marc's notes on db0s:
    The .db0 packages need to be phased out in favor of using existing org packages (many of which can now be retrieved from the AnnotationHub) and the new makeChipPackage() function.
  • Should IPI identifiers be removed from the pfam and prosite tables of chipsrc sqlite dbs for human, mouse, rat, zebrafish, bovine, and chicken? IPI was a database of protein IDs maintained by EMBL-EBI that was closed in 2011; replacements are RefSeq and EnsemblProt IDs.
  • Marc's notes on tRNAs:
    From makeTerminalDBPkgs: "... this last script is out of repair (but we need a new solution for tRNAs anyways) source(system.file("script","maketRNAFDb.R", package="GenomicFeatures")) TODO: I need to edit the following script so that it makes it in the TxDbOutDir... I think this has been superceded by something better (AnnotationHub) source(system.file("script","maketRNAFDb.R", package="GenomicFeatures")) ... "
  • old scripts:
    The update_individual.sh script appears to be outdated, and Jim MacDonald didn't run that one; it was intended to build all the individual affy ChipDb packages. Jim outlined how he built the ChipDb packages on the SOP page (https://hedgehog.fhcrc.org/bioconductor/trunk/bioC/Docs/StandardOperatingProcedure/annotation-build-overview.Rmd).

References

Documentation from Marc and Jim:
https://hedgehog.fhcrc.org/bioconductor/trunk/bioC/Docs/StandardOperatingProcedure/annotation-release.Rmd
https://hedgehog.fhcrc.org/bioconductor/trunk/bioC/Docs/StandardOperatingProcedure/annotation-build-overview.Rmd