genes_json_file
Dave Lawrence edited this page Nov 21, 2022 · 3 revisions
PyReference now uses cdot data files to load gene/transcript information. cdot provides transcripts for HGVS resolution, so it needs to work with all historical and latest versions of the GTF/GFF3 files from RefSeq and Ensembl. Sharing one JSON format between the two projects reduces maintenance effort going forward.
cdot hosts pre-built JSON data.
Below are the latest files for RefSeq/Ensembl. Note: GRCh37 is not updated frequently, so its annotation can be quite old.
- RefSeq GRCh37 - annotation v105 (2022-03-07) - 16 MB
- RefSeq GRCh38 - annotation v110 (2022-04-12) - 28 MB
- Ensembl GRCh37 - only annotation v87 (2017-03-20) - 27 MB
- Ensembl GRCh38 - only annotation v108 (2022-10-04) - 34 MB
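After downloading one of the pre-built files, it can be worth confirming that the archive is intact and looking at the start of the JSON before pointing PyReference at it. The following is a minimal sketch using only `gzip` and `head`; the function names `check_json_gz` and `peek_json_gz` are hypothetical helpers, not part of cdot or PyReference, and the file path is whatever you saved the download as.

```shell
# Hypothetical helpers (not part of cdot or PyReference).

# Return success if the gzip archive passes its integrity check.
check_json_gz() {
    gzip -t -- "$1"
}

# Print the first N bytes (default 300) of the decompressed JSON,
# enough to see the opening keys without decompressing the whole file to disk.
peek_json_gz() {
    local file="$1" bytes="${2:-300}"
    gzip -dc -- "$file" | head -c "$bytes"
    echo
}
```

Usage would look like `check_json_gz path/to/download.json.gz && peek_json_gz path/to/download.json.gz`, where the path is a placeholder for your local copy.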
See cdot wiki
git clone https://github.com/SACGF/cdot
export CDOT_DIR=$(pwd)/cdot/generate_transcript_data
# This generates a gene info JSON file (only need 1 for all generated gene JSON files)
export EMAIL=your_email@example.com # Make sure to change this (placeholder address; the variable name EMAIL is assumed, check gene_info.sh)
${CDOT_DIR}/gene_info.sh
CDOT_VERSION=$(${CDOT_DIR}/cdot_json.py --version)
GENE_INFO_JSON=gene-info-${CDOT_VERSION}.json.gz
# Example for RefSeq GRCh38
FILENAME=GCF_000001405.40_GRCh38.p14_genomic.gff.gz
URL=https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/annotation_releases/110/GCF_000001405.40_GRCh38.p14/${FILENAME}
wget ${URL}
${CDOT_DIR}/cdot_json.py gff3_to_json --url=${URL} --genome-build=GRCh38 --gene-info-json ${GENE_INFO_JSON} --output ${FILENAME}.json.gz ${FILENAME}
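If you need to repeat the download-and-convert step for more than one annotation release, the commands above can be wrapped in a small function. This is only a sketch: `filename_from_url` and `build_cdot_json` are hypothetical names, the function assumes `CDOT_DIR` is set as above, and it uses only the `gff3_to_json` options already shown on this page.

```shell
# Hypothetical helpers wrapping the commands above (not part of cdot).

# Derive the local file name from a download URL (its last path component).
filename_from_url() {
    basename "$1"
}

# Download a GFF3 file and convert it to cdot JSON.
# Arguments: URL, genome build (e.g. GRCh38), gene info JSON path.
build_cdot_json() {
    local url="$1" genome_build="$2" gene_info_json="$3"
    local filename
    filename=$(filename_from_url "$url")

    # --no-clobber skips the download if the file is already present.
    wget --no-clobber "$url"

    "${CDOT_DIR:?CDOT_DIR must be set}/cdot_json.py" gff3_to_json \
        --url="$url" \
        --genome-build="$genome_build" \
        --gene-info-json "$gene_info_json" \
        --output "${filename}.json.gz" \
        "$filename"
}
```

With the variables from the example above, the call would be `build_cdot_json "${URL}" GRCh38 "${GENE_INFO_JSON}"`.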