feat: adding parser for uniprot_variants evidence #214

DSuveges · 2024-09-11T22:06:20Z

Evidence generation process:

Disease/target/variant evidence is fetched from the UniProt SPARQL API.
Variants are reported as rsIDs, so they have to be mapped to variant IDs (chr_pos_alt_ref). This is not straightforward, because the alleles need to be matched with their predicted amino acid change on the given UniProt entry. The mapping is very slow (~7.5 hours for all 24k unique rsIDs on a single thread), so it was implemented incrementally: mappings are cached, only rsIDs missing from the cache are requested, and the new mappings are then added to the cache.
Disease labels/OMIM identifiers are mapped to EFO via OnToma.
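
The incremental mapping described above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the function names, the in-memory dictionary cache, the `map_chunk` callable, and the default chunk size are all assumptions made for the example. How the cache is stored on disk is not shown.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def map_rsids_incrementally(
    rsids: Iterable[str],
    cache: Dict[str, List[str]],
    map_chunk: Callable[[List[str]], Dict[str, List[str]]],
    chunk_size: int = 200,  # hypothetical default, not taken from the PR
) -> Dict[str, List[str]]:
    """Return mappings for `rsids`, calling `map_chunk` only for uncached ones.

    `cache` is updated in place so the caller can persist it afterwards.
    """
    unique = set(rsids)  # de-duplicate the requested rsIDs
    # Only rsIDs absent from the cache are sent to the (slow) mapper.
    missing = sorted(unique.difference(cache))
    for chunk in chunked(missing, chunk_size):
        cache.update(map_chunk(chunk))
    # rsIDs the mapper could not resolve are simply left out of the result.
    return {rsid: cache[rsid] for rsid in unique if rsid in cache}
```

With a stub standing in for the slow lookup, a second call with the same rsIDs would not invoke `map_chunk` at all, since every identifier is already in the cache.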

What is missing:

Classify confidence based on targetToDiseaseAnnotation.
Classify direction of effect/target modulation based on variantToTargetAnnotation.

Usage:

Calling the parser:

python modules/UniprotVariants.py \
    --rsid_cache /Users/dsuveges/repositories/evidence_datasource_parsers/rsid_cache \
    --ontoma_cache_dir ontoma_cache \
    --output_file uniprot_evidence_test.json.gz
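
For orientation, the three flags used in the command above could be declared with `argparse` along these lines. Only the flag names come from the command. The help texts, the `required=True` settings, and the `build_parser` helper are assumptions for illustration and may differ from the actual module.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a command-line parser exposing the three flags shown above."""
    parser = argparse.ArgumentParser(
        description="Sketch of a CLI for a UniProt variants evidence parser."
    )
    # Location of the cached rsID mappings (assumed to be mandatory).
    parser.add_argument("--rsid_cache", required=True, help="Path to the rsID mapping cache.")
    # Directory used for caching OnToma data (assumed to be mandatory).
    parser.add_argument("--ontoma_cache_dir", required=True, help="OnToma cache directory.")
    # Destination of the generated evidence file (assumed to be mandatory).
    parser.add_argument("--output_file", required=True, help="Output evidence file.")
    return parser
```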

Logs from running the parser on a small example dataset:

2024-09-11 22:32:33 INFO UniprotVariants - main: Starting Uniprot evidence parser.
2024-09-11 22:32:33 INFO UniprotVariants - main: Output file: uniprot_evidence_test.json.gz
2024-09-11 22:32:37 INFO UniprotVariants - extract_variants_data: Extracting Uniprot variants data from https://sparql.uniprot.org/sparql API endpoint.
2024-09-11 22:33:03 INFO UniprotVariants - extract_variants_data: Data extraction completed.
2024-09-11 22:33:03 INFO UniprotVariants - extract_variants_data: Number of disease/target/variant evidence: 41475.
2024-09-11 22:33:14 INFO UniprotVariants - main: Number of unique rsids: 13.
2024-09-11 22:33:14 INFO UniprotVariants - __init__: Reading cache from: /Users/dsuveges/repositories/evidence_datasource_parsers/rsid_cache.
2024-09-11 22:33:16 INFO UniprotVariants - __init__: Number of cached rsids: 72.
2024-09-11 22:33:16 INFO UniprotVariants - map_rsids: Number of missing rsids that needs to be mapped: 13.
2024-09-11 22:33:16 INFO UniprotVariants - map_rsids: Missing rsids are broken down into 1 chunks.
2024-09-11 22:33:16 INFO UniprotVariants - map_rsids: Mapping chunk 0...
2024-09-11 22:33:29 INFO UniprotVariants - map_rsids: Mapping rsids completed.
2024-09-11 22:33:30 INFO UniprotVariants - main: Resolving variant ids.
2024-09-11 22:33:31 INFO UniprotVariants - main: Adding EFO mappings.
2024-09-11 22:33:40 INFO UniprotVariants - main: Writing data.

DSuveges added 2 commits September 11, 2024 18:20

feat: adding uniprot variant parser

2124416

fix: fixing cache management

2c98bbf

DSuveges linked an issue Sep 12, 2024 that may be closed by this pull request

Improve disease/target evidence ingestion from uniprot opentargets/issues#3459

Closed

8 tasks

DSuveges added 3 commits September 13, 2024 09:35

fix: changing join from inner to left

c0fe4ef

merge master

acb7fb2

feat: adding confidence to uniprot

e71e251

DSuveges marked this pull request as draft January 7, 2025 14:32

DSuveges added 4 commits January 7, 2025 16:36

feat: abstracting uniprot parsers and rsid mapper

db93767

feat: unifying uniprot literature parser

9c9dbfb

refactor: cleaning up the uniprot parsers

9956a20

fix: making sure the mapped list of rsids is unique

4aa1743
