feat: adding parser for uniprot_variants evidence #214

Draft
wants to merge 9 commits into
base: master


DSuveges
Contributor

Evidence generation process:

  1. Disease/target/variant evidence is fetched from the Uniprot SPARQL API.
  2. As variants are given as rs identifiers (rsids), they have to be mapped to variant ids (chr_pos_alt_ref). This is not straightforward, because alleles need to be matched with their predicted amino acid change on the given Uniprot entry. The mapping is also very slow (~7.5 hours for all 24k unique rsids on a single thread), so it is done incrementally: mappings are cached, only rsids missing from the cache are requested, and the new mappings are then added to the cache.
  3. Disease labels/OMIM identifiers are mapped to EFO via Ontoma.
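
The incremental mapping in step 2 could look roughly like the sketch below. This is not the code in this PR: the JSON cache file, the function names, the `fetch_mappings` callable and the chunk size are all assumptions made for illustration, and the real cache format and lookup service may differ.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def load_cache(cache_path: Path) -> Dict[str, dict]:
    """Read the rsid -> mapping cache; an absent file means an empty cache."""
    if not cache_path.exists():
        return {}
    return json.loads(cache_path.read_text())


def map_rsids_incrementally(
    rsids: List[str],
    cache_path: Path,
    fetch_mappings: Callable[[List[str]], Dict[str, dict]],  # hypothetical slow lookup
    chunk_size: int = 200,  # assumed value, not taken from the PR
) -> Dict[str, dict]:
    """Return mappings for `rsids`, querying only those missing from the cache."""
    cache = load_cache(cache_path)
    missing = sorted(set(rsids) - set(cache))
    for chunk in chunked(missing, chunk_size):
        cache.update(fetch_mappings(chunk))
        # Persist after every chunk so an interrupted run keeps its progress.
        cache_path.write_text(json.dumps(cache))
    return {rsid: cache[rsid] for rsid in set(rsids) if rsid in cache}
```

On a second run with the same cache, only rsids that have not been seen before reach `fetch_mappings`, which is what keeps repeated runs short.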

What is missing

  • Classifying confidence based on targetToDiseaseAnnotation
  • Classifying direction of effect/target modulation based on variantToTargetAnnotation

Usage:

Calling the parser:

python modules/UniprotVariants.py \
    --rsid_cache /Users/dsuveges/repositories/evidence_datasource_parsers/rsid_cache \
    --ontoma_cache_dir ontoma_cache \
    --output_file uniprot_evidence_test.json.gz
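
For reference, the three flags used above could be declared with `argparse` along these lines. This is only a sketch: the flag names come from the command above, but the help texts and the choice to make every flag required are assumptions, not necessarily what modules/UniprotVariants.py does.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the command line flags shown in the usage example."""
    parser = argparse.ArgumentParser(
        description="Generate disease/target/variant evidence from Uniprot."
    )
    # Whether each flag is required is an assumption made for this sketch.
    parser.add_argument(
        "--rsid_cache", required=True,
        help="Location of the cached rsid to variant id mappings.",
    )
    parser.add_argument(
        "--ontoma_cache_dir", required=True,
        help="Directory used by Ontoma to cache its lookups.",
    )
    parser.add_argument(
        "--output_file", required=True,
        help="Path of the gzipped JSON evidence file to write.",
    )
    return parser.parse_args(argv)
```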

Logs running the parser on a small example dataset:

2024-09-11 22:32:33 INFO UniprotVariants - main: Starting Uniprot evidence parser.
2024-09-11 22:32:33 INFO UniprotVariants - main: Output file: uniprot_evidence_test.json.gz
2024-09-11 22:32:37 INFO UniprotVariants - extract_variants_data: Extracting Uniprot variants data from https://sparql.uniprot.org/sparql API endpoint.
2024-09-11 22:33:03 INFO UniprotVariants - extract_variants_data: Data extraction completed.
2024-09-11 22:33:03 INFO UniprotVariants - extract_variants_data: Number of disease/target/variant evidence: 41475.
2024-09-11 22:33:14 INFO UniprotVariants - main: Number of unique rsids: 13.
2024-09-11 22:33:14 INFO UniprotVariants - __init__: Reading cache from: /Users/dsuveges/repositories/evidence_datasource_parsers/rsid_cache.
2024-09-11 22:33:16 INFO UniprotVariants - __init__: Number of cached rsids: 72.
2024-09-11 22:33:16 INFO UniprotVariants - map_rsids: Number of missing rsids that needs to be mapped: 13.
2024-09-11 22:33:16 INFO UniprotVariants - map_rsids: Missing rsids are broken down into 1 chunks.
2024-09-11 22:33:16 INFO UniprotVariants - map_rsids: Mapping chunk 0...
2024-09-11 22:33:29 INFO UniprotVariants - map_rsids: Mapping rsids completed.
2024-09-11 22:33:30 INFO UniprotVariants - main: Resolving variant ids.
2024-09-11 22:33:31 INFO UniprotVariants - main: Adding EFO mappings.
2024-09-11 22:33:40 INFO UniprotVariants - main: Writing data.

@DSuveges DSuveges linked an issue Sep 12, 2024 that may be closed by this pull request
8 tasks
@DSuveges DSuveges marked this pull request as draft January 7, 2025 14:32

Successfully merging this pull request may close these issues.

Improve disease/target evidence ingestion from uniprot