Removed druggability-only genes from the gene_metadata pre-processing #159

jaclynbeck-sage · 2024-11-21T03:13:38Z

This PR addresses AG-1579 and also does a good amount of code cleanup/refactoring to reduce code duplication and increase maintainability. The goal of the cleanup is to start moving most of the heavy lifting out of the notebook and into a script with easier-to-maintain code, since this notebook needs to be run periodically. Down the road I want everything to be a script that can be kicked off automatically, instead of a notebook.

AG-1579:

Updated the code that finds all Ensembl IDs in all ADT-related files so that it excludes the druggability file
Uploaded the new gene_metadata file to Synapse
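
For reviewers who want the idea without opening the diff, the exclusion described above can be sketched roughly as follows. This is a minimal illustration, not the code in preprocessing_utils.py: the function name `collect_ensembl_ids`, the dataset keys, and the `ensembl_gene_id` column name are all assumptions made for the example.

```python
from typing import Iterable, Mapping

# Hypothetical names: the dataset keys and the "ensembl_gene_id" column are
# assumptions for illustration, not the repository's actual identifiers.
EXCLUDED_DATASETS = frozenset({"druggability"})


def collect_ensembl_ids(
    datasets: Mapping[str, Iterable[Mapping[str, str]]],
    excluded: frozenset = EXCLUDED_DATASETS,
    id_column: str = "ensembl_gene_id",
) -> list:
    """Return the sorted, de-duplicated Ensembl IDs found in every dataset
    except those named in `excluded`."""
    ids = set()
    for name, rows in datasets.items():
        if name in excluded:
            # Genes that appear only in an excluded dataset are dropped.
            continue
        for row in rows:
            value = row.get(id_column)
            if value:  # skip missing or empty IDs
                ids.add(value)
    return sorted(ids)


# Toy input: each "file" is a list of row dictionaries.
datasets = {
    "rna_de": [
        {"ensembl_gene_id": "ENSG00000000003"},
        {"ensembl_gene_id": "ENSG00000000005"},
    ],
    "proteomics": [{"ensembl_gene_id": "ENSG00000000005"}],
    "druggability": [{"ensembl_gene_id": "ENSG00000000419"}],
}
all_ids = collect_ensembl_ids(datasets)
```

In this toy input, the ID that appears only in the druggability dataset is left out of `all_ids`, which is the behavior AG-1579 asks for.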

Cleanup:

Removed outputs from the notebook so future commits are cleaner
Moved several major chunks of code from the notebook into data_analysis/agora/notebooks/preprocessing/preprocessing_utils.py
Refactored those functions to be cleaner, have more documentation/comments, and be properly formatted/linted
Also cleaned up some of the existing code in preprocessing_utils.py
Changed the UniProt mapping notebook to use one of the new preprocessing_utils functions, which it had previously been duplicating in the notebook.
Updated .gitignore with a couple of folders/files that get created when running the notebooks

For ease of comparison (the diffs of the notebook are hard to read clearly), here is a human-readable summary of what changed in the notebook:

Code moved:

Cells 3 and 4 of old notebook -> preprocessing_utils.get_all_adt_ensembl_ids (refactored quite a bit)
Cell 10 of old notebook -> preprocessing_utils.standardize_list_item (list-of-lists code also removed because mygene no longer returns those)
Cells 11 and 12 of old notebook -> preprocessing_utils.merge_duplicate_ensembl_ids (refactored quite a bit)
- Code that puts non LOC### genes first was removed: with the removal of druggability genes this is no longer relevant
Cell 14 of old notebook -> preprocessing_utils.query_ensembl_version_api
Refactor of cell 25 of old notebook to be cleaner
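
To give a feel for what the list standardization and duplicate merging listed above involve, here is a rough sketch. It is not the refactored code itself: the `_sketch` function names, the record layout (`ensembl_gene_id`, `symbol`, `alias`), and the merge rule (union of list fields, first non-empty symbol wins) are assumptions chosen for illustration.

```python
from typing import Any


def standardize_list_item_sketch(item: Any) -> list:
    """Normalize a value that may be missing, a scalar, or a list into a
    sorted list of unique, non-empty strings."""
    if item is None:
        return []
    values = item if isinstance(item, (list, tuple, set)) else [item]
    return sorted({str(v) for v in values if v is not None and str(v) != ""})


def merge_duplicate_ensembl_ids_sketch(records: list) -> list:
    """Collapse records that share an Ensembl ID into one record.

    Assumed merge rule: list-valued fields are unioned and the first
    non-empty symbol is kept. Output order follows first appearance.
    """
    merged = {}
    for rec in records:
        gene_id = rec["ensembl_gene_id"]
        entry = merged.setdefault(
            gene_id, {"ensembl_gene_id": gene_id, "symbol": None, "alias": []}
        )
        if entry["symbol"] is None and rec.get("symbol"):
            entry["symbol"] = rec["symbol"]
        # Standardize the incoming field first, then re-standardize the union.
        entry["alias"] = standardize_list_item_sketch(
            entry["alias"] + standardize_list_item_sketch(rec.get("alias"))
        )
    return list(merged.values())


records = [
    {"ensembl_gene_id": "ENSG00000000003", "symbol": "A", "alias": "a1"},
    {"ensembl_gene_id": "ENSG00000000003", "symbol": "A", "alias": ["a2", "a1"]},
    {"ensembl_gene_id": "ENSG00000000005", "symbol": "B", "alias": None},
]
merged_records = merge_duplicate_ensembl_ids_sketch(records)
```

Standardizing each field before merging means the merge step only ever deals with flat lists of strings, whichever shape the upstream query returned.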

sonarqubecloud · 2024-11-22T22:56:04Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

JessterB

lgtm!

jaclynbeck-sage added 6 commits November 20, 2024 19:12

Removed druggability-only genes from the gene_metadata pre-processing step, and bumped the version of the file in the config to match the new file on Synapse

72e2e1c

Undid bump in gene_metadata version, it can't be increased until druggability removed from gene_info

5e2e3ea

Addressed SonarCloud issue with exceptions, updated gitignore with a few more local files to ignore

bba8fc1

More code cleanup, moved duplicate ensembl ID handling to preprocessing_utils

e5f7597

Updated uniprot mapping script to use new preprocessing function to get all ADT ids

503d301

Fixed standardize_list_item to work for possible_replacement

ab1bb82

jaclynbeck-sage marked this pull request as ready for review November 22, 2024 22:18

jaclynbeck-sage requested a review from a team as a code owner November 22, 2024 22:18

jaclynbeck-sage added 2 commits November 22, 2024 14:50

Fix to possible_replacement so the list field is standardized after it's actually a list of strings

3a60655

Updated comment in the standardize list function

ce9dc5e

BWMac approved these changes Nov 25, 2024

JessterB approved these changes Nov 25, 2024

jaclynbeck-sage merged commit 6f41530 into dev Nov 25, 2024
9 checks passed

jaclynbeck-sage deleted the jbeck/AG-1579/gene_metadata_remove_druggability branch November 25, 2024 20:47
