Removed druggability-only genes from the gene_metadata pre-processing #159

Merged
merged 8 commits into dev from jbeck/AG-1579/gene_metadata_remove_druggability
Nov 25, 2024

Conversation

@jaclynbeck-sage jaclynbeck-sage commented Nov 21, 2024

This PR addresses AG-1579 and also does a good amount of code cleanup/refactoring to reduce code duplication and increase maintainability. The goal of the cleanup is to start moving most of the heavy lifting out of the notebook and into a script with easier-to-maintain code, since this notebook needs to be run periodically. Down the road I want everything to be a script that can be auto-kicked-off, instead of a notebook.

AG-1579:

  1. Updated the code that finds all Ensembl IDs in all ADT-related files so that it excludes the druggability file
  2. Uploaded the new gene_metadata file to Synapse
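
To illustrate item 1, here is a minimal sketch of collecting Ensembl IDs across several files while skipping the druggability file. It is not the code from `preprocessing_utils.get_all_adt_ensembl_ids`: the function name `collect_ensembl_ids`, the file names, and the name-based exclusion rule are all assumptions made for the example.

```python
import re

# Human Ensembl gene IDs are "ENSG" followed by 11 digits.
ENSG_PATTERN = re.compile(r"ENSG\d{11}")


def collect_ensembl_ids(files, exclude=("druggability",)):
    """Return the sorted, de-duplicated Ensembl gene IDs found in `files`.

    `files` maps a (hypothetical) file name to its text content. Any file
    whose name contains one of the `exclude` substrings is skipped, so genes
    that appear only in that file never reach the result.
    """
    ids = set()
    for name, text in files.items():
        if any(token in name for token in exclude):
            continue  # e.g. the druggability file
        ids.update(ENSG_PATTERN.findall(text))
    return sorted(ids)


# Hypothetical inputs: ENSG00000000005 appears only in the druggability file.
example_files = {
    "gene_info.json": '[{"ensembl_gene_id": "ENSG00000000003"}]',
    "druggability.csv": "ensembl_gene_id\nENSG00000000005",
    "proteomics.csv": "ENSG00000000003\nENSG00000000419",
}

print(collect_ensembl_ids(example_files))
# ['ENSG00000000003', 'ENSG00000000419']
```

Passing `exclude=()` would bring the druggability-only ID back, which is a quick way to see how many genes the exclusion removes.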

Cleanup:

  1. Removed outputs from the notebook so future commits are cleaner
  2. Moved several major chunks of code from the notebook into data_analysis/agora/notebooks/preprocessing/preprocessing_utils.py
  3. Refactored those functions to be cleaner, have more documentation/comments, and be properly formatted/linted
  4. Also cleaned up some of the existing code in preprocessing_utils.py
  5. Changed the UniProt mapping notebook to use one of the new preprocessing_utils functions, which it had previously been duplicating in the notebook.
  6. Updated .gitignore with a couple of folders/files that get created when running the notebooks
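
For reference on cleanup item 1, a minimal sketch of what clearing notebook outputs amounts to, assuming the standard nbformat v4 JSON layout (`cells`, `cell_type`, `outputs`, `execution_count`). The helper name `clear_outputs` is hypothetical and this is not necessarily how the outputs were removed in this PR.

```python
import json


def clear_outputs(notebook: dict) -> dict:
    """Empty the outputs and execution counts of every code cell, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


# A tiny hand-written notebook used only for illustration.
example_nb = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["1 + 1"],
            "execution_count": 3,
            "outputs": [
                {
                    "output_type": "execute_result",
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                    "execution_count": 3,
                }
            ],
        },
    ],
}

cleaned = clear_outputs(example_nb)
print(json.dumps(cleaned["cells"][1]["outputs"]))
# []
```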

For ease of comparison (the notebook diffs are hard to read), here is a human-readable summary of what changed:

Code moved:

  • Cells 3 and 4 of old notebook -> preprocessing_utils.get_all_adt_ensembl_ids (refactored quite a bit)
  • Cell 10 of old notebook -> preprocessing_utils.standardize_list_item (list-of-lists code also removed because mygene no longer returns those)
  • Cells 11 and 12 of old notebook -> preprocessing_utils.merge_duplicate_ensembl_ids (refactored quite a bit)
    • Code that puts non-LOC### genes first was removed: with the removal of druggability genes this is no longer relevant
  • Cell 14 of old notebook -> preprocessing_utils.query_ensembl_version_api
  • Cell 25 of old notebook -> refactored to be cleaner
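
As a rough picture of the kind of logic that was moved, the sketch below shows one way to coerce a scalar-or-list field into a list and then merge rows that share an Ensembl ID. The function names match the ones listed above, but the bodies are illustrative guesses rather than the refactored code in `preprocessing_utils.py`; the record layout (`ensembl_gene_id`, `symbol`, `alias`) and the merge rule (keep the first symbol, union the aliases) are assumptions.

```python
def standardize_list_item(item):
    """Coerce a value that may be None, a scalar, or a list into a flat list."""
    if item is None:
        return []
    if isinstance(item, (list, tuple)):
        return list(item)
    return [item]


def merge_duplicate_ensembl_ids(records):
    """Merge records that share an Ensembl ID.

    Keeps the first non-missing symbol seen for each ID and unions the
    aliases, preserving first-seen order.
    """
    merged = {}
    for rec in records:
        gene_id = rec["ensembl_gene_id"]
        entry = merged.setdefault(
            gene_id, {"ensembl_gene_id": gene_id, "symbol": None, "alias": []}
        )
        if entry["symbol"] is None:
            entry["symbol"] = rec.get("symbol")
        for alias in standardize_list_item(rec.get("alias")):
            if alias not in entry["alias"]:
                entry["alias"].append(alias)
    return list(merged.values())


# Placeholder symbols and aliases, not real annotation data.
records = [
    {"ensembl_gene_id": "ENSG00000000003", "symbol": "GENE1", "alias": "A1"},
    {"ensembl_gene_id": "ENSG00000000003", "symbol": "GENE1", "alias": ["A1", "A2"]},
    {"ensembl_gene_id": "ENSG00000000419", "symbol": "GENE2", "alias": None},
]

for row in merge_duplicate_ensembl_ids(records):
    print(row)
# {'ensembl_gene_id': 'ENSG00000000003', 'symbol': 'GENE1', 'alias': ['A1', 'A2']}
# {'ensembl_gene_id': 'ENSG00000000419', 'symbol': 'GENE2', 'alias': []}
```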

@jaclynbeck-sage jaclynbeck-sage marked this pull request as ready for review November 22, 2024 22:18
@jaclynbeck-sage jaclynbeck-sage requested a review from a team as a code owner November 22, 2024 22:18
@JessterB JessterB left a comment

lgtm!

@jaclynbeck-sage jaclynbeck-sage merged commit 6f41530 into dev Nov 25, 2024
9 checks passed
@jaclynbeck-sage jaclynbeck-sage deleted the jbeck/AG-1579/gene_metadata_remove_druggability branch November 25, 2024 20:47