Update gene info processing for druggability revamp #163

Merged
2 changes: 1 addition & 1 deletion .github/workflows/dev.yml
@@ -15,7 +15,7 @@ on:
jobs:
# test job includes unit tests and coverage
pre-commit:
runs-on: ubuntu-latest
runs-on: ubuntu-20.04
steps:
- uses: actions/checkout@v4
with: { fetch-depth: 0 } # deep clone for setuptools-scm
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -200,7 +200,7 @@ These expectations are defined in the `/great_expectations/gx/plugins/expectatio

#### Nested Columns

If the transform includes nested columns (example: `druggability` column in `gene_info` transform), please follow these four steps:
If the transform includes nested columns (example: `ensembl_info` column in `gene_info` transform), please follow these four steps:
1. In the config file, add the nested column name to the `gx_nested_columns` flag for the specific transform. This will convert the column values to a JSON parsable string.
```
gx_nested_columns:
12 changes: 6 additions & 6 deletions config.yaml
@@ -144,7 +144,7 @@ datasets:
- gene_info:
files:
- name: gene_metadata
id: syn25953363.13
id: syn25953363.14
format: feather
- name: igap
id: syn12514826.5
@@ -162,16 +162,16 @@ datasets:
- name: median_expression
id: syn27211878.2
format: csv
- name: druggability
id: syn13363443.11
format: csv
- <<: *genes_biodomains_files
- name: tep_adi_info
id: syn51942280.3
format: csv
- name: ensg_to_uniprot_mapping
id: syn54113663.3
format: tsv
- name: pharos_classes
id: syn64123611.1
format: csv
final_format: json
custom_transformations:
adjusted_p_value_threshold: 0.05
@@ -192,7 +192,7 @@ datasets:
uniprotkb_accession: uniprotkb_accessions
resource_identifier: ensembl_gene_id
provenance:
- syn25953363.13
- syn25953363.14
- syn12514826.5
- syn12514912.3
- *agora_proteomics_provenance
@@ -201,10 +201,10 @@ datasets:
- *rna_diff_expr_data_provenance
- syn12540368.51
- syn27211878.2
- syn13363443.11
- *genes_biodomains_provenance
- syn51942280.3
- syn54113663.3
- syn64123611.1
agora_rename:
symbol: hgnc_symbol
destination: *dest
4 changes: 2 additions & 2 deletions gx_suite_definitions/gene_info.ipynb
@@ -272,7 +272,7 @@
"# biodomains\n",
"validator.expect_column_values_to_be_of_type(\"biodomains\", \"list\")\n",
"validator.expect_column_values_to_have_list_members_of_type(column=\"biodomains\", member_type=\"str\", mostly=0.95)\n",
"validator.expect_column_values_to_have_list_members(column=\"biodomains\", list_members={\n",
"validator.expect_column_values_to_have_list_members(column=\"biodomains\", list_members=sorted([\n",
" 'Apoptosis',\n",
" 'Vasculature',\n",
" 'Lipid Metabolism',\n",
@@ -292,7 +292,7 @@
" 'RNA Spliceosome',\n",
" 'Tau Homeostasis',\n",
" 'Myelination'\n",
" }\n",
" ])\n",
")"
]
},
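The diff does not say why the set literal was replaced with `sorted([...])`. One plausible reading, offered here as an assumption, is that a sorted list gives a stable, JSON-serializable ordering, which would match the alphabetized `list_members` in the regenerated expectation suite. A minimal sketch using a shortened, illustrative subset of the biodomain names:

```python
import json

# Illustrative subset of the biodomain names used in the notebook.
biodomain_set = {"Vasculature", "Apoptosis", "APP Metabolism", "Lipid Metabolism"}

# A set has no defined order and cannot be serialized by the json module.
try:
    json.dumps(biodomain_set)
    set_is_serializable = True
except TypeError:
    set_is_serializable = False

# sorted() returns a list ordered by code point, so the uppercase
# "APP Metabolism" sorts before "Apoptosis".
stable_members = sorted(biodomain_set)
serialized = json.dumps(stable_members)
print(serialized)
```

The code-point ordering is the same one visible in the updated suite, where "APP Metabolism" precedes "Apoptosis".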
22 changes: 7 additions & 15 deletions src/agoradatatools/etl/transform/gene_info.py
@@ -21,7 +21,7 @@ def transform_gene_info(
proteomics_srm = transform.transform_proteomics(df=datasets["proteomics_srm"])
target_list = datasets["target_list"]
median_expression = datasets["median_expression"]
druggability = datasets["druggability"]
pharos_classes = datasets["pharos_classes"]
biodomains = datasets["genes_biodomains"]
tep_info = datasets["tep_adi_info"]
uniprot = datasets["ensg_to_uniprot_mapping"]
@@ -49,19 +49,6 @@ def transform_gene_info(
.reset_index()
)

# these are the interesting columns of the druggability dataset
useful_columns = [
"ensembl_gene_id",
"sm_druggability_bucket",
"safety_bucket",
"abability_bucket",
"pharos_class",
"classification",
"safety_bucket_definition",
"abability_bucket_definition",
]
druggability = druggability[useful_columns]

target_list = nest_fields(
df=target_list,
grouping="ensembl_gene_id",
@@ -77,10 +64,15 @@ def transform_gene_info(
)

druggability = nest_fields(
df=druggability,
df=(
pharos_classes.groupby("ensembl_gene_id")["pharos_class"]
.apply(list)
.reset_index()
),
grouping="ensembl_gene_id",
new_column="druggability",
drop_columns=["ensembl_gene_id"],
nested_field_is_list=False,
)

biodomains = biodomains.dropna(subset=["biodomain", "ensembl_gene_id"])
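The grouping expression passed to `nest_fields` can be sketched on its own. The example below assumes only pandas and uses three rows taken from the `pharos_classes_good_input.csv` test asset. `nest_fields` is internal to the project and is not reproduced here, so this shows only the intermediate frame it receives.

```python
import pandas as pd

# Three rows from the pharos_classes test input; KRIT1 appears twice.
pharos_classes = pd.DataFrame(
    {
        "ensembl_gene_id": ["ENSG00000001626", "ENSG00000001631", "ENSG00000001631"],
        "uniprot_id": ["P13569", "O00522", "O00522"],
        "hgnc_symbol": ["CFTR", "KRIT1", "KRIT1"],
        "pharos_class": ["Tclin", "Tbio", "Tchem"],
    }
)

# Collapse to one row per gene, collecting every pharos_class into a list.
collapsed = (
    pharos_classes.groupby("ensembl_gene_id")["pharos_class"]
    .apply(list)
    .reset_index()
)
print(collapsed)
```

KRIT1 (`ENSG00000001631`) carries two Pharos classes in the test input, so its row collapses to a two-element list, while CFTR keeps a single-element list.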
@@ -431,46 +431,36 @@
"json_schema": {
"$id": "http://example.com/example.json",
"$schema": "https://json-schema.org/draft/2019-09/schema",
"default": [],
"items": {
"default": {},
"properties": {
"abability_bucket": {
"type": "number"
},
"abability_bucket_definition": {
"maxLength": 1000,
"minLength": 44,
"type": "string"
},
"classification": {
"maxLength": 1000,
"minLength": 22,
"type": "string"
},
"pharos_class": {
"type": [
"string",
"null"
]
},
"safety_bucket": {
"type": "number"
},
"safety_bucket_definition": {
"maxLength": 1000,
"minLength": 50,
"default": null,
"examples": [
{
"pharos_class": [
"Tchem"
]
}
],
"properties": {
"pharos_class": {
"default": [],
"items": {
"default": "",
"enum": [
"Tdark",
"Tchem",
"Tbio",
"Tclin",
null
],
"title": "Pharos object",
"type": "string"
},
"sm_druggability_bucket": {
"type": "number"
}
},
"type": "object"
"title": "The pharos_class Schema",
"type": "array"
}
},
"title": "Druggability Schema",
"title": "Root Schema",
"type": [
"array",
"object",
"null"
]
}
@@ -516,25 +506,25 @@
"kwargs": {
"column": "biodomains",
"list_members": [
"Myelination",
"Vasculature",
"Synapse",
"Immune Response",
"DNA Repair",
"APP Metabolism",
"Apoptosis",
"Autophagy",
"Endolysosome",
"Proteostasis",
"Mitochondrial Metabolism",
"Cell Cycle",
"DNA Repair",
"Endolysosome",
"Epigenetic",
"Immune Response",
"Lipid Metabolism",
"Metal Binding and Homeostasis",
"Mitochondrial Metabolism",
"Myelination",
"Oxidative Stress",
"Proteostasis",
"RNA Spliceosome",
"Structural Stabilization",
"Synapse",
"Tau Homeostasis",
"Apoptosis",
"Oxidative Stress",
"APP Metabolism",
"Structural Stabilization"
"Vasculature"
]
},
"meta": {}
@@ -1,40 +1,31 @@
{
"$schema": "https://json-schema.org/draft/2019-09/schema",
"$id": "http://example.com/example.json",
"type": ["array", "null"],
"default": [],
"title": "Druggability Schema",
"items": {
"type": "object",
"default": {},
"properties": {
"sm_druggability_bucket": {
"type": "number"
},
"safety_bucket": {
"type": "number"
},
"abability_bucket": {
"type": "number"
},
"pharos_class": {
"type": ["string", "null"]
},
"classification": {
"type": ["object", "null"],
"default": null,
"title": "Root Schema",
"properties": {
"pharos_class": {
"type": "array",
"default": [],
"title": "The pharos_class Schema",
"items": {
"type": "string",
"minLength": 22,
"maxLength": 1000
},
"safety_bucket_definition": {
"type": "string",
"minLength": 50,
"maxLength": 1000
},
"abability_bucket_definition": {
"type": "string",
"minLength": 44,
"maxLength": 1000
"default": "",
"title": "Pharos object",
"enum": [
"Tdark",
"Tchem",
"Tbio",
"Tclin",
null
]
}
}
}
},
"examples": [{
"pharos_class": [
"Tchem"
]
}]
}
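To illustrate what the new schema accepts, here is a hand-written, stdlib-only check that mirrors its constraints. The function name `matches_druggability_schema` is hypothetical, and this is a sketch under that assumption rather than a substitute for a real JSON Schema validator.

```python
from typing import Any

# The four Pharos target development levels named in the schema's enum.
PHAROS_CLASSES = {"Tdark", "Tchem", "Tbio", "Tclin"}


def matches_druggability_schema(value: Any) -> bool:
    """Hypothetical helper mirroring the schema's type, items, and enum rules."""
    # Root type is ["object", "null"].
    if value is None:
        return True
    if not isinstance(value, dict):
        return False
    # The schema declares no required properties, so the key may be absent.
    if "pharos_class" not in value:
        return True
    classes = value["pharos_class"]
    if not isinstance(classes, list):
        return False
    # Items must be strings and members of the enum. The enum lists null,
    # but the "string" type constraint excludes it.
    return all(isinstance(item, str) and item in PHAROS_CLASSES for item in classes)


print(matches_druggability_schema({"pharos_class": ["Tchem"]}))
```

The old array-of-objects shape, such as `[{"pharos_class": "Tchem"}]`, is rejected by this check because the root is now an object or null.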
12 changes: 6 additions & 6 deletions test_config.yaml
@@ -144,7 +144,7 @@ datasets:
- gene_info:
files:
- name: gene_metadata
id: syn25953363.13
id: syn25953363.14
format: feather
- name: igap
id: syn12514826.5
@@ -162,16 +162,16 @@ datasets:
- name: median_expression
id: syn27211878.2
format: csv
- name: druggability
id: syn13363443.11
format: csv
- <<: *genes_biodomains_files
- name: tep_adi_info
id: syn51942280.3
format: csv
- name: ensg_to_uniprot_mapping
id: syn54113663.3
format: tsv
- name: pharos_classes
id: syn64123611.1
format: csv
final_format: json
custom_transformations:
adjusted_p_value_threshold: 0.05
@@ -192,7 +192,7 @@ datasets:
uniprotkb_accession: uniprotkb_accessions
resource_identifier: ensembl_gene_id
provenance:
- syn25953363.13
- syn25953363.14
- syn12514826.5
- syn12514912.3
- *agora_proteomics_provenance
@@ -201,10 +201,10 @@ datasets:
- *rna_diff_expr_data_provenance
- syn12540368.51
- syn27211878.2
- syn13363443.11
- *genes_biodomains_provenance
- syn51942280.3
- syn54113663.3
- syn64123611.1
agora_rename:
symbol: hgnc_symbol
destination: *dest
20 changes: 20 additions & 0 deletions tests/test_assets/gene_info/input/pharos_classes_good_input.csv
@@ -0,0 +1,20 @@
ensembl_gene_id,uniprot_id,hgnc_symbol,pharos_class
ENSG00000000005,Q9H2S6,TNMD,Tbio
ENSG00000000419,O60762,DPM1,Tbio
ENSG00000000457,Q8IZE3,SCYL3,Tbio
ENSG00000000460,Q9NSG2,C1orf112,Tbio
ENSG00000000938,P09769,FGR,Tchem
ENSG00000000971,P08603,CFH,Tbio
ENSG00000001036,Q9BTY2,FUCA2,Tchem
ENSG00000001084,P48506,GCLC,Tchem
ENSG00000001167,P23511,NFYA,Tbio
ENSG00000001460,Q5TH74,STPG1,Tbio
ENSG00000001461,Q6P499,NIPAL3,Tdark
ENSG00000001497,Q9Y4W2,LAS1L,Tbio
ENSG00000001561,Q9Y6X5,ENPP4,Tbio
ENSG00000001617,Q13275,SEMA3F,Tbio
ENSG00000001626,P13569,CFTR,Tclin
ENSG00000001629,Q9P2G1,ANKIB1,Tdark
ENSG00000001630,Q16850,CYP51A1,Tchem
ENSG00000001631,O00522,KRIT1,Tbio
ENSG00000001631,O00522,KRIT1,Tchem