Add example GA4GH tables from dbGaP to demonstrate how a user would resolve differences in values/coding #4

Open
ianfore opened this issue Nov 8, 2020 · 3 comments

Comments

@ianfore
Collaborator

ianfore commented Nov 8, 2020

Added one table so far, with its own codings for sex and race.
Its column (variable) names are also unique.
The table so far: search_cloud.cshcodeathon.organoid_profiling_pc_subject_phenotypes_gru

Aiming for three or four such tables from dbGaP. The codings and column names will vary.

The question is: how can the machine-readable information (schema) provided about each table make things easier for a data scientist? We assume they are using tools such as Python or R and can transform the data in those tools quite easily, as long as they have the information to do so. /table/tablename/info provides that information.
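As a rough illustration of reading that schema from Python, here is a minimal sketch. The base URL is a placeholder, and the assumption that the response carries a JSON Schema under `data_model` with a `properties` object is mine, based on how the GA4GH Search table-info response is generally described; check it against the server you are using. The helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Placeholder base URL (an assumption); substitute the Search endpoint you use.
BASE_URL = "https://search.example.org"


def table_info_url(base_url: str, table_name: str) -> str:
    """Build the /table/{table_name}/info URL for a table."""
    return f"{base_url.rstrip('/')}/table/{quote(table_name, safe='')}/info"


def fetch_table_info(base_url: str, table_name: str) -> dict:
    """Fetch and decode the table info document (performs a network call)."""
    with urlopen(table_info_url(base_url, table_name)) as resp:
        return json.load(resp)


def column_summary(info: dict) -> dict:
    """Map each column name to its declared type and enum values, if any.

    Assumes the schema sits under info["data_model"]["properties"].
    """
    properties = info.get("data_model", {}).get("properties", {})
    return {
        name: {"type": spec.get("type"), "enum": spec.get("enum")}
        for name, spec in properties.items()
    }
```

Calling `column_summary(fetch_table_info(BASE_URL, "search_cloud.cshcodeathon.organoid_profiling_pc_subject_phenotypes_gru"))` would then list the columns. Where the schema is autogenerated, the `enum` entries are likely to be empty.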

Note that in dbGaP the data used in the table above is controlled access. The dataset available through the GA4GH Search API uses values from the dataset but each record (row) is a simulated example - not a real record.

@ianfore
Collaborator Author

ianfore commented Nov 8, 2020

See this diagram for a flow of how the data dictionaries get from dbGaP to GA4GH Search. In the case of the tables I've created this weekend, we haven't yet run the data dictionaries through the DNAStack step. So for the moment the schemas we see in GA4GH Search are autogenerated from the BigQuery table definitions. That doesn't include the enumerated listings of codes. However, for the moment we can get the definitions from the dbGaP data dictionary itself. For example, here's a link to the data dictionary for the data in the organoid_profiling_pc_subject_phenotypes_gru table.

Note that if you open the link in a web browser it will display as an HTML table. For API purposes you can read the XML programmatically. Another approach is to read the dictionary visually and translate the data in Python or R.
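A minimal sketch of the programmatic route follows. It assumes the dictionary XML uses `variable` elements with a `name` child and `value` children carrying a `code` attribute, which is how dbGaP data dictionaries are commonly laid out; verify this against the actual file. The sample XML, its IDs, and its codes are made up for illustration and are not taken from the organoid table.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the assumed dbGaP dictionary layout (not real data).
SAMPLE_XML = """<data_table id="pht000000.v1">
  <variable id="phv00000001.v1">
    <name>SEX</name>
    <description>Sex of participant</description>
    <type>encoded value</type>
    <value code="1">Male</value>
    <value code="2">Female</value>
  </variable>
  <variable id="phv00000002.v1">
    <name>AGE</name>
    <type>integer</type>
  </variable>
</data_table>"""


def code_maps(xml_text: str) -> dict:
    """Return {variable name: {code: label}} for variables with coded values."""
    root = ET.fromstring(xml_text)
    maps = {}
    for variable in root.iter("variable"):
        name = variable.findtext("name")
        codes = {
            value.get("code"): (value.text or "").strip()
            for value in variable.findall("value")
            if value.get("code") is not None
        }
        if name and codes:
            maps[name] = codes
    return maps


def decode_row(row: dict, maps: dict) -> dict:
    """Replace coded values with their labels; leave other values unchanged."""
    return {col: maps.get(col, {}).get(str(val), val) for col, val in row.items()}
```

With the sample above, `decode_row({"SEX": 2, "AGE": 40}, code_maps(SAMPLE_XML))` replaces the sex code with its label and leaves the age untouched. Codes are compared as strings because XML attributes are always text.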

There are some mapping tables that can help, which I will add to GA4GH Search.

@ianfore
Collaborator Author

ianfore commented Nov 8, 2020

Created these two tables to do a mapping.
search_cloud.cshcodeathon.md_mapping
search_cloud.cshcodeathon.md_mapping_term
Working on an example to use them.

@ianfore
Collaborator Author

ianfore commented Nov 8, 2020

Mapping example added.

Now need to map more columns!
