
Metadata file: compilation of a metadata file of marker genes for expected cell types that will be used for validation at a later step #672

Closed

Conversation

@maud-p (Contributor) commented Jul 29, 2024

Purpose/implementation Section

In this first module, I create two metadata tables that compile information from the literature on marker genes and known genetic alterations; these will be used later to validate annotations of the Wilms tumor dataset.

Please link to the GitHub issue that this pull request addresses.

#671
#635 (reply in thread)

What is the goal of this pull request?

Wilms tumor (WT) is the most common pediatric kidney cancer and is characterized by pronounced intra- and inter-tumor heterogeneity. The genetic landscape of WT is very diverse across its histological components. The Children's Oncology Group (COG) classifies WT patients into two groups: favorable histology and diffuse anaplasia. Each of these groups is composed of blastemal, epithelial, and stromal populations of cancer cells in different proportions, as well as cells from the normal kidney, mostly kidney epithelial cells, endothelial cells, immune cells, and normal stromal cells (fibroblasts).

In this module, we reviewed the literature to compile a table of marker genes for each of the cell types expected in the dataset. Additionally, we provide a table of known genetic alterations in Wilms tumor that can be useful for validating the CNV profiles obtained after running inferCNV.

Briefly describe the general approach you took to achieve this goal.

The table CellType_metadata.csv contains the following columns and information:

  • "gene_symbol" contains the symbol of the described gene, using the HUGO Gene Nomenclature
  • ENSEMBL_ID contains the stable identifier from the ENSEMBL database
  • cell_class is either "malignant" for marker genes specific to malignant population, or "non-malignant" for markers genes specific to non-malignant tissue or "both" for marker genes that can be found in malignant as well as non-malignant tissue but are still informative in respect to the cell type.
  • cell_type contains the list of the cell types that are attributed to the marker gene
  • DOI contains the list of main publication identifiers supporting the choice of the marker gene
  • comment can be empty or contains any additional information

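As an illustration of how this table might be consumed downstream (a hedged sketch only: the file path and the check on allowed cell_class values are assumptions, not part of this PR), something like the following could be used:

library(readr)

# Illustrative path; the file may live elsewhere once the module is reorganized
cell_type_markers <- read_csv("CellType_metadata.csv")

# Confirm the documented columns are present
expected_cols <- c("gene_symbol", "ENSEMBL_ID", "cell_class", "cell_type", "DOI", "comment")
stopifnot(all(expected_cols %in% colnames(cell_type_markers)))

# cell_class should only contain the three documented values
stopifnot(all(cell_type_markers$cell_class %in% c("malignant", "non-malignant", "both")))
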
The table GeneticAlterations_metadata.csv contains the following columns and information:

  • alteration contains the number and portion of the affected chromosome
  • gain_loss contains the information regarding the gain or loss of the corresponding genetic alteration
  • cell_class is "malignant"
  • cell_type contains the list of malignant cell types attributed to the genetic alteration, either blastemal, stromal, or epithelial, or NA if none of the three histologies is more prone to the described genetic alteration
  • DOI contains the list of main publication identifiers supporting the choice of the genetic alteration
  • comment can be empty or contain any additional information

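A parallel sanity check could be applied to this second table (again a hedged sketch; the path and the assumption that gain_loss is coded as "gain"/"loss" are illustrative only):

library(readr)

genetic_alterations <- read_csv("GeneticAlterations_metadata.csv")

# Confirm the documented columns are present
expected_cols <- c("alteration", "gain_loss", "cell_class", "cell_type", "DOI", "comment")
stopifnot(all(expected_cols %in% colnames(genetic_alterations)))

# Assumes gain/loss is coded with these two labels; adjust if the table uses different ones
stopifnot(all(genetic_alterations$gain_loss %in% c("gain", "loss")))
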
If known, do you anticipate filing additional pull requests to complete this analysis module?

This module will be used for later validation of the annotations and results from inferCNV.

What is the name of your results bucket on S3?

Results should be uploaded to your bucket so they are available during review.
See here for instructions on how to upload to your bucket:
https://openscpca.readthedocs.io/en/latest/software-platforms/aws/working-with-s3-buckets/

What types of results does your code produce (e.g., table, figure)?

Two tables: CellType_metadata.csv and GeneticAlterations_metadata.csv.

Provide directions for reviewers

This section has two aims:

  • learn how to work with the GitHub repository, file issues, and open pull requests
  • gather information from the literature into metadata files for later use in validating the annotations

What are the software and computational requirements needed to be able to run the code in this PR?

Are there particular areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

  • This analysis module uses the analysis template and has the expected directory structure.
  • [x] The analysis module README.md has been updated to reflect code changes in this pull request.
  • The analytical code is documented and contains comments.
  • Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@maud-p maud-p requested a review from allyhawkins as a code owner July 29, 2024 13:51
@jaclyn-taroni jaclyn-taroni requested review from jaclyn-taroni and removed request for allyhawkins July 29, 2024 18:46
@jaclyn-taroni (Member) left a comment

Hi @maud-p, thanks so much for filing this! We were introduced somewhat via email, but I’m Jaclyn, Director of the Childhood Cancer Data Lab 👋🏻.

First, I want to commend you for putting together a pull request that’s a reviewer/organizer’s dream for the following reasons:

  • You filed an issue before this pull request, so we knew to expect it and were prepared to review it.

Because you’ve done a great job explaining your plan in your discussion/issue posts, I will start by providing some high-level feedback that I think will help us accomplish later steps.

Module naming and organization

Just a note on language — we often call a directory or folder that contains an analysis a “module” in our docs, and I will probably use “analysis,” “folder,” and “module” somewhat interchangeably in my review.

Naming

I recommend renaming your folder to something more descriptive to help us stay organized. For example, cell-type-wilms-tumor-06 would help anyone reading the code base know that the module performs cell typing on a Wilms Tumor project, and since we have two as part of ScPCA (SCPCP000014 and SCPCP000006), I think the -06 will help distinguish them.

Organization

We think of folders in analyses as standalone units: all the steps needed to complete an analysis—in your case, cell typing samples in SCPCP000006—will live in one directory/folder.

One of the reasons to organize work this way is that we’ll set up a workflow to ensure that all the steps of your analysis can be run on test data. (You can read more about that here if you’re interested!) Big picture: it helps us maintain your module over time if, for example, something in the data release changes. We know you’ll have invested a lot of effort into your module, so we want to make sure if we break something, we know about it and can fix it!

All that being said — given what you and @allyhawkins discussed in #635, I might expect the next few steps to result in a folder structure that looks something like the following:

cell-type-wilms-tumor-06
├── scripts
│   ├── clustering.R
│   ├── label-transfer.R
│   └── ...
├── results
│   └── README.md
├── marker-sets
│   ├── CellType_metadata.csv
│   └── GeneticAlterations_metadata.csv
├── plots
│   └── ...
├── scratch
│   └── ...
├── 01-explore-cluster-labels.Rmd
├── 02-annotate-normal-cells.Rmd
├── ...
├── README.md
├── Dockerfile
├── .gitignore
└── .dockerignore

This might not be exactly the right way to organize the module, but the main points are:

  • All of the scripts, notebooks, etc., you'll use to perform cell typing are in one folder, for example:
    • Clustering
    • Label transfer
  • If you are committing the marker CSVs to the repo (and you should!), let's have you move them to a folder called marker-sets instead of putting them in results.

I’ll also note that we expect everyone’s analyses to “share” or use the same download data script in the root of the repository (download-data.py), so you can delete the data/download-data.py file.

Docker

I know you and @jashapiro have discussed this a bit in #671, so I’m going to quote him here:

For best compatibility with the other packages currently in use, you might consider using Bioconductor 3.19 and R 4.4. We use these in part because of a known security vulnerability in R <4.4.

For easiest implementation that saves on some installation time, you might consider using the bioconductor/tidyverse:3.19 image for your development.

And I know you are interested in using that moving forward from your comment (#671 (comment))!

Similar to how we’ll organize all the scripts and notebooks needed for an analysis into one folder, we’d expect the Dockerfile in your module to contain all the software dependencies required for running the entire module for many of the same reasons (i.e., to make sure we can run all the steps over the project’s “lifespan”).

You are not adding any code here yet, so I think you can remove the Dockerfile and add one later as you add code. It’s also okay to keep it in here to serve as a reference in the future; you can always edit it to use bioconductor/tidyverse:3.19. We are also happy to take the lead on maintaining the Dockerfile if you’d prefer to focus on the analyses. It’s really up to you based on what you think would be easiest!
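
For reference, a minimal Dockerfile along those lines could start from the suggested image and nothing else (a sketch only; the actual dependency list would grow as code is added):

# Bioconductor 3.19 / R 4.4 base image suggested above
FROM bioconductor/tidyverse:3.19

# R/Bioconductor packages required by the module's scripts and notebooks would be installed here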

Summary/next steps

I’m going to list the next steps here since I know I’ve written a lot, and some of what I’ve written doesn’t require any action!

  • Rename the folder to cell-type-wilms-tumor-06 or something similarly descriptive
  • Move the CSV files to a new folder called marker-sets
    • CellType_metadata.csv
    • GeneticAlterations_metadata.csv
  • Remove the data download script included here (data/download-data.py)

Optional:

  • Remove the Dockerfile with the idea that you’ll add a new one later as you add code to the module

Future thought:

  • Plan to add scripts/materials/etc. for the rest of the steps you’ve proposed to the cell-type-wilms-tumor-06 folder (or whatever name you choose)

If you have any questions about this, please let us know. We are here to help!

I expect the next round of review will focus on the scientific content of what you’re adding here.

Again, awesome job with your first pull request. We’re thrilled to have you aboard! 😄

@maud-p (Contributor, Author) commented Jul 30, 2024

Dear Jaclyn,

Thank you very much for your encouraging comment and all the information. It is really useful for understanding the expected structure of the final folder/module.
I will commit the requested changes as soon as possible and continue with the clustering analysis.

My understanding of the reviewing process so far is:

  1. I commit the changes that you requested and re-request a review until we are fine with the changes on both sides, and then I can close the pull request / merge the commits into AlexsLemonade:main.
  2. Before starting a new part of the analysis (like step 2, clustering), I initiate a new issue to describe the plan. Once I am done with the analysis, I submit another pull request linked to this new issue. And back to 1).

Did I get it right?

Regarding the maintenance of the Dockerfile, thank you very much for your offer. I would like to try to do it myself, but if checking and advising on it becomes more work for you than maintaining it yourselves, please just let me know!

Thank you again.

@jaclyn-taroni (Member) commented:

I've looked at the commit history locally. One way to "remove" the clustering changes from this pull request would be to have you create a new branch and then refile the pull request (i.e., you close this one, and we start a new one).

The way you could do that is with the following steps.

First, you'd make sure you're on your main branch:

git checkout main

Then you're going to create a new branch (here I've called it start-wilms-analysis) at a place in the Git history before you added the clustering analysis:

git checkout -b start-wilms-analysis b754e5de88d7ec9d99be0b50db00e34d0b183a4b

Then you can push the new branch to GitHub with:

git push -u origin start-wilms-analysis

Then, you can file a new pull request using the new branch (start-wilms-analysis) from the GitHub UI.

You could largely copy and paste your initial comment when you file, and this closed PR here would retain the record of our conversation.

For now, I think we could plan to leave #680 as is and just make sure we don't merge it until the new PR goes into AlexsLemonade/main.

What do you think of this plan, @maud-p?

@maud-p (Contributor, Author) commented Aug 1, 2024

Sounds good, thank you @jaclyn-taroni for the precise steps :) I'll do it in a minute.

Regarding #680, I think I found a way to add my commit to the maud-p-01-clustering branch now!

@maud-p (Contributor, Author) commented Aug 1, 2024

One question @jaclyn-taroni: for the next step, each time I start a new step in the analysis, I should:

  • generate a new issue
  • generate a new branch and work on it

Correct?

@jaclyn-taroni (Member) commented:

one question, for the next step, each time I start a new step in the analysis, I should:

  • generate a new issue
  • generate a new branch and work on it

Correct?

Yes, that's right! More completely:

  • New issue
  • New branch
  • New pull request from the branch

So we're aiming for 1 issue:1 pull request, but it does not always work out that cleanly. If an issue is particularly "big" (like it will require two+ scripts or notebooks to accomplish), you might end up with something that looks like:

  • New issue
  • First branch
  • First pull request from the first branch
  • Second branch is created from the first branch
  • Second pull request from the second branch

And that's totally okay!

I think the most important takeaways are that your reviewers have enough information to do a good job with their review (e.g., context about your scientific goals), and the pull requests are a manageable size. ~400 lines that need to be reviewed or one script/notebook are some rules of thumb you can use for what is a "manageable size."

@jaclyn-taroni jaclyn-taroni mentioned this pull request Aug 1, 2024
@jaclyn-taroni (Member) commented:

Closing in favor of #681.
