Draft annotation #844

Merged
merged 29 commits into from
Nov 1, 2024

Conversation

maud-p
Contributor

@maud-p maud-p commented Oct 29, 2024

Purpose/implementation Section

In this PR, I would like to make a first draft of annotations for the Wilms tumor 06 dataset.

Please link to the GitHub issue that this pull request addresses.

I opened the issue:
#839

What is the goal of this pull request?

To summarize the analyses performed so far and try to combine them to annotate the Wilms tumor dataset.

Briefly describe the general approach you took to achieve this goal.

The aim is to combine label transfer and CNV inference to annotate Wilms tumor samples in SCPCP000006. The proposed annotation will be based on the combination of:

  • the label transfer from the fetal kidney reference (Stewart et al.), in particular fetal_kidney_predicted.compartment and fetal_kidney_predicted.cell_type, as well as the mapping.score for each compartment,
  • the predicted CNV calculated using intra-sample endothelial and immune cells (--reference both) as the normal reference.

In a second step, we will explore and validate the chosen annotation.

We will use some marker genes to visually validate the annotations.

The analysis can be summarized as follows, where cnv.thr and map.thr still need to be discussed:

| first level annotation | second level annotation | selection of the cells | marker genes for validation | cnv validation |
|---|---|---|---|---|
| normal | endothelial | compartment == "endothelium" & mapping_score > map.thr & cnv_score < cnv.thr | VWF | no cnv |
| normal | immune | compartment == "immune" & mapping_score > map.thr & cnv_score < cnv.thr | PTPRC, CD163, CD68 | no cnv |
| normal | kidney | cell_type %in% c("kidney cell", "kidney epithelial", "podocyte") & mapping_score > map.thr & cnv_score < cnv.thr | CDH1, PODXL, LTL | no cnv |
| normal | stroma | compartment == "stroma" & mapping_score > map.thr & cnv_score < cnv.thr | VIM | no cnv |
| cancer | stroma | compartment == "stroma" & cnv_score > cnv.thr | VIM | proportion_cnv_chr -1 -4 -11 -16 -17 -18 |
| cancer | blastema | compartment == "fetal_nephron" & cell_type == "mesenchymal cell" & cnv_score > cnv.thr | CITED1 | proportion_cnv_chr -1 -4 -11 -16 -17 -18 |
| cancer | epithelial | compartment == "fetal_nephron" & cell_type != "mesenchymal cell" & cnv_score > cnv.thr | CDH1 | proportion_cnv_chr -1 -4 -11 -16 -17 -18 |
| unknown | - | the rest of the cells | - | proportion_cnv_chr -1 -4 -11 -16 -17 -18 |
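
The selection rules in the table could be expressed as two `dplyr::case_when()` calls. This is only a minimal sketch, not the notebook's code: the toy data frame, the simplified column names (`compartment`, `cell_type`, `mapping_score`, `cnv_score`) and the values `map.thr = 0.75` and `cnv.thr = 0.5` are placeholder assumptions, since the real thresholds are still to be discussed.

```r
library(dplyr)

# Placeholder thresholds (assumptions; the real values are still under discussion)
map.thr <- 0.75
cnv.thr <- 0.5

# Cell types treated as normal kidney in the table
kidney_types <- c("kidney cell", "kidney epithelial", "podocyte")

# Toy cells standing in for the real per-cell metadata (hypothetical values)
cell_type_df <- data.frame(
  compartment = c("endothelium", "immune", "stroma", "stroma",
                  "fetal_nephron", "fetal_nephron", "fetal_nephron", "immune"),
  cell_type = c("endothelial cell", "macrophage", "stroma cell", "stroma cell",
                "mesenchymal cell", "kidney epithelial", "podocyte", "macrophage"),
  mapping_score = c(0.9, 0.95, 0.8, 0.4, 0.5, 0.6, 0.9, 0.3),
  cnv_score = c(0.1, 0.2, 0.1, 0.9, 0.8, 0.7, 0.2, 0.1)
)

annotated_df <- cell_type_df |>
  mutate(
    # normal cells need a confident label transfer and a low CNV score
    is_normal = mapping_score > map.thr & cnv_score < cnv.thr,
    # cancer cells are called on the CNV score only
    is_cancer = cnv_score > cnv.thr,
    first.level_annotation = case_when(
      is_normal & (compartment %in% c("endothelium", "immune", "stroma") |
        cell_type %in% kidney_types) ~ "normal",
      is_cancer & compartment %in% c("stroma", "fetal_nephron") ~ "cancer",
      TRUE ~ "unknown"
    ),
    second.level_annotation = case_when(
      first.level_annotation == "normal" & compartment == "endothelium" ~ "endothelial",
      first.level_annotation == "normal" & compartment == "immune" ~ "immune",
      first.level_annotation == "normal" & cell_type %in% kidney_types ~ "kidney",
      first.level_annotation == "normal" & compartment == "stroma" ~ "stroma",
      first.level_annotation == "cancer" & compartment == "stroma" ~ "stroma",
      first.level_annotation == "cancer" & cell_type == "mesenchymal cell" ~ "blastema",
      # remaining cancer cells are fetal_nephron and not mesenchymal
      first.level_annotation == "cancer" ~ "epithelial",
      TRUE ~ "unknown"
    )
  )
```

Because `case_when()` evaluates conditions in order and stops at the first match, cells that satisfy none of the rules fall through to "unknown", which matches the last row of the table.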

If known, do you anticipate filing additional pull requests to complete this analysis module?

I think quite a few points need to be discussed and can be improved or checked in later analyses.

Provide directions for reviewers

I think the present notebook is not completely done, but I wanted to share with you what I have been able to summarize and explore so far.
Happy to discuss every step and how it can be improved.

What are the software and computational requirements needed to be able to run the code in this PR?

Are there particularly areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@jaclyn-taroni jaclyn-taroni requested review from sjspielman and removed request for jaclyn-taroni October 30, 2024 09:51
Member

@sjspielman sjspielman left a comment

Overall I think this is a really nice first draft of annotations! I've left some initial feedback about where I think we can make the code more robust, and some spots where I have questions about the approach.

Member

This currently has the same file name as the 04 notebook, except with 07. We definitely want to more clearly distinguish these, so can you rename this one? Maybe like combined annotation across samples, since it's more than just label transfer?


````
```{r fig.width=10, fig.height=10, out.width='100%', results='asis'}
````


Member

Again, you'll want to use dplyr::case_when() here

cell_type_df$first.level_annotation <- "unknown"


# Define normal cells

Member

One question I have here (and more generally for the notebook) is whether you want to use the scores when assigning labels based on results from label transfer.

For example, in your first condition here, you check whether a cell is fetal nephron or stroma. Cells which have scores of, say, 0.1 (aka, not very confident!) but are labeled nephron will be regarded as nephron. An alternative approach is to say "any cell with a score below a chosen threshold has an UNKNOWN compartment", and then not label these cells at all. This would be a separate condition: if the score for the compartment is less than a certain value, then keep that cell as unknown since we don't have reliable label transfer results.

That said, I don't think this matters quite so much for compartment, since those results are more reliable, but it may matter for the cell_type annotations from label transfer, which have many more categories.
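
As a rough illustration of the "separate condition" described above, low-confidence labels could be masked before any annotation rule is applied. This is a sketch under assumptions: the column names `compartment_score` and `cell_type_score`, the toy values, and the cutoff `score.thr = 0.5` are all hypothetical and not taken from the notebook.

```r
library(dplyr)

# Hypothetical confidence cutoff for label transfer scores
score.thr <- 0.5

# Toy label transfer results (made-up values for illustration)
label_df <- data.frame(
  compartment = c("fetal_nephron", "stroma", "immune"),
  compartment_score = c(0.1, 0.9, 0.7),
  cell_type = c("mesenchymal cell", "stroma cell", "macrophage"),
  cell_type_score = c(0.3, 0.8, 0.4)
)

# Replace any label whose score is below the cutoff with "unknown"
masked_df <- label_df |>
  mutate(
    compartment = if_else(compartment_score < score.thr, "unknown", compartment),
    cell_type = if_else(cell_type_score < score.thr, "unknown", cell_type)
  )
```

Masking each label independently means a cell can keep a reliable compartment while its finer cell_type label is dropped, which fits the point that compartment results are the more reliable of the two.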

Contributor Author

This is a good idea, thanks!
I would however only apply it to "normal" cells, as we expect cancer cells to have lower predicted.score?

What about doing a quick check at the end of the density of the predicted.scores for each first/second.level_annotation, and filtering out annotations with too low confidence?

Member

I would however only apply it to "normal" cells, as we expect cancer cells to have lower predicted.score?

This makes sense to me, but please add a sentence to the notebook that explicitly states this expectation. I see you've added something about using the scores for normal cells (great!) so let's add this explanation too for why you don't use them for cells we're calling as cancer.

Member

What about doing a quick check at the end of the density of the predicted.scores for each first/second.level_annotation, and filtering out annotations with too low confidence?

I think it would certainly be worth looking at the distribution of scores here, and then we can think about filtering. But, I might open a separate issue for this as something to circle back to after the deadline!
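
For whenever this is picked up, a check of the score distributions might look something like the sketch below. It assumes a data frame with `first.level_annotation`, `second.level_annotation` and `predicted.score` columns; the toy values are invented purely so the example runs.

```r
library(dplyr)
library(ggplot2)

# Toy scores per annotated cell (invented values, not real results)
score_df <- data.frame(
  first.level_annotation = rep(c("normal", "cancer"), each = 4),
  second.level_annotation = rep(c("immune", "stroma", "blastema", "epithelial"), each = 2),
  predicted.score = c(0.9, 0.8, 0.7, 0.9, 0.3, 0.5, 0.4, 0.6)
)

# Numeric summary: number of cells and median score per annotation
score_summary <- score_df |>
  group_by(first.level_annotation, second.level_annotation) |>
  summarise(
    n_cells = n(),
    median_score = median(predicted.score),
    .groups = "drop"
  )

# Density of scores, one panel per first-level annotation
score_density_plot <- ggplot(score_df) +
  aes(x = predicted.score, fill = second.level_annotation) +
  geom_density(alpha = 0.5) +
  facet_wrap(vars(first.level_annotation))
```

Looking at the summary table alongside the plot makes it easier to decide whether a single cutoff is reasonable or whether normal and cancer annotations need to be treated differently.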



````
```{r fig.width=20, fig.height=20, out.width='100%', results='asis'}
ggplot(cell_type_df[cell_type_df$first.level_annotation == "normal",], aes( x = umap.umap_1, y = umap.umap_2, color = second.level_annotation), shape = 19, size = 1)+
````

Member

I'm not sure it is informative to only show normal cells (and similarly below in your next plot to only show cancer cells). Can you explain more of your reasoning for these plots so I understand how they help interpretation?

Contributor Author

For most cancer types, I guess we are used to seeing cancer cells cluster separately from normal cells.

For Wilms tumor, however, I think this is more complicated: the cancer cells comprise epithelial, stroma and blastema cancer cells, which have more (transcriptional) similarities with their normal counterparts (i.e. normal kidney epithelium, normal reactive stroma) than with each other.

For that reason, I expect for example epithelial cancer and normal cells to be close, if not mixed, in the UMAP reduction.

I therefore found it easier to visualize cancer and normal cells separately. But it might actually be better to have the two plots side by side for each patient.

Member

I think what might help here is a plotting strategy I have used before to highlight cells in a UMAP: you can make all cells light gray, and then on top of that add a layer of your cells of interest that are colored. This way, you can clearly see the cells you care about, but still see the full context of the UMAP.

Here's an example of how you might code something like this:

library(dplyr)
library(ggplot2)

# data frame that only contains points of interest
subsetted_iris <- iris |>
  filter(Species == "versicolor")

ggplot(iris) +
  aes(x = Sepal.Length, y = Sepal.Width) +
  # background layer: all points in gray
  geom_point(color = "gray") +
  # add layer with points of interest colored
  geom_point(
    data = subsetted_iris,
    aes(color = Species)
  )

Again though, this might be something to do later after the deadline!


````
```{r fig.width=10, fig.height=10, out.width='100%', results='asis'}

cell_type_df$second.level_annotation[ cell_type_df$compartment %in% c("stroma") &
````

Member

Again, please use case_when

maud-p and others added 5 commits October 30, 2024 22:48
…s_Samples_exploration.Rmd

Co-authored-by: Stephanie Spielman <[email protected]>
…s_Samples_exploration.Rmd

Co-authored-by: Stephanie Spielman <[email protected]>
…s_Samples_exploration.Rmd

Co-authored-by: Stephanie Spielman <[email protected]>
…s_Samples_exploration.Rmd

Co-authored-by: Stephanie Spielman <[email protected]>
@maud-p
Contributor Author

maud-p commented Oct 30, 2024

Thank you @sjspielman for looking into it! I just pushed the few changes. Thank you for the case_when suggestion, I didn't know it. I should definitely use dplyr more 😃
Let me know if anything in my answers is not clear!
Thanks!

Member

@sjspielman sjspielman left a comment

Thank you for the case_when suggestion

case_when is so useful, glad I could introduce you to it 😄

I left a few comments about other areas I think this can be improved, but overall I think this notebook is a great "first draft" of your annotations and you should plan (if you want to still contribute!) to do some of the additional code changes later since you may not have time before the deadline. In this case, I would encourage you to open an issue where you can track future updates to this notebook. You can always copy and paste some of my comments there to help write the issue, too! Another benefit to writing this issue is just so everyone knows there is potentially more work planned for this module (even if that work doesn't happen, that's ok! we still will have the record of discussing it!).

For now, here's what we need at least:

  • Please make the corresponding change to fix the has_cnv_score "bug" in the 06 notebook, so it uses <= and > instead of :.
  • I do not see any code that actually runs inferCNV across all samples, only the 5 we have explored more in depth. This needs to be part of the workflow. It's clear that you've run the code, so maybe it just isn't committed?
  • This notebook needs to be added to the workflow and to the module's README file
  • Remove the old HTML that is still in this PR (from before you renamed the notebook)
  • Have this notebook export a TSV file of draft annotations that meets these guidelines: https://openscpca.readthedocs.io/en/latest/grant-opportunities/#submission-acceptance-criteria. You can still have your first-level and second-level annotations, but we'll want the columns described in the link above too. Since you did not actually use marker genes for the annotation, just to explore annotation and do a little bit of validation, you won't need to make that second TSV described in the link.
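
To illustrate why the first item asks for `<=` and `>` instead of `:`, here is a small sketch. It does not reproduce the 06 notebook's code; it only assumes, for illustration, that a score was tested against `0:threshold` with `%in%`, and the score values and cutoff are made up.

```r
# Hypothetical per-cell CNV scores and cutoff (assumed values, not from the notebook)
cnv_score <- c(0, 0.5, 1, 1.5, 2, 3)
threshold <- 2

# `0:threshold` expands to the integers 0, 1, 2, so `%in%` only matches
# scores that are exactly one of those integers
low_with_colon <- cnv_score %in% 0:threshold
# TRUE FALSE TRUE FALSE TRUE FALSE

# Explicit comparisons classify every value, including non-integers
low_cnv <- cnv_score <= threshold
# TRUE TRUE TRUE TRUE TRUE FALSE
has_cnv <- cnv_score > threshold
# FALSE FALSE FALSE FALSE FALSE TRUE
```

With the range form, values such as 0.5 and 1.5 match neither side and silently fall out of both groups, whereas `<=` and `>` together cover every non-missing score exactly once.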

@sjspielman
Member

@maud-p in the interest of time given the deadline, I'm going to go ahead and push code to your branch that addresses a couple of my reviews, including:

  • I do not see any code that actually runs inferCNV across all samples, only the 5 we have explored more in depth. This needs to be part of the workflow. It's clear that you've run the code, so maybe it just isn't committed?
  • This notebook needs to be added to the workflow and to the module's README file
  • Remove the old HTML that is still in this PR (from before you renamed the notebook)
  • Have this notebook export a TSV file of draft annotations that meets these guidelines: https://openscpca.readthedocs.io/en/latest/grant-opportunities/#submission-acceptance-criteria. You can still have your first-level and second-level annotations, but we'll want the columns described in the link above too. Since you did not actually use marker genes for the annotation, just to explore annotation and do a little bit of validation, you won't need to make that second TSV described in the link.

Then, I will be able to approve this PR which will hopefully make your results eligible in time :)

@maud-p
Contributor Author

maud-p commented Oct 31, 2024

@sjspielman thank you so much for your help!
I am trying to catch up on the review of this PR, but I see you are already well along with the changes!
I think what remains to be done is the README.md file update and the final annotation TSV file. I will start with this now, is that OK?
Or would it introduce conflicts?
Thank you again so much, really appreciated :)

@sjspielman
Member

Or would it introduce conflicts?

It would definitely introduce conflicts, since along the way I have caught a few bugs and am addressing them too. I will report back here tomorrow with the full details, since I'm still working on it, but I will handle the TSV from here!

@maud-p
Contributor Author

maud-p commented Oct 31, 2024

Or would it introduce conflicts?

It would definitely introduce conflicts, since along the way I have caught a few bugs and am addressing them too. I will report back here tomorrow with the full details, since I'm still working on it, but I will handle the TSV from here!

OK thank you very much! Don't hesitate to let me know at the end of your working day what/if I can continue tomorrow morning (CEST time)!
Thank you!

@sjspielman
Member

I have implemented the following changes:

  • Updated documentation in the module to reflect the 07 notebook
  • Updated the 07 notebook to export a properly-formatted TSV of annotations
  • Fixed the 0:threshold bug in the 06 notebook
  • Fixed a few bugs in the inferCNV script which were previously missed because all samples we had run through had normal cells for a reference
  • Updated 00_run_workflow.R:
    • Added step to process all samples, where currently possible, through inferCNV with HMM i3 and "both" reference
    • Added step to render the 07 notebook
    • Fixed the for loops to only loop over relevant samples and not duplicate samples

This PR can therefore be approved! 🎉


These are the additional review comments which have not been implemented:

@maud-p, you may wish to open a new issue about addressing these comments or other future steps that you might be interested in doing in the future! But, we don't want them in this PR since it's being approved, and we'd like to merge it in to meet the deadline. Either way, thank you again for all your time and effort to get this draft of annotations done 🎆 🥳 !!!

@sjspielman sjspielman self-requested a review November 1, 2024 13:47
…, so we can get some draft annotations for it
@maud-p
Contributor Author

maud-p commented Nov 1, 2024

@sjspielman thank you so much for all your reviews, advice etc. I am really happy about the job we did together 🥳 thanks +++ for your great great help these last days to meet the deadline!!!!
I'll open the next issue on Monday/Tuesday, as I'd like to take the time to think about and summarize what I would like to pursue in this analysis and how. But I would definitely like to continue 😃
Thank you!!!

@sjspielman
Member

Noting that I was also able to get one more sample running through inferCNV, so now only 1 sample remains unannotated and it's probably due to some cryptic bug in inferCNV which can be investigated in the future.

@sjspielman
Member

Alright, I was able to get the last sample working!! All samples now have a draft annotation 🎉

@maud-p
Contributor Author

maud-p commented Nov 1, 2024

Alright, I was able to get the last sample working!! All samples now have a draft annotation 🎉

Thank you!!!!

@sjspielman sjspielman merged commit 2e289a2 into AlexsLemonade:main Nov 1, 2024
3 checks passed