
Initial assignment of cell ontology IDs to panglao cell types #909

Conversation

allyhawkins
Member

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Related to #887

What is the goal of this pull request?

Here I am starting the process of assigning cell ontology IDs to the cell types present in the Panglao reference we use when running CellAssign as part of scpca-nf. This initial PR does some preliminary R setup and adds a script to assign ontology IDs to cell types that have exact matches.

Briefly describe the general approach you took to achieve this goal.

  • I created an R project for this module, so the project file is being added here, along with additions to the renv lock file.
  • I wrote a script that takes as input the reference file from PanglaoDB and returns a TSV file with the ontology ID, the human-readable label, and the original value from the Panglao reference file for all cell types.
  • Before doing any matching, I did some minor modification of the original cell type names. I made everything lowercase, since that's how the names appear in CL (with the exception of B and T cells, which keep a capital B and T), and I also made everything singular. All the cell types either ended with "cells" ("endothelial cells" vs. "endothelial cell") or ended in "s" ("neurons" vs. "neuron"). I checked all the cell types to make sure that none of them actually end in "s" as part of the cell type name.
  • The output of this script is a table with ontology IDs for any exact matches and NA for any cell types that don't match. We can then fill out this table manually, replacing the NAs with the IDs we feel are most appropriate, so the output of this script will be modified outside of running it. To account for these manual changes, I added a check for whether the output file already exists. If it does, any cell types present in the existing file are removed prior to joining with the CL labels and then added back to the table afterwards. This means we can manually fill in an ontology ID and still re-run this script without losing that information. I don't know how often we would actually re-run this script, but this seemed easy to implement compared to accidentally overwriting something in the future.
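To illustrate the renaming step described above, here is a minimal shell sketch of the same transformation (the actual module uses R; the re-capitalization of B/T cells is my assumption about how that exception is handled):

```shell
#!/usr/bin/env bash
# Sketch of the cell type name normalization: lowercase everything,
# strip a trailing "s" to singularize, then restore the capital B/T
# that CL uses for B and T cells (assumed handling of the exception).
normalize() {
  echo "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/s$//' -e 's/^b cell/B cell/' -e 's/^t cell/T cell/'
}

normalize "Endothelial cells"  # endothelial cell
normalize "Neurons"            # neuron
normalize "B cells"            # B cell
```

Stripping only a single trailing "s" handles both the "cells" and the plain plural cases in one rule, which is why it matters that no cell type name legitimately ends in "s".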

I did want to point out that the actual reference file from Panglao is currently not in the repo, because it's ~1000 KB and we have a 200 KB limit on TSV files. How do we want to proceed here? We probably don't really need it; I could just make a list of the cell types, save it as a text file to read in, and point to where the file lives in scpca-nf. Or we could just make an exception for this file?

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes. After assigning labels this way there are 92 cell types that will need to be manually assigned. I'm thinking I'll break this up into ~ 4-5 PRs and do 20-25 cell types at a time.

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

Any packages needed to run this script have been recorded in the renv lock file.

Are there particular areas you'd like reviewers to have a close look at?

I first want to get some feedback on the overall approach and make sure we are okay with the decisions I made in the script. Are there things you would change about the overall setup? Once we are on the same page regarding building this new file with the ontology IDs and the use of the script, I'll add to the README and document this process. We can do that in this PR or a new one.

Is there anything that you want to discuss further?

Any thoughts on how to handle storing the large Panglao file?

Author checklists

Analysis module and review

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@jaclyn-taroni (Member) left a comment

I agree with your approach. I am not approving yet because we should address what to do about the Panglao DB file that's too big before merging. I added my thoughts in an inline comment.

module_base <- rprojroot::find_root(rprojroot::is_renv_project)

# read in original ref file
ref_file <- file.path(module_base, "references", "PanglaoDB_markers_2020-03-27.tsv")
Member

A few thoughts:

  • Do we want to include a data download script that grabs this from scpca-nf (provided I interpreted your comment I include below correctly) in this module?
  • Should we explicitly ignore this file in this module?

I think that's how I'd address this concern:

I did want to point out that currently the actual reference file from Panglao is not in the repo because it's ~ 1000 KB and we have a 200 KB limit on TSV files. How do we want to proceed here? We probably don't really need it and I could just make a list of the cell types and save as a text file to read in and point to where the file lives in scpca-nf or we could just make an exception for this file?

@allyhawkins
Member Author

A few thoughts:

  • Do we want to include a data download script that grabs this from scpca-nf (provided I interpreted your comment I include below correctly) in this module?
  • Should we explicitly ignore this file in this module?

I liked the idea of adding a script, so I did that and also included the file in the .gitignore for this module. I also added a README for the scripts and references folders. That should at least get us started on documentation, but I imagine I will expand/update those READMEs as we continue this process. @jaclyn-taroni this should be ready for another look.
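The download script itself is only partially quoted in this thread; a minimal sketch of the skip-if-present download logic might look like the following (the `download_ref` function wrapper is illustrative, not the actual implementation; the path and URL are taken from the merged script's snippet below):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Download the Panglao reference only if it isn't already present,
# so the gitignored file doesn't get re-fetched on every run.
download_ref() {
  local ref_file=$1
  local ref_url=$2
  if [ ! -f "$ref_file" ]; then
    curl -sSfL -o "$ref_file" "$ref_url"
  fi
}

# Example invocation (matches the merged script's path and URL):
# download_ref \
#   "references/PanglaoDB_markers_2020-03-27.tsv" \
#   "https://raw.githubusercontent.com/AlexsLemonade/scpca-nf/refs/heads/main/references/PanglaoDB_markers_2020-03-27.tsv"
```

Checking for the file before downloading pairs naturally with gitignoring it: a fresh clone fetches the reference once, and existing checkouts are left alone.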

@jaclyn-taroni (Member) left a comment

LGTM


# define path to ref file and url
ref_file="${scripts_dir}/../references/PanglaoDB_markers_2020-03-27.tsv"
ref_url="https://raw.githubusercontent.com/AlexsLemonade/scpca-nf/refs/heads/main/references/PanglaoDB_markers_2020-03-27.tsv"
Member

I suppose main is fine here (instead of a permalink) since the file name captures version information, and we'd probably want to use whatever the current version of this file is anyway.

Member Author

and we'd probably want to use whatever the current version of this file is anyway.

This exact thought was my reasoning for using main here.

@allyhawkins allyhawkins merged commit 37423e7 into AlexsLemonade:main Nov 26, 2024
2 checks passed
@allyhawkins allyhawkins deleted the allyhawkins/panglao-ontology-assignment-script branch November 26, 2024 15:29