
Initial assignment of cell ontology IDs to panglao cell types #909

Conversation

allyhawkins
Member

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Related to #887

What is the goal of this pull request?

Here I am starting the process of assigning cell ontology IDs to the cell types present in the Panglao reference we use when running CellAssign as part of scpca-nf. This initial PR does some preliminary R setup and adds a script to assign ontology IDs to cell types that have exact matches.

Briefly describe the general approach you took to achieve this goal.

  • I created an R project for this module, so the project file is being added here, along with additions to the renv lock file.
  • I wrote a script that takes as input the reference file from PanglaoDB and returns a TSV file with the ontology ID, the human-readable label, and the original value from the Panglao reference file for all cell types.
  • Before doing any matching, I did some minor modification of the original cell type names. I made everything lowercase, since that's how the names appear in CL (with the exception of B and T cells, which keep a capital B and T), and I also made everything singular. All the cell types either ended with "cells" ("endothelial cells" vs. "endothelial cell") or ended in "s" ("neurons" vs. "neuron"). I checked all the cell types to make sure that none of them actually end in "s" as part of the cell type name.
  • The output of this script is a table with ontology IDs for any exact matches and NA for any cell types that don't match. We can then fill out this table manually, replacing the NAs with the IDs we feel are most appropriate, so the output of this script will be modified outside of running it. To account for these manual changes, I added a check for whether the output file already exists. If it does, any cell types present in the existing file are removed prior to joining with the CL labels and then added back to the table afterwards. This means we can manually fill in an ontology ID and still re-run this script without losing that information. I don't know how often we would actually re-run this script, but this seemed easy to implement compared to accidentally overwriting something in the future.
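To illustrate the renaming step described above, here is a minimal shell sketch of the same transformation (the actual module uses R; the re-capitalization of B/T cells is my assumption about how that exception is handled):

```shell
#!/usr/bin/env bash
# Sketch of the cell type name normalization: lowercase everything,
# strip a trailing "s" to singularize, then restore the capital B/T
# that CL uses for B and T cells (assumed handling of the exception).
normalize() {
  echo "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/s$//' -e 's/^b cell/B cell/' -e 's/^t cell/T cell/'
}

normalize "Endothelial cells"  # endothelial cell
normalize "Neurons"            # neuron
normalize "B cells"            # B cell
```

Stripping only a single trailing "s" handles both the "cells" and the plain plural cases in one rule, which is why it matters that no cell type name legitimately ends in "s".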

I did want to point out that the actual reference file from Panglao is currently not in the repo, because it's ~1000 KB and we have a 200 KB limit on TSV files. How do we want to proceed here? We probably don't really need it; I could just make a list of the cell types, save it as a text file to read in, and point to where the file lives in scpca-nf. Or we could just make an exception for this file?

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes. After assigning labels this way there are 92 cell types that will need to be manually assigned. I'm thinking I'll break this up into ~ 4-5 PRs and do 20-25 cell types at a time.

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

Any packages needed to run this script have been recorded in the renv lock file.

Are there particular areas you'd like reviewers to have a close look at?

I first want to get some feedback on the overall approach and make sure we are okay with the decisions I made in the script. Are there things you would change about the overall setup? Once we are on the same page regarding building this new file with the ontology IDs and the use of the script, I'll add to the README and document this process. We can do that in this PR or a new one.

Is there anything that you want to discuss further?

Any thoughts on how to handle storing the large Panglao file?

Author checklists

Analysis module and review

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@jaclyn-taroni (Member) left a comment

I agree with your approach. I am not approving yet because we should address what to do about the Panglao DB file that's too big before merging. I added my thoughts in an inline comment.

module_base <- rprojroot::find_root(rprojroot::is_renv_project)

# read in original ref file
ref_file <- file.path(module_base, "references", "PanglaoDB_markers_2020-03-27.tsv")
Member

A few thoughts:

  • Do we want to include a data download script that grabs this from scpca-nf (provided I interpreted your comment I include below correctly) in this module?
  • Should we explicitly ignore this file in this module?

I think that's how I'd address this concern:

I did want to point out that currently the actual reference file from Panglao is not in the repo because it's ~ 1000 KB and we have a 200 KB limit on TSV files. How do we want to proceed here? We probably don't really need it and I could just make a list of the cell types and save as a text file to read in and point to where the file lives in scpca-nf or we could just make an exception for this file?

@allyhawkins
Member Author

A few thoughts:

  • Do we want to include a data download script that grabs this from scpca-nf (provided I interpreted your comment I include below correctly) in this module?
  • Should we explicitly ignore this file in this module?

I liked the idea of adding a script, so I did that and also included the file in the .gitignore for this module. I also added a README for the scripts and references folders. That should at least get us started on documentation, but I imagine I will expand/update those READMEs as we continue this process. @jaclyn-taroni this should be ready for another look.
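The download script itself is only partially quoted in this thread; a minimal sketch of the skip-if-present download logic might look like the following (the `download_ref` function wrapper is illustrative, not the actual implementation; the path and URL are taken from the merged script's snippet below):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Download the Panglao reference only if it isn't already present,
# so the gitignored file doesn't get re-fetched on every run.
download_ref() {
  local ref_file=$1
  local ref_url=$2
  if [ ! -f "$ref_file" ]; then
    curl -sSfL -o "$ref_file" "$ref_url"
  fi
}

# Example invocation (matches the merged script's path and URL):
# download_ref \
#   "references/PanglaoDB_markers_2020-03-27.tsv" \
#   "https://raw.githubusercontent.com/AlexsLemonade/scpca-nf/refs/heads/main/references/PanglaoDB_markers_2020-03-27.tsv"
```

Checking for the file before downloading pairs naturally with gitignoring it: a fresh clone fetches the reference once, and existing checkouts are left alone.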

@jaclyn-taroni (Member) left a comment

LGTM


# define path to ref file and url
ref_file="${scripts_dir}/../references/PanglaoDB_markers_2020-03-27.tsv"
ref_url="https://raw.githubusercontent.com/AlexsLemonade/scpca-nf/refs/heads/main/references/PanglaoDB_markers_2020-03-27.tsv"
Member

I suppose main is fine here (instead of a permalink) since the file name captures version information, and we'd probably want to use whatever the current version of this file is anyway.

Member Author

and we'd probably want to use whatever the current version of this file is anyway.

This exact thought was my reasoning for using main here.

@allyhawkins allyhawkins merged commit 37423e7 into AlexsLemonade:main Nov 26, 2024
2 checks passed
@allyhawkins allyhawkins deleted the allyhawkins/panglao-ontology-assignment-script branch November 26, 2024 15:29