generated from carpentries/workbench-template-rmd
Copied all vignettes into episodes directory as a first step. This probably won't build, but committing to have the original files in history.
Showing 8 changed files with 2,281 additions and 0 deletions.
Binary file added (+96.3 KB): episodes/figures/HCA_sccomp_SUPPLEMENTARY_technical_cartoon_curatedAtlasQuery.png
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,362 @@ | ||
--- | ||
title: Accessing data from the Human Cell Atlas (HCA) | ||
vignette: > | ||
% \VignetteIndexEntry{Accessing data from the Human Cell Atlas} | ||
% \VignetteEngine{knitr::rmarkdown} | ||
%\VignetteEncoding{UTF-8} | ||
output: | ||
html_document: | ||
mathjax: null | ||
bibliography: references.bib | ||
--- | ||
|
||
```{r setup, include=FALSE} | ||
knitr::opts_chunk$set(echo = TRUE, cache = TRUE) | ||
``` | ||
|
||
# HCA Project | ||
|
||
The Human Cell Atlas (HCA) is a large project that aims to learn from and map | ||
every cell type in the human body. The project extracts spatial and molecular | ||
characteristics in order to understand cellular function and networks. It is an | ||
international collaborative that charts healthy cells in the human body at all | ||
ages. There are about 37.2 trillion cells in the human body. To read more about | ||
the project, head over to their website at https://www.humancellatlas.org. | ||
|
||
# CELLxGENE | ||
|
||
CELLxGENE is a database and a suite of tools that help scientists to find, | ||
download, explore, analyze, annotate, and publish single cell data. It includes | ||
several analytic and visualization tools to help you to discover single cell | ||
data patterns. To see the list of tools, browse to | ||
https://cellxgene.cziscience.com/. | ||
|
||
# CELLxGENE | Census | ||
|
||
The Census provides efficient computational tooling to access, query, and | ||
analyze all single-cell RNA data from CZ CELLxGENE Discover. Using a new access | ||
paradigm of cell-based slicing and querying, you can interact with the data | ||
through TileDB-SOMA, or get slices in AnnData or Seurat objects, thus | ||
accelerating your research by significantly minimizing data harmonization. To | ||
read more, see https://chanzuckerberg.github.io/cellxgene-census/. | ||
|
||
# The CuratedAtlasQueryR Project | ||
|
||
To systematically characterize the immune system across tissues, demographics | ||
and multiple studies, single cell transcriptomics data was harmonized from the | ||
CELLxGENE database. Data from 28,975,366 cells that cover 156 tissues (excluding | ||
cell cultures), 12,981 samples, and 324 studies were collected. The metadata was | ||
standardized, including sample identifiers, tissue labels (based on anatomy) and | ||
age. Also, the gene-transcript abundance of all samples was harmonized by | ||
putting values on the positive natural scale (i.e. non-logarithmic). | ||
|
||
To model the immune system across studies, we adopted a consistent immune | ||
cell-type ontology appropriate for lymphoid and non-lymphoid tissues. We applied | ||
a consensus cell labeling strategy between the Seurat blueprint and Monaco | ||
[-@Monaco2019] to minimize biases in immune cell classification from | ||
study-specific standards. | ||
|
||
`CuratedAtlasQueryR` supports data access and programmatic exploration of the | ||
harmonized atlas. Cells of interest can be selected based on ontology, tissue of | ||
origin, demographics, and disease. For example, the user can select CD4 T helper | ||
cells across healthy and diseased lymphoid tissue. The data for the selected | ||
cells can be downloaded locally into popular single-cell data containers. | ||
Pseudobulk counts are also available to facilitate large-scale, summary analyses of | ||
transcriptional profiles. This platform offers a standardized workflow for | ||
accessing atlas-level datasets programmatically and reproducibly. | ||
|
||
```{r,echo=FALSE} | ||
knitr::include_graphics( | ||
"figures/HCA_sccomp_SUPPLEMENTARY_technical_cartoon_curatedAtlasQuery.png" | ||
) | ||
``` | ||
|
||
# Data Sources in R / Bioconductor | ||
|
||
There are a few options to access single cell data with R / Bioconductor. | ||
|
||
| Package | Target | Description | | ||
|---------|-------------|---------| | ||
| [hca](https://bioconductor.org/packages/hca) | [HCA Data Portal API](https://www.humancellatlas.org/data-portal/) | Project, Sample, and File level HCA data | | ||
| [cellxgenedp](https://bioconductor.org/packages/cellxgenedp) | [CellxGene](https://cellxgene.cziscience.com/) | Human and mouse SC data including HCA | | ||
| [CuratedAtlasQueryR](https://stemangiola.github.io/CuratedAtlasQueryR/) | [CellxGene](https://cellxgene.cziscience.com/) | Fine-grained, query-capable CELLxGENE data including HCA | | ||
|
||
# Installation | ||
|
||
```{r,eval=FALSE} | ||
if (!requireNamespace("BiocManager", quietly = TRUE)) | ||
install.packages("BiocManager") | ||
BiocManager::install("stemangiola/CuratedAtlasQueryR") | ||
``` | ||
|
||
# Package load | ||
|
||
```{r,include=TRUE,results="hide",message=FALSE,warning=FALSE} | ||
library(CuratedAtlasQueryR) | ||
library(dplyr) | ||
``` | ||
|
||
# HCA Metadata | ||
|
||
The metadata allows the user to get a lay of the land of what is available | ||
via the package. In this example, we are using the sample database URL which | ||
allows us to get a small and quick subset of the available metadata. | ||
|
||
```{r} | ||
metadata <- get_metadata(remote_url = CuratedAtlasQueryR::SAMPLE_DATABASE_URL) | ||
``` | ||
|
||
Get a view of the first 10 columns in the metadata with `glimpse()`. | ||
|
||
```{r} | ||
metadata |> | ||
select(1:10) |> | ||
glimpse() | ||
``` | ||
|
||
# A note on the piping operator | ||
|
||
The vignette materials provided by `CuratedAtlasQueryR` show the use of the | ||
'native' R pipe (introduced in R version `4.1.0`). For those not familiar | ||
with the pipe operator (`|>`), it allows you to chain functions by passing the | ||
left-hand side (LHS) to the first input (typically) on the right-hand side | ||
(RHS). The `_` placeholder, available from R version `4.2.0`, passes the LHS to | ||
a named argument instead. | ||
|
||
In this example, we are extracting the `iris` data set from the `datasets` | ||
package and 'then' taking a subset where the sepal lengths are greater than 5 | ||
and 'then' summarizing the data for each level in the `Species` variable with a | ||
`mean`. The pipe operator can be read as 'then'. | ||
|
||
```{r} | ||
data("iris", package = "datasets") | ||
iris |> | ||
subset(Sepal.Length > 5) |> | ||
aggregate(. ~ Species, data = _, mean) | ||
``` | ||
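
As a rough illustration of what the pipe does, the sketch below writes the same `iris` summary twice, once as nested calls and once with `|>`, and checks that the two results agree. The object names `nested` and `piped` are made up for this example, and the `_` placeholder assumes R 4.2.0 or later.

```r
# Load the built-in iris data set.
data("iris", package = "datasets")

# Nested form: read from the inside out.
nested <- aggregate(
    . ~ Species,
    data = subset(iris, Sepal.Length > 5),
    FUN = mean
)

# Piped form: read from top to bottom, each step feeding the next.
piped <- iris |>
    subset(Sepal.Length > 5) |>
    aggregate(. ~ Species, data = _, FUN = mean)

# Compare the two results; they should hold the same per-species means.
all.equal(nested, piped)
```

The piped form tends to be easier to extend, since a new step is added as a new line instead of another layer of parentheses.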
|
||
# Summarizing the metadata | ||
|
||
For each distinct tissue and dataset combination, count the number of datasets | ||
by tissue type. | ||
|
||
```{r} | ||
metadata |> | ||
distinct(tissue, dataset_id) |> | ||
count(tissue) | ||
``` | ||
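
To see why `distinct()` comes before `count()`, it may help to try both on a tiny table. The sketch below uses a hypothetical `toy_metadata` data frame with one row per cell (the column values are invented for illustration and are not taken from the atlas).

```r
library(dplyr)

# Hypothetical toy metadata: one row per cell.
toy_metadata <- data.frame(
    cell_id = paste0("cell", 1:6),
    tissue = c("lung", "lung", "lung", "blood", "blood", "lung"),
    dataset_id = c("d1", "d1", "d2", "d3", "d3", "d2")
)

# Counting rows directly gives the number of cells per tissue.
cells_per_tissue <- toy_metadata |>
    count(tissue)

# Dropping duplicate tissue and dataset pairs first gives datasets per tissue.
datasets_per_tissue <- toy_metadata |>
    distinct(tissue, dataset_id) |>
    count(tissue)

cells_per_tissue
datasets_per_tissue
```

In this toy table `lung` has four cells but only two datasets, so the two summaries answer different questions.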
|
||
# Columns available in the metadata | ||
|
||
```{r} | ||
# metadata is a lazy database table, so colnames() is used to list its columns | ||
head(colnames(metadata), 10) | ||
``` | ||
|
||
# Available assays | ||
|
||
```{r} | ||
metadata |> | ||
distinct(assay, dataset_id) |> | ||
count(assay) | ||
``` | ||
|
||
# Available organisms | ||
|
||
```{r} | ||
metadata |> | ||
distinct(organism, dataset_id) |> | ||
count(organism) | ||
``` | ||
|
||
## Download single-cell RNA sequencing counts | ||
|
||
The data can be provided as either "counts" or counts per million "cpm" as given | ||
by the `assays` argument in the `get_single_cell_experiment()` function. By | ||
default, the `SingleCellExperiment` provided will contain only the 'counts' | ||
data. | ||
|
||
### Query raw counts | ||
|
||
```{r} | ||
single_cell_counts <- | ||
metadata |> | ||
dplyr::filter( | ||
ethnicity == "African" & | ||
stringr::str_like(assay, "%10x%") & | ||
tissue == "lung parenchyma" & | ||
stringr::str_like(cell_type, "%CD4%") | ||
) |> | ||
get_single_cell_experiment() | ||
single_cell_counts | ||
``` | ||
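
The `%` characters in the patterns above act as SQL-style wildcards that match any run of characters, so `"%10x%"` keeps any assay whose name contains `10x`. The sketch below tries this on a made-up character vector; the assay labels are illustrative only, and it assumes a version of `stringr` recent enough to provide `str_like()`.

```r
# Hypothetical assay labels, for illustration only.
assay_names <- c("10x 3' v2", "10x 5' v1", "Smart-seq2", "sci-RNA-seq")

# "%" matches any sequence of characters, including none.
is_10x <- stringr::str_like(assay_names, "%10x%")

# Keep only the labels that matched the pattern.
assay_names[is_10x]
```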
|
||
### Query counts scaled per million | ||
|
||
This is helpful if just a few genes are of interest, as they can be compared | ||
across samples. | ||
|
||
```{r} | ||
metadata |> | ||
dplyr::filter( | ||
ethnicity == "African" & | ||
stringr::str_like(assay, "%10x%") & | ||
tissue == "lung parenchyma" & | ||
stringr::str_like(cell_type, "%CD4%") | ||
) |> | ||
get_single_cell_experiment(assays = "cpm") | ||
``` | ||
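
For intuition, counts per million is commonly computed by dividing each count by the total count of its cell and multiplying by one million. The sketch below applies that common definition to a made-up matrix. It is an assumption that the package's `cpm` assay follows exactly this formula, and the names `toy_counts` and `toy_cpm` are hypothetical.

```r
# Hypothetical raw counts: genes in rows, cells in columns.
toy_counts <- matrix(
    c(10, 0, 90, 5, 15, 30),
    nrow = 3,
    dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2"))
)

# Library size: total counts per cell (column).
library_size <- colSums(toy_counts)

# Divide each column by its library size, then scale to one million.
toy_cpm <- sweep(toy_counts, 2, library_size, FUN = "/") * 1e6

toy_cpm
colSums(toy_cpm)
```

After scaling, every cell sums to one million, which is what makes a gene's values comparable between cells with different sequencing depth.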
|
||
### Extract only a subset of genes | ||
|
||
```{r} | ||
single_cell_counts <- | ||
metadata |> | ||
dplyr::filter( | ||
ethnicity == "African" & | ||
stringr::str_like(assay, "%10x%") & | ||
tissue == "lung parenchyma" & | ||
stringr::str_like(cell_type, "%CD4%") | ||
) |> | ||
get_single_cell_experiment(assays = "cpm", features = "PUM1") | ||
single_cell_counts | ||
``` | ||
|
||
### Extracting counts as a Seurat object | ||
|
||
If needed, the H5 `SingleCellExperiment` can be converted into a Seurat object. | ||
Note that it may take a long time and use a lot of memory depending on how many | ||
cells you are requesting. | ||
|
||
```{r,eval=FALSE} | ||
single_cell_counts <- | ||
metadata |> | ||
dplyr::filter( | ||
ethnicity == "African" & | ||
stringr::str_like(assay, "%10x%") & | ||
tissue == "lung parenchyma" & | ||
stringr::str_like(cell_type, "%CD4%") | ||
) |> | ||
get_seurat() | ||
single_cell_counts | ||
``` | ||
|
||
## Save your `SingleCellExperiment` | ||
|
||
### Saving as HDF5 | ||
|
||
The recommended way of saving these `SingleCellExperiment` objects, if | ||
necessary, is to use `saveHDF5SummarizedExperiment` from the `HDF5Array` | ||
package. | ||
|
||
```{r, eval=FALSE} | ||
single_cell_counts |> HDF5Array::saveHDF5SummarizedExperiment("single_cell_counts") | ||
``` | ||
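
A saved object can later be read back with `loadHDF5SummarizedExperiment()` from the same package. The sketch below round-trips a tiny, hypothetical `toy_se` object through a temporary directory. It assumes the `SummarizedExperiment` and `HDF5Array` packages are installed, and it stands in for a real download, not for the package's own workflow.

```r
library(SummarizedExperiment)
library(HDF5Array)

# Hypothetical toy object standing in for a downloaded experiment.
toy_se <- SummarizedExperiment(
    assays = list(counts = matrix(
        1:6,
        nrow = 3,
        dimnames = list(paste0("gene", 1:3), c("cell1", "cell2"))
    ))
)

# Write to a temporary directory; replace = TRUE overwrites an earlier copy.
out_dir <- file.path(tempdir(), "toy_se")
saveHDF5SummarizedExperiment(toy_se, dir = out_dir, replace = TRUE)

# Read it back; the assay data stay on disk until they are needed.
reloaded <- loadHDF5SummarizedExperiment(out_dir)
reloaded
```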
|
||
# Exercises | ||
|
||
1. Use `count` and `arrange` to get the number of cells per tissue in descending | ||
order. | ||
|
||
```{r} | ||
# enter your code here | ||
``` | ||
|
||
<details> <summary> Answer 1 </summary> | ||
|
||
```{r,eval=FALSE} | ||
metadata |> | ||
count(tissue) |> | ||
arrange(-n) | ||
``` | ||
|
||
</details> | ||
|
||
2. Use `dplyr` verbs to group by `tissue` and `cell_type` and tally the number | ||
of cells in each tissue and cell type combination. Which tissue has the most | ||
numerous cell type? | ||
|
||
```{r} | ||
# enter your code here | ||
``` | ||
|
||
<details> <summary> Answer 2 </summary> | ||
|
||
```{r,eval=FALSE} | ||
metadata |> | ||
group_by(tissue, cell_type) |> | ||
count() |> | ||
arrange(-n) | ||
``` | ||
|
||
</details> | ||
|
||
3. Spot some differences between the `tissue` and `tissue_harmonised` columns. | ||
Use `count` to summarise. | ||
|
||
```{r} | ||
# enter your code here | ||
``` | ||
|
||
<details> <summary> Answer 3 </summary> | ||
|
||
```{r} | ||
metadata |> | ||
count(tissue) |> | ||
arrange(-n) | ||
metadata |> | ||
count(tissue_harmonised) |> | ||
arrange(-n) | ||
``` | ||
|
||
</details> | ||
|
||
To see the full list of curated columns in the metadata, see the Details section | ||
in the `?get_metadata` documentation page. | ||
|
||
4. Now that we are a little familiar with navigating the metadata, let's obtain | ||
a `SingleCellExperiment` of 10X scRNA-seq counts of `cd8 tem` `lung` cells for | ||
females older than `80` with `COVID-19`. Note: Use the harmonized columns, where | ||
possible. | ||
|
||
```{r} | ||
# enter your code here | ||
``` | ||
|
||
<details> <summary> Answer 4 </summary> | ||
|
||
```{r} | ||
metadata |> | ||
dplyr::filter( | ||
sex == "female" & | ||
age_days > 80 * 365 & | ||
stringr::str_like(assay, "%10x%") & | ||
disease == "COVID-19" & | ||
tissue_harmonised == "lung" & | ||
cell_type_harmonised == "cd8 tem" | ||
) |> | ||
get_single_cell_experiment() | ||
``` | ||
|
||
</details> | ||
|
||
# Session Info | ||
|
||
```{r} | ||
sessionInfo() | ||
``` | ||
|
||
# Acknowledgements | ||
|
||
Thank you to [Stefano Mangiola](https://github.com/stemangiola) and his team for | ||
developing | ||
[CuratedAtlasQueryR](https://github.com/stemangiola/CuratedAtlasQueryR) and | ||
graciously providing the content from their vignette. Make sure to keep an eye | ||
out for their publication for proper citation. Their bioRxiv paper can be found | ||
at <https://www.biorxiv.org/content/10.1101/2023.06.08.542671v1>. | ||
|
||
# References |