Copied all vignettes into episodes directory as a first step. This probably won't build, but committing to have the original files in history.
csmagnano committed Jan 10, 2024
1 parent b696f7f commit 4bd95b2
Showing 8 changed files with 2,281 additions and 0 deletions.
436 changes: 436 additions & 0 deletions episodes/cell_type_annotation.Rmd

Large diffs are not rendered by default.

414 changes: 414 additions & 0 deletions episodes/eda_qc.Rmd

Large diffs are not rendered by default.

362 changes: 362 additions & 0 deletions episodes/hca.Rmd
@@ -0,0 +1,362 @@
---
title: Accessing data from the Human Cell Atlas (HCA)
vignette: >
  % \VignetteIndexEntry{Accessing data from the Human Cell Atlas}
  % \VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  html_document:
    mathjax: null
bibliography: references.bib
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
```

# HCA Project

The Human Cell Atlas (HCA) is a large project that aims to learn from and map
every cell type in the human body. The project extracts spatial and molecular
characteristics in order to understand cellular function and networks. It is an
international collaborative that charts healthy cells in the human body at all
ages. There are about 37.2 trillion cells in the human body. To read more about
the project, head over to their website at https://www.humancellatlas.org.

# CELLxGENE

CELLxGENE is a database and a suite of tools that help scientists to find,
download, explore, analyze, annotate, and publish single cell data. It includes
several analytic and visualization tools to help you to discover single cell
data patterns. To see the list of tools, browse to
https://cellxgene.cziscience.com/.

# CELLxGENE | Census

The Census provides efficient computational tooling to access, query, and
analyze all single-cell RNA data from CZ CELLxGENE Discover. Using a new access
paradigm of cell-based slicing and querying, you can interact with the data
through TileDB-SOMA, or get slices in AnnData or Seurat objects, thus
accelerating your research by significantly minimizing data harmonization. To
learn more, see https://chanzuckerberg.github.io/cellxgene-census/.

# The CuratedAtlasQueryR Project

To systematically characterize the immune system across tissues, demographics
and multiple studies, single cell transcriptomics data was harmonized from the
CELLxGENE database. Data from 28,975,366 cells that cover 156 tissues (excluding
cell cultures), 12,981 samples, and 324 studies were collected. The metadata was
standardized, including sample identifiers, tissue labels (based on anatomy) and
age. Also, the gene-transcript abundance of all samples was harmonized by
putting values on the positive natural scale (i.e. non-logarithmic).

To model the immune system across studies, we adopted a consistent immune
cell-type ontology appropriate for lymphoid and non-lymphoid tissues. We applied
a consensus cell labeling strategy between the Seurat blueprint and Monaco
[-@Monaco2019] to minimize biases in immune cell classification from
study-specific standards.

`CuratedAtlasQueryR` supports data access and programmatic exploration of the
harmonized atlas. Cells of interest can be selected based on ontology, tissue of
origin, demographics, and disease. For example, the user can select CD4 T helper
cells across healthy and diseased lymphoid tissue. The data for the selected
cells can be downloaded locally into popular single-cell data containers.
Pseudobulk counts are also available to facilitate large-scale, summary analyses of
transcriptional profiles. This platform offers a standardized workflow for
accessing atlas-level datasets programmatically and reproducibly.

```{r,echo=FALSE}
knitr::include_graphics(
"figures/HCA_sccomp_SUPPLEMENTARY_technical_cartoon_curatedAtlasQuery.png"
)
```

# Data Sources in R / Bioconductor

There are a few options to access single cell data with R / Bioconductor.

| Package | Target | Description |
|---------|-------------|---------|
| [hca](https://bioconductor.org/packages/hca) | [HCA Data Portal API](https://www.humancellatlas.org/data-portal/) | Project, Sample, and File level HCA data |
| [cellxgenedp](https://bioconductor.org/packages/cellxgenedp) | [CELLxGENE](https://cellxgene.cziscience.com/) | Human and mouse single-cell data, including HCA |
| [CuratedAtlasQueryR](https://stemangiola.github.io/CuratedAtlasQueryR/) | [CELLxGENE](https://cellxgene.cziscience.com/) | Fine-grained, queryable CELLxGENE data, including HCA |

# Installation

```{r,eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("stemangiola/CuratedAtlasQueryR")
```

# Package load

```{r,include=TRUE,results="hide",message=FALSE,warning=FALSE}
library(CuratedAtlasQueryR)
library(dplyr)
```

# HCA Metadata

The metadata allows the user to get a lay of the land of what is available
via the package. In this example, we are using the sample database URL which
allows us to get a small and quick subset of the available metadata.

```{r}
metadata <- get_metadata(remote_url = CuratedAtlasQueryR::SAMPLE_DATABASE_URL)
```

Get a view of the first 10 columns in the metadata with `glimpse`:

```{r}
metadata |>
select(1:10) |>
glimpse()
```

# A note on the piping operator

The vignette materials provided by `CuratedAtlasQueryR` show the use of the
'native' R pipe (introduced in R version `4.1.0`). For those not familiar
with the pipe operator (`|>`), it allows you to chain functions by passing the
left-hand side (LHS) as the first argument (by default) of the function call on
the right-hand side (RHS). To pass the LHS to a different, named argument, use
the `_` placeholder, which requires R version `4.2.0` or later.

In this example, we are extracting the `iris` data set from the `datasets`
package and 'then' taking a subset where the sepal lengths are greater than 5
and 'then' summarizing the data for each level in the `Species` variable with a
`mean`. The pipe operator can be read as 'then'.

```{r}
data("iris", package = "datasets")
iris |>
subset(Sepal.Length > 5) |>
aggregate(. ~ Species, data = _, mean)
```
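
Because the native pipe is resolved when the code is parsed, a piped expression
is the same call as its nested equivalent. The short sketch below uses only base
R and assumes R `4.2.0` or later for the `_` placeholder; the object names
`same_call`, `nested`, and `piped` are illustrative and not part of the package.

```{r}
# The parser rewrites `x |> f(y)` into `f(x, y)` before anything is evaluated.
same_call <- identical(quote(iris |> head(3)), quote(head(iris, 3)))

# The nested form of the pipeline shown above...
nested <- aggregate(
    . ~ Species,
    data = subset(iris, Sepal.Length > 5),
    mean
)

# ...and the piped form, with `_` marking where the LHS is inserted.
piped <- iris |>
    subset(Sepal.Length > 5) |>
    aggregate(. ~ Species, data = _, mean)

same_call
identical(nested, piped)
```

Both forms return the same data frame; the piped form simply reads top to
bottom in the order the steps are applied.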

# Summarizing the metadata

For each distinct tissue and dataset combination, count the number of datasets
by tissue type.

```{r}
metadata |>
distinct(tissue, dataset_id) |>
count(tissue)
```
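
The metadata has one row per cell, so calling `count(tissue)` alone would count
cells rather than datasets. The sketch below illustrates the difference on a
small, made-up table; `toy_metadata` and its values are hypothetical and only
mimic the `dataset_id` and `tissue` columns.

```{r}
library(dplyr)

# Hypothetical stand-in for the metadata: one row per cell.
toy_metadata <- data.frame(
    cell_ = c("c1", "c2", "c3", "c4", "c5"),
    dataset_id = c("d1", "d1", "d2", "d3", "d3"),
    tissue = c("blood", "blood", "blood", "lung", "lung")
)

# Counting rows directly gives the number of cells per tissue.
cells_per_tissue <- toy_metadata |>
    count(tissue)

# Dropping duplicate tissue/dataset pairs first gives datasets per tissue.
datasets_per_tissue <- toy_metadata |>
    distinct(tissue, dataset_id) |>
    count(tissue)

cells_per_tissue
datasets_per_tissue
```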

# Columns available in the metadata

```{r}
head(names(metadata), 10)
```

# Available assays

```{r}
metadata |>
distinct(assay, dataset_id) |>
count(assay)
```

# Available organisms

```{r}
metadata |>
distinct(organism, dataset_id) |>
count(organism)
```

## Download single-cell RNA sequencing counts

The data can be provided as either "counts" or counts per million "cpm" as given
by the `assays` argument in the `get_single_cell_experiment()` function. By
default, the `SingleCellExperiment` provided will contain only the 'counts'
data.

### Query raw counts

```{r}
single_cell_counts <-
    metadata |>
    dplyr::filter(
        ethnicity == "African" &
            stringr::str_like(assay, "%10x%") &
            tissue == "lung parenchyma" &
            stringr::str_like(cell_type, "%CD4%")
    ) |>
    get_single_cell_experiment()
single_cell_counts
```
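
The `stringr::str_like()` calls in the filter use SQL `LIKE` patterns, where `%`
matches any run of characters. As a minimal sketch of how such a pattern
behaves on an ordinary character vector (assuming `stringr` 1.5.0 or later; the
assay labels below are made up for illustration):

```{r}
library(stringr)

# Hypothetical assay labels, not taken from the atlas.
assay_labels <- c("10x 3' v3", "Smart-seq2", "10x 5' v1")

# "%10x%" matches any label that contains "10x" anywhere in the string.
is_10x <- str_like(assay_labels, "%10x%")

assay_labels[is_10x]
```

On a database-backed table such as `metadata`, the same call is presumably
translated into a SQL `LIKE` clause and evaluated by the database rather than
in R.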

### Query counts scaled per million

This is helpful if just a few genes are of interest, as their values can be
compared across samples.

```{r}
metadata |>
    dplyr::filter(
        ethnicity == "African" &
            stringr::str_like(assay, "%10x%") &
            tissue == "lung parenchyma" &
            stringr::str_like(cell_type, "%CD4%")
    ) |>
    get_single_cell_experiment(assays = "cpm")
```
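
To make the scaling concrete, the sketch below computes counts per million by
hand on a tiny made-up matrix, assuming the usual definition (each cell's counts
divided by that cell's total and multiplied by one million). It is not taken
from the package, which may compute its `cpm` assay differently; `toy_counts`
and `toy_cpm` are illustrative names.

```{r}
# Hypothetical count matrix: genes in rows, cells in columns.
toy_counts <- matrix(
    c(10, 90, 300, 700),
    nrow = 2,
    dimnames = list(c("geneA", "geneB"), c("cell1", "cell2"))
)

# Divide each column by its library size, then scale to one million.
toy_cpm <- sweep(toy_counts, 2, colSums(toy_counts), "/") * 1e6

toy_cpm
colSums(toy_cpm)
```

After scaling, every cell sums to one million, so a gene's value can be compared
between cells that were sequenced to different depths.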

### Extract only a subset of genes

```{r}
single_cell_counts <-
    metadata |>
    dplyr::filter(
        ethnicity == "African" &
            stringr::str_like(assay, "%10x%") &
            tissue == "lung parenchyma" &
            stringr::str_like(cell_type, "%CD4%")
    ) |>
    get_single_cell_experiment(assays = "cpm", features = "PUM1")
single_cell_counts
```

### Extracting counts as a Seurat object

If needed, the H5 `SingleCellExperiment` can be converted into a Seurat object.
Note that it may take a long time and use a lot of memory depending on how many
cells you are requesting.

```{r,eval=FALSE}
single_cell_counts <-
    metadata |>
    dplyr::filter(
        ethnicity == "African" &
            stringr::str_like(assay, "%10x%") &
            tissue == "lung parenchyma" &
            stringr::str_like(cell_type, "%CD4%")
    ) |>
    get_seurat()
single_cell_counts
```

## Save your `SingleCellExperiment`

### Saving as HDF5

The recommended way of saving these `SingleCellExperiment` objects, if
necessary, is to use `saveHDF5SummarizedExperiment` from the `HDF5Array`
package.

```{r, eval=FALSE}
# HDF5Array is not attached above, so call the function with its namespace.
single_cell_counts |>
    HDF5Array::saveHDF5SummarizedExperiment("single_cell_counts")
```
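
A saved object can later be read back with `loadHDF5SummarizedExperiment()` from
the same package. The following is a minimal round-trip sketch on a small,
made-up `SummarizedExperiment` rather than on the atlas data; it assumes the
`SummarizedExperiment` and `HDF5Array` packages are installed, and `toy_se`,
`out_dir`, and `reloaded` are illustrative names.

```{r, eval=FALSE}
library(SummarizedExperiment)
library(HDF5Array)

# Hypothetical 2-gene by 3-cell experiment standing in for a real query result.
toy_se <- SummarizedExperiment(
    assays = list(
        counts = matrix(
            1:6,
            nrow = 2,
            dimnames = list(c("g1", "g2"), c("c1", "c2", "c3"))
        )
    )
)

# Write to a temporary directory; replace = TRUE overwrites an earlier save.
out_dir <- file.path(tempdir(), "toy_se_h5")
saveHDF5SummarizedExperiment(toy_se, dir = out_dir, replace = TRUE)

# Read it back; the assay is now backed by the HDF5 file on disk.
reloaded <- loadHDF5SummarizedExperiment(out_dir)
reloaded
```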

# Exercises

1. Use `count` and `arrange` to get the number of cells per tissue in descending
order.

```{r}
# enter your code here
```

<details> <summary> Answer 1 </summary>

```{r,eval=FALSE}
metadata |>
count(tissue) |>
arrange(-n)
```

</details>

2. Use `dplyr` verbs to group by `tissue` and `cell_type` and tally the number
of cells in each tissue and cell type combination. Which tissue contains the
most abundant cell type?

```{r}
# enter your code here
```

<details> <summary> Answer 2 </summary>

```{r,eval=FALSE}
metadata |>
group_by(tissue, cell_type) |>
count() |>
arrange(-n)
```

</details>

3. Spot some differences between the `tissue` and `tissue_harmonised` columns.
Use `count` to summarise.

```{r}
# enter your code here
```

<details> <summary> Answer 3 </summary>

```{r}
metadata |>
count(tissue) |>
arrange(-n)
metadata |>
count(tissue_harmonised) |>
arrange(-n)
```

</details>
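
Another way to compare the two columns is to count them together, so that each
original label appears next to the harmonised label it was mapped to. The sketch
below shows the idea on a made-up table; the labels in `toy_tissues` are
hypothetical and do not come from the atlas.

```{r}
library(dplyr)

# Hypothetical labels: several original tissues collapse into one harmonised tissue.
toy_tissues <- data.frame(
    tissue = c(
        "lung parenchyma", "upper lobe of left lung", "blood", "venous blood"
    ),
    tissue_harmonised = c("lung", "lung", "blood", "blood")
)

# One row per pair of harmonised and original label, with the number of cells.
toy_tissues |>
    count(tissue_harmonised, tissue)

# Number of distinct original labels behind each harmonised label.
harmonised_summary <- toy_tissues |>
    distinct(tissue_harmonised, tissue) |>
    count(tissue_harmonised, name = "n_original_labels")

harmonised_summary
```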

To see the full list of curated columns in the metadata, see the Details section
in the `?get_metadata` documentation page.

4. Now that we are a little familiar with navigating the metadata, let's obtain
a `SingleCellExperiment` of 10X scRNA-seq counts of `cd8 tem` `lung` cells for
females older than `80` with `COVID-19`. Note: Use the harmonized columns, where
possible.

```{r}
# enter your code here
```

<details> <summary> Answer 4 </summary>

```{r}
metadata |>
    dplyr::filter(
        sex == "female" &
            age_days > 80 * 365 &
            stringr::str_like(assay, "%10x%") &
            disease == "COVID-19" &
            tissue_harmonised == "lung" &
            cell_type_harmonised == "cd8 tem"
    ) |>
    get_single_cell_experiment()
```

</details>
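
The answer expresses the age cut-off in days because the metadata stores age in
the `age_days` column. As a small sketch of that conversion on a made-up table
(the helper `years_to_days()` and the `toy_donors` values are hypothetical, and
the conversion ignores leap days, as the answer above does):

```{r}
library(dplyr)

# Hypothetical helper: approximate conversion from years to days.
years_to_days <- function(years) years * 365

# Hypothetical donors; 29565 days is 81 years under this approximation.
toy_donors <- data.frame(
    sex = c("female", "female", "male"),
    age_days = c(29565, 25000, 30000)
)

# Conditions separated by commas in filter() are combined with a logical AND.
older_females <- toy_donors |>
    filter(sex == "female", age_days > years_to_days(80))

older_females
```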

# Session Info

```{r}
sessionInfo()
```

# Acknowledgements

Thank you to [Stefano Mangiola](https://github.com/stemangiola) and his team for
developing
[CuratedAtlasQueryR](https://github.com/stemangiola/CuratedAtlasQueryR) and
graciously providing the content from their vignette. Make sure to keep an eye
out for their publication for proper citation. Their bioRxiv paper can be found
at <https://www.biorxiv.org/content/10.1101/2023.06.08.542671v1>.

# References