Single-Cell-RNA-seq

One of the most popular single-cell isolation techniques is the Drop-seq approach, in which a single cell is encapsulated in a microdroplet containing a bead with unique barcodes, primers, and enzymes, and where cDNA synthesis and library generation are performed.

Macosko et al., 2015

One of the most widely used droplet-based technologies is the 10x Genomics Chromium platform. Svensson et al., 2017 compared the different scRNA-seq methods. If your experiment aims to quantify and detect transcript levels, a droplet-based method (10x Chromium) is sufficient. If transcript isoforms are part of the study, a full-length method (Smart-seq2) is required.

Since my lab focuses on comparing transcript levels, I will only outline the experimental steps for the 10x Chromium protocol. Below is a schematic of the Chromium chip. Cells that have been enzymatically dissociated and mixed with reagent are loaded into the 2nd row of the chip. The microfluidics mix these cells with the beads and oil to form Gel Bead-in-Emulsions (GEMs) in the top row. The GEMs from this row are taken off the chip and placed on a thermocycler for cDNA synthesis, which also incorporates a unique barcode. The GEMs are then broken open, and the cDNA is amplified and cleaned up for sequencing. The remainder of the schematic outlines the data processing and visualization steps.

When setting up the experiment, it is important that samples are balanced (evenly distributed) across all stages of the experiment. This reduces sources of technical variation.

For example, suppose the Day 0 samples are sequenced on one flowcell and the Day 7 samples on another. You cannot determine whether the variation you observe comes from the stage or from technical variation between the sequencing runs. Therefore, samples from Day 0 and Day 7 should be run on both flowcells.

Processing scRNA-seq data has many similarities to traditional RNA-seq. Both go through the same initial steps of read quality assessment, alignment, and mapping quality assessment.

The major difference is that each library in scRNA-seq represents one cell instead of a population. This affects how the data should be filtered before analysis. Things to consider are:

  • Library amplification depth: the number of reads can differ from cell to cell.
  • Gene 'dropouts': a gene may have a moderate number of reads in one cell but none in another.

Both effects can be introduced by low starting material. One way to mitigate them is to profile many cells. It's difficult to estimate how many different types of cells you have in a sample, but there's a calculator that can estimate this based on some assumptions. Based on the 10x literature and published papers (PMID: 32795399, PMID: 32302522), I would say 5,000 cells per condition per replicate is good enough for complex samples (20 different cell types).

The recommended sequencing depth is a minimum of 20,000 read pairs per cell, using paired-end 50 bp reads.

However, it has been reported that the optimal allocation is to sequence one read per cell per gene, which corresponds to:

  • For humans: ~21,000 read pairs per cell
  • For mouse:  ~25,000 read pairs per cell
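The figures above can be turned into a rough sequencing budget. The sketch below is only an illustration: `total_read_pairs` is a hypothetical helper (not part of any 10x tool) that multiplies the number of samples, the cells per sample, and the read pairs per cell. The example values are the 6 samples, 5,000 cells, and 20,000 read pairs per cell mentioned in this README.

```shell
#!/usr/bin/env bash
# Hypothetical helper: total read pairs = samples x cells per sample x read pairs per cell.
total_read_pairs() {
  local samples=$1
  local cells_per_sample=$2
  local read_pairs_per_cell=$3
  echo $(( samples * cells_per_sample * read_pairs_per_cell ))
}

# 6 samples, 5,000 cells each, 20,000 read pairs per cell
total_read_pairs 6 5000 20000   # prints 600000000
```

Comparing this total with the output of the sequencing run you plan to use gives a quick idea of whether one flowcell is enough.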

An example of a sequencing run using the 10x chromium chip is outlined below. In this example you have multiple samples that are processed through multiple GEM wells which generate multiple libraries that are pooled into one flowcell. After demultiplexing, the 10x software cellranger performs a count separately for each GEM well. For example, if you have 6 samples (6 GEM wells) you have to run cellranger count six times. Then you can aggregate them with a single instance of cellranger aggr.

The cellranger software performs the following:

  • cellranger mkfastq: Demultiplexes raw base call (BCL) files and outputs FASTQ files, if the sequencing facility hasn't already done so.
  • cellranger count: Performs alignment, filtering, barcode counting, and UMI counting.
  • cellranger aggr: Aggregates outputs from multiple runs of cellranger count, normalizing those runs to the same sequencing depth. See Multi-Library Aggregation.

The output of the above is a UMI count matrix. The values in this matrix represent the number of molecules for each feature (i.e. gene; row) detected in each cell (column). This matrix is used with the Seurat R library to select and filter cells based on QC metrics, normalize and scale the data, and detect highly variable features.

Below are the steps to analyze the data starting with fastq files.

Read quality assessment

Run FastQC and MultiQC to assess read quality and quantity.

fastqc *.fastq.gz
multiqc .

Trimming

If trimming of low-quality bases and adapters is required, there are different trimming tools available.

Trim Galore is a wrapper around Cutadapt and FastQC that trims low-quality bases and adapters and writes the trimmed FASTQ files. It also has built-in adapter auto-detection. Cutadapt and FastQC must be installed before you can use Trim Galore.

# Check that cutadapt is installed
cutadapt --version
# Check that FastQC is installed
fastqc -v
# Install Trim Galore
curl -fsSL https://github.com/FelixKrueger/TrimGalore/archive/0.6.6.tar.gz -o trim_galore.tar.gz
tar xvzf trim_galore.tar.gz
# Run Trim Galore
~/TrimGalore-0.6.6/trim_galore -o fastqc_trimmed_results Sample_1.fastq Sample_2.fastq

Demultiplexing

This step assigns reads to their cell of origin using the cell barcode. The UMI (12 nt unique molecular identifier) tags individual molecules within a cell so that amplification duplicates can be collapsed. For Smart-seq2 or other paired-end full-transcript protocols, the data will usually already be demultiplexed. If the sequencing facility hasn't already demultiplexed the data, you have to do it yourself. In the droplet-based protocol, the cell barcode is attached to the read name and demultiplexing happens during the quantification step.

Cell Counting

Most experiments will have multiple samples, processed from multiple wells of a chromium chip (8 max). Each well represents a treatment, timepoint, or condition and has its own barcode to distinguish it from the other wells. You must perform the count step separately for each well, see the figure above.

The cellranger software requires a lot of memory (64 GB) and 16 CPU cores, so run it on a high-performance computing cluster (HPCC).

cellranger count --id=day1 \
--fastqs=/scRNA_seq/d1_N2B27/ \
--transcriptome=refdata-gex-mm10-2020-A
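Because the count step has to be repeated for every GEM well, it can help to generate the commands from a small sample sheet. The following is a minimal sketch, not a definitive pipeline: `build_count_commands` is a hypothetical helper, the fastq directories for `d0_2iL` and `day2` are assumed by analogy with the `day1` example, and `--localcores=16` and `--localmem=64` simply mirror the resources mentioned above. The function only prints the commands (a dry run) so they can be reviewed first.

```shell
#!/usr/bin/env bash
# Reference transcriptome used in the example above.
transcriptome="refdata-gex-mm10-2020-A"

# Hypothetical sample sheet: one "<run id> <fastq directory>" pair per line.
# The d0_2iL and day2 fastq paths are assumptions for illustration.
samples="d0_2iL /scRNA_seq/d0_2iL/
day1 /scRNA_seq/d1_N2B27/
day2 /scRNA_seq/d2_N2B27/"

# Read "<id> <fastq_dir>" pairs from stdin and print one
# cellranger count command per GEM well (nothing is executed).
build_count_commands() {
  local id fastq_dir
  while read -r id fastq_dir; do
    [ -z "$id" ] && continue   # skip empty lines
    echo "cellranger count --id=$id --fastqs=$fastq_dir --transcriptome=$transcriptome --localcores=16 --localmem=64"
  done
}

printf '%s\n' "$samples" | build_count_commands
```

Once the printed commands look right, they can be pasted into a job script or piped to `bash` to run one after another.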

The output files and a summary of what each one contains are listed below:

| File Name | Description |
|---|---|
| web_summary.html | Run summary metrics and charts in HTML format |
| metrics_summary.csv | Run summary metrics in CSV format |
| possorted_genome_bam.bam | Reads aligned to the genome and transcriptome annotated with barcode information |
| possorted_genome_bam.bam.bai | Index for possorted_genome_bam.bam |
| filtered_feature_bc_matrix | Filtered feature-barcode matrices containing only cellular barcodes in MEX format. (In Targeted Gene Expression samples, the non-targeted genes are not present.) |
| filtered_feature_bc_matrix.h5 | Filtered feature-barcode matrices containing only cellular barcodes in HDF5 format. (In Targeted Gene Expression samples, the non-targeted genes are not present.) |
| raw_feature_bc_matrix | Unfiltered feature-barcode matrices containing all barcodes in MEX format |
| raw_feature_bc_matrix.h5 | Unfiltered feature-barcode matrices containing all barcodes in HDF5 format |
| analysis | Secondary analysis data including dimensionality reduction, cell clustering, and differential expression |
| molecule_info.h5 | Molecule-level information used by cellranger aggr to aggregate samples into larger datasets |
| cloupe.cloupe | Loupe Browser visualization and analysis file |
| feature_reference.csv | (Feature Barcode only) Feature Reference CSV file |
| target_panel.csv | (Targeted GEX only) Targeted panel CSV file |
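A quick sanity check on the filtered matrix is to count its features and cell barcodes. The sketch below assumes the MEX directory contains gzipped `features.tsv.gz` and `barcodes.tsv.gz` files with one entry per line, as written by recent Cell Ranger versions; `matrix_dims` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report the number of features (rows) and
# cell barcodes (columns) in a MEX-format matrix directory.
matrix_dims() {
  local dir=$1
  local n_features n_cells
  # Each file lists one feature or one barcode per line.
  n_features=$(gzip -dc "$dir/features.tsv.gz" | wc -l)
  n_cells=$(gzip -dc "$dir/barcodes.tsv.gz" | wc -l)
  # Arithmetic expansion strips any padding that wc adds.
  echo "features=$((n_features)) cells=$((n_cells))"
}

# Example call (path assumed from the day1 run above):
#   matrix_dims day1/outs/filtered_feature_bc_matrix
```

The number of cells reported here should match the estimated number of cells shown in web_summary.html.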

Aggregating Multiple GEM Wells

A CSV file must first be created with the following columns:

  • library_id: Unique identifier for this input GEM well. This will be used for labeling purposes only; it doesn't need to match any previous ID you've assigned to the GEM well.
  • molecule_h5: Path to the molecule_info.h5 file produced by cellranger count.

| library_id | molecule_h5 |
|---|---|
| day0_naive | /scRNA_seq/d0_2iL/d0_2iL/outs/molecule_info.h5 |
| day1_epi | /scRNA_seq/d1_N2B27/day1/outs/molecule_info.h5 |
| day2_epi | /scRNA_seq/d2_N2B27/day2/outs/molecule_info.h5 |
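The table above has to be saved as a comma-separated file. A minimal sketch for generating it is shown here: `build_aggr_csv` is a hypothetical helper that takes "<library_id> <run directory>" pairs and appends `outs/molecule_info.h5` to each run directory. The header follows the column names used in this README; check the column names your Cell Ranger version expects before running aggr.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "<library_id> <run directory>" pairs from stdin
# and print the aggregation CSV (header plus one row per GEM well).
build_aggr_csv() {
  local library_id run_dir
  echo "library_id,molecule_h5"
  while read -r library_id run_dir; do
    [ -z "$library_id" ] && continue   # skip empty lines
    echo "$library_id,$run_dir/outs/molecule_info.h5"
  done
}

# Library ids and run directories taken from the table above.
printf '%s\n' \
  "day0_naive /scRNA_seq/d0_2iL/d0_2iL" \
  "day1_epi /scRNA_seq/d1_N2B27/day1" \
  "day2_epi /scRNA_seq/d2_N2B27/day2" | build_aggr_csv
```

Redirect the output to a file (for example `> data.csv`) to create the file that is passed to `--csv`.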

If you have different batches, such as samples run with different 10x kits (for example version 2 and version 3), you can add another column labeled "batch" to the table above and the software will perform batch correction.

Run cellranger aggr without normalization (--normalize=none), because Seurat will perform the normalization and requires un-normalized counts.

cellranger aggr --id=mouseNaive_epiblast \
--csv=data.csv \
--normalize=none

Seurat

Use the filtered_feature_bc_matrix directory generated in the step above as input to Seurat. The filtered matrix is used because it contains only the detected cell barcodes, without the background.

Follow the steps in the R file seurat_scRNAseq.R and view the example with the outputs here.