Skip to content

Commit

Permalink
Version 1.11.0
Browse files Browse the repository at this point in the history
  • Loading branch information
armintoepfer committed Feb 18, 2020
1 parent 1aa407e commit d49e93f
Show file tree
Hide file tree
Showing 6 changed files with 76 additions and 20 deletions.
96 changes: 76 additions & 20 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,7 @@ by PCR or ligation:
In addition, there are three different barcode library designs.
In order to describe a barcode library design, one can view it
from a SMRTbell or read perspective.
As *lima* supports raw subread and CCS read demultiplexing,
As *lima* supports CLR subread and CCS read demultiplexing,
the following terminology is based on the per (sub-)read view.

<img src="img/barcode_overview.png" width="886px">
Expand Down Expand Up @@ -63,28 +63,28 @@ The sort order is defined by the barcode indices, lowest first.
## Features

*Lima* offers the following features:
* Process both, raw subreads and CCS reads
* Process both, CLR subreads and CCS reads
* BAM in- and output
* Extensive reports that allow in-depth quality control
* Clip barcode sequences and annotate `bq` and `bc` tags
* Agnostic of input barcode sequence orientation
* Split output BAM files by barcode
* No scraps.bam needed
* Full PacBio dataset support
* Peek into the first N ZMWs and get average barcode score
* Guess the subset of barcodes used in an input Barcode Set given a mean barcode score threshold
* Enhanced filtering options to remove ambiguous calls
* Double demux to remove PCR primers after barcode demultiplexing

## Latest Version
Version **1.10.0**: [Full changelog here](#full-changelog)
Version **1.11.0**: [Full changelog here](#full-changelog)

## Execution

**Note:** Any existing output files will be overwritten after execution.

**Note:** Always use `--peek-guess` to remove spurious barcode hits.

Run on raw subread data:
Run on CLR subread data:

lima movie.subreads.bam barcodes.fasta prefix.bam
lima movie.subreadset.xml barcodes.barcodeset.xml prefix.subreadset.xml
Expand All @@ -99,12 +99,12 @@ to use `--no-pbi`, omit the pbi index file, to minimize time to result.

### *Symmetric* or *Tailed* options

Raw: --same
CLR: --same
CCS: --same --ccs

### *Asymmetric* options

Raw: --different
CLR: --different
CCS: --different --ccs

### Example execution
Expand All @@ -114,7 +114,7 @@ to use `--no-pbi`, omit the pbi index file, to minimize time to result.


## Input data
Input data is either raw unaligned subreads, straight from a Sequel, or
Input data is either CLR unaligned subreads, straight from a Sequel I/II, or
unaligned CCS reads, generated by [CCS](https://github.com/PacificBiosciences/ccs);
both in the PacBio enhanced BAM format. If you want to demux RSII data, first
use SMRT Link or bax2bam to convert h5 to BAM. In addition, a `datastore.json`
Expand Down Expand Up @@ -189,24 +189,24 @@ how ZMWs many are *same/different*, and how many reads have been filtered.
Below min passes : 0 (0%)
Below min score lead : 11656 (32%)
Below min ref span : 3124 (8%)
Without adapter : 25094 (68%)
Without SMRTbell adapter : 25094 (68%)
With bad adapter : 10349 (28%) <- Only with --bad-adapter-ratio
Undesired hybrids : xxx (xx%) <- Only with --peek-guess
Undesired same barcode pairs : xxx (xx%) <- Only with --different
Undesired diff barcode pairs : xxx (xx%) <- Only with --same
Undesired same pairs : xxx (xx%) <- Only with --different
Undesired diff pairs : xxx (xx%) <- Only with --same
Undesired 5p--5p pairs : xxx (xx%) <- Only with --isoseq
Undesired 3p--3p pairs : xxx (xx%) <- Only with --isoseq
Undesired single side : xxx (xx%) <- Only with --isoseq
Undesired no hit : xxx (xx%) <- Only with --isoseq

ZMWs for (B):
With same barcode : 162244 (92%)
With different barcodes : 14112 (8%)
With same pair : 162244 (92%)
With different pair : 14112 (8%)
Coefficient of correlation : 32.79%

ZMWs for (A):
Allow diff barcode pair : 157264 (74%)
Allow same barcode pair : 188026 (88%)
Allow diff pair : 157264 (74%)
Allow same pair : 188026 (88%)
Bad adapter yield loss : 10112 (5%) <- Only with --bad-adapter-ratio
Bad adapter impurity : 10348 (5%) <- Only without --bad-adapter-ratio

Expand Down Expand Up @@ -446,7 +446,7 @@ Only use reads flanked by adapters on both sides for barcode identification,
full-pass reads.

### `--keep-idx-order`
Per default, the two identified barcode idx are sorted ascending, as in raw data,
Per default, the two identified barcode idx are sorted ascending, as in CLR data,
the correct order cannot be determined. This affects the `bc` tag, `prefix.counts`
file, and `--split-bam` file names; `prefix.report` columns `IdxLowest`,
`IdxHighest`, `IdxLowestNamed`, `IdxHighestNamed` will have the same order as
Expand All @@ -455,7 +455,7 @@ such as CCS.

If you are using an asymmetric barcode design with `NxN` pairs
and your input is CCS, you can use `--keep-idx-order` to preserve
the order. If your input is raw subreads and you use `NxN` asymmetric pairs,
the order. If your input is CLR subreads and you use `NxN` asymmetric pairs,
there is no way to distinguish between pairs `bc1001--bc1002` and `bc1002--bc1001`.

### `--per-read`
Expand Down Expand Up @@ -970,7 +970,7 @@ The score lead measures how close the best barcode call is to the second best.
Possible solutions without seeing your data:
* Is that sample actually barcoded?
* Are your barcode sequences genetically too close for SMRT sequencing?
Try CCS2 calling first and demultiplex with `--ccs`.
Try CCS calling first and demultiplex with `--ccs`.
* Are the synthesized products clean and not degenerate?
* Did the sequencing run perform optimally, is the accuracy in the expected range?
* Did you run lima twice, first on the original and then on the already
Expand All @@ -982,7 +982,7 @@ false positives.

### What is different in *lima* to *bam2bam*?
* CCS read support
* Barcodes of every adapter gets scored for raw subreads
* Barcodes of every adapter gets scored for CLR subreads
* Does not enforce symmetric barcode pairing, which increases PPV
* For asymmetric barcodes, `lima` can report the identified order, instead of
ascending sorting
Expand Down Expand Up @@ -1013,10 +1013,66 @@ then your XML input contains BioSamples with different barcode names than the
provided `barcode.fasta` file. Please check that you've used the correct
barcodes. You can ignore barcodes specified in the XML with `--ignore-biosamples`.

### CCS or demux first?
Many people have been wondering, what is the recommended order for a multiplexed
HiFi pool:
1) first ccs and then demux
2) first demux and then ccs

#### Experiment
Use 2k ecoli amplicons with barcoded overhang adapters, symmetric. Workflow steps:
1) Generate CCS
2) Demuxe subreads and whiteliste on CCS hole numbers
3) Demuxe CCS
4) Compare both sets of hole numbers

#### Results
Verbatim results for one chip:

Generated CCS reads : 274185
Demuxed CCS reads : 269919 (98.44%)
Demuxed subreads : 271068 (98.86%)

Venn diagrams for two chips:

<img src="img/venn_diagramm_1.png" width="400px">
<img src="img/venn_diagramm_2.png" width="400px">

Just based on those numbers, one would say, pick subread demuxing.
Here comes the but. Demuxing subreads is very IO heavy and takes ~100x longer
than demuxing CCS.
For the sake of time to result and disk space,
**perform CCS first and demux afterwards**.

#### Discussion
Q: Is there any systematic reason for reads that get correctly called by subread demux but not ccs or vice versa?

Let's plot subread barcode scores, grouped by if they were only called in subreads (blue) or not (red)
<img src="img/subread_only_scores.png" width="600px">

Majority of what is subread output only is on the verge of being called at all.
The problem with the current CCS draft stage is that it sometimes trims a few
bases, which is generally not a big issue for demuxing, but if the barcode is
molecularly damaged, too short or of low quality, a few missing bases lead to
being uncallable.

Vice versa, only called by CCS and not in subreads:
<img src="img/ccs_only_scores.png" width="600px">

Again something that is on the verge being called. The reason for the ~300 reads
at 100 score, no idea so far. In general, this is 0.1% of the data.
Let's investigate those ~300 calls and plot their subread demux barcode scores.
<img src="img/ccs_only_subread_scores.png" width="600px">

It's curious why they didn't get called, but for 0.1% not worth changing
any parameters now, but worth future investigation.

## Full Changelog

* **1.10.0**:
* **1.11.0**:
* Add barcode to read groups, use one barcode pair per RG
* Fix double demux, used to clip wrongly for the second round of demuxing
* 1.10.0:
* Output N barcodes per subdirectory with `--files-per-directory N` and output splitting
* BioSample awareness for XML input and split output and allow ignoring them with `--ignore-biosamples`
* Increase `--window-size-mult` to `3` to allow longer spacers
Expand Down
Binary file added img/ccs_only_scores.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/ccs_only_subread_scores.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/subread_only_scores.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/venn_diagramm_1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/venn_diagramm_2.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit d49e93f

Please sign in to comment.