Version 2.0.0
armintoepfer committed Sep 23, 2020
1 parent bad52ab commit fe09b8f
Showing 1 changed file with 48 additions and 11 deletions.
59 changes: 48 additions & 11 deletions README.md
@@ -64,19 +64,19 @@ The sort order is defined by the barcode indices, lowest first.

*Lima* offers the following features:
* Process both CLR subreads and CCS reads
* BAM in- and output
* BAM, FASTA, FASTQ in- and output
* Extensive reports that allow in-depth quality control
* Clip barcode sequences and annotate `bq` and `bc` tags
* Agnostic of input barcode sequence orientation
* Split output BAM files by barcode
* Split output files by barcode
* Full PacBio dataset support
* Peek into the first N ZMWs and get average barcode score
* Guess the subset of barcodes used in an input Barcode Set given a mean barcode score threshold
* Enhanced filtering options to remove ambiguous calls
* Double demux to remove PCR primers after barcode demultiplexing

## Latest Version
Version **1.11.0**: [Full changelog here](#full-changelog)
Version **2.0.0**: [Full changelog here](#full-changelog)

## Execution

@@ -86,13 +86,13 @@ Version **1.11.0**: [Full changelog here](#full-changelog)

Run on CLR subread data:

lima movie.subreads.bam barcodes.fasta prefix.bam
lima movie.subreadset.xml barcodes.barcodeset.xml prefix.subreadset.xml
$ lima movie.subreads.bam barcodes.fasta prefix.bam
$ lima movie.subreadset.xml barcodes.barcodeset.xml prefix.subreadset.xml

Run on CCS data:

lima --ccs movie.ccs.bam barcodes.fasta prefix.bam
lima --ccs movie.consensusreadset.xml barcodes.barcodeset.xml prefix.consensusreadset.xml
$ lima --ccs movie.ccs.bam barcodes.fasta prefix.bam
$ lima --ccs movie.consensusreadset.xml barcodes.barcodeset.xml prefix.consensusreadset.xml

If you do not need to import the demultiplexed data into SMRT Link, it is advised
to use `--no-pbi`, omit the pbi index file, to minimize time to result.
@@ -109,8 +109,8 @@ to use `--no-pbi`, omit the pbi index file, to minimize time to result.

### Example execution

lima m54317_180718_075644.subreadset.xml Sequel_RSII_384_barcodes_v1.barcodeset.xml \
m54317_180718_075644.demux.subreadset.xml --different --peek-guess
$ lima m54317_180718_075644.subreadset.xml Sequel_RSII_384_barcodes_v1.barcodeset.xml \
m54317_180718_075644.demux.subreadset.xml --different --peek-guess


## Input data
@@ -119,6 +119,8 @@ unaligned CCS reads, generated by [CCS](https://github.com/PacificBiosciences/cc
both in the PacBio enhanced BAM format. If you want to demux RSII data, first
use SMRT Link or bax2bam to convert h5 to BAM. In addition, a `datastore.json`
with one file entry, either a SubreadSet or ConsensusReadSet, is also allowed.
CCS reads are also supported as FASTA or FASTQ input, optionally
gzipped.

Barcodes are provided as a FASTA file, one entry per barcode sequence,
**no duplicate** sequences, only upper-case bases,
@@ -159,14 +161,46 @@ prefix as the output file, omitting suffixes `.bam`, `.subreadset.xml`, and
`.consensusreadset.xml`. The report infix is `lima`.
Example:

lima m54007_170702_064558.subreads.bam barcode.fasta /my/path/m54007_170702_064558_demux.subreadset.xml
$ lima m54007_170702_064558.subreads.bam barcode.fasta /my/path/m54007_170702_064558_demux.subreadset.xml

For all output files, the prefix will be `/my/path/m54007_170702_064558_demux.`

### BAM
The first file `prefix.bam` contains clipped records, annotated with
barcode tags, that passed filters.

### FASTA/Q
Alternatively, if the output file is FASTA or FASTQ, the header of each sequence
contains all tags that would be present in the BAM format, separated by
single spaces. Example FASTQ header:

@m54006_171006_044150/4588126/ccs bc=3,3 bl=CGCGCGTGTGTGCGTG bq=100 bt=CGCGCGTGTGTGCGTG bx=16,16 cx=12 qe=2235 ql=p\tttropqorrtnnH qs=16 qt=G^\IGR]K8S>>^\^p
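
The header layout above can be read back into a tag dictionary. The following is a minimal sketch, not part of lima: the function name `parse_lima_header` is hypothetical, and it assumes fields are separated by single spaces and that each tag has the form `key=value`, as in the example header. Values are split on the first `=` only, because quality strings may themselves contain `=`.

```python
def parse_lima_header(header: str) -> tuple[str, dict[str, str]]:
    """Split a FASTA/FASTQ header into the read name and a tag dictionary.

    Hypothetical helper; assumes single-space separators and key=value tags.
    """
    # Drop the leading '@' (FASTQ) or '>' (FASTA) marker if present.
    if header.startswith(("@", ">")):
        header = header[1:]
    fields = header.split(" ")
    name = fields[0]
    tags: dict[str, str] = {}
    for field in fields[1:]:
        # Split on the first '=' only; values may contain '=' themselves.
        key, sep, value = field.partition("=")
        if sep:
            tags[key] = value
    return name, tags


# Raw string, so the backslashes in the example header are kept literally.
example = (
    r"@m54006_171006_044150/4588126/ccs bc=3,3 bl=CGCGCGTGTGTGCGTG bq=100 "
    r"bt=CGCGCGTGTGTGCGTG bx=16,16 cx=12 qe=2235 ql=p\tttropqorrtnnH qs=16 "
    r"qt=G^\IGR]K8S>>^\^p"
)
name, tags = parse_lima_header(example)
print(name)
print(tags["bc"], tags["bq"])
```

Keeping the values as strings avoids guessing at tag types; callers can convert `bq` or `bc` to integers where they need them.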

### In- and output compatibility matrix:

For CLR data, only XML and BAM are valid in- and output file types.

For CCS data, use the following compatibility matrix:

| In/Out | XML | BAM | FASTQ | FASTA |
| ------ | :-: | :-: | :---: | :---: |
| XML | YES | YES | YES | YES |
| BAM | YES | YES | YES | YES |
| FASTQ | no | no | YES | YES |
| FASTA | no | no | no | YES |

This means you can use CCS FASTQ reads as input and FASTA as output, but
not BAM as output.

Working example:

$ lima movie.Q20.fastq Sequel_RSII_384_barcodes_v1.fasta demuxed.fastq --same

Failing example:

$ lima movie.Q20.fastq Sequel_RSII_384_barcodes_v1.fasta demuxed.bam --same
FATAL -|- Unsupported combination of FASTQ input and BAM output.
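
For scripts that wrap lima, the matrix can be checked before a run is started. This is a minimal sketch that only mirrors the table and the CLR rule stated above; it is not lima's own validation code, and the names `CCS_COMPATIBILITY`, `CLR_TYPES`, and `is_supported` are hypothetical.

```python
# Keys are input types, values are the allowed output types for CCS data,
# copied row by row from the compatibility matrix.
CCS_COMPATIBILITY = {
    "XML": {"XML", "BAM", "FASTQ", "FASTA"},
    "BAM": {"XML", "BAM", "FASTQ", "FASTA"},
    "FASTQ": {"FASTQ", "FASTA"},
    "FASTA": {"FASTA"},
}

# For CLR data, only XML and BAM are valid as input and as output.
CLR_TYPES = {"XML", "BAM"}


def is_supported(in_type: str, out_type: str, ccs: bool = True) -> bool:
    """Return True if the input/output combination appears as YES in the matrix."""
    if not ccs:
        return in_type in CLR_TYPES and out_type in CLR_TYPES
    return out_type in CCS_COMPATIBILITY.get(in_type, set())


print(is_supported("FASTQ", "FASTA"))  # the working direction described above
print(is_supported("FASTQ", "BAM"))    # the failing example above
```

The lookup treats any type not listed in the table as unsupported rather than raising an error.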

### Report
The second file is `prefix.lima.report`, a tab-separated file about each ZMW, unfiltered.
This report contains any information necessary to investigate the demultiplexing
@@ -1069,7 +1103,10 @@ any parameters now, but worth future investigation.

## Full Changelog

* **1.11.0**:
* **2.0.0**:
* Add support for FASTA and FASTQ
* Fix `-k` with by-strand HiFi reads
* 1.11.0:
* Add barcode to read groups, use one barcode pair per RG
* Fix double demux, used to clip wrongly for the second round of demuxing
* 1.10.0: