Skip to content

1. Introduction

GDelevoye edited this page Nov 13, 2022 · 1 revision

Introduction

The CIGAR string

  • The .sam file format is a widely-spread standard in bioinformatics to store aligned NGS reads.

    • It is tab-separated
    • Its full description can be found here.
  • The .sam files are usually too big to be stored easily

    • Their counter-part binary and compressed .bam format is often used instead.
    • The tool "samtools view" allows to make the .bam->.sam and .sam->.bam conversions
  • The 5th field of the file stores what is called a "CIGAR string" (see 1, 2, 3 for details)

    • It corresponds to how the read mapped on the reference genome.

BAM_presentation

In this CIGAR, all 466 nucleotides of the read match with the reference genome.

The CIGAR string is composed of several groups of contiguously identical types of alignment :

  • "=" stands for a group of matches
  • "X" stands for a group of mismatches (substitutions)
  • "M" often stands for "group of matches" too, but on old files it is sometimes used as "Mismatch". Its real meaning should really just be "Misleading". However, you have to take account of this in some files, which is annoying.
  • "D" stands for deletions
  • "I" stands for insertions
  • "N" is generally used to indicate that a section is missing in the sequence and present in the reference genome, also called "spliced sequence". Often, it is distinct from a simple deletion/gap because it corresponds to an intron. It mostly used in cDNA datasets
  • "P" stands for padding. Padded alignment is always produced by de novo assemblers and can be considered as a silent deletion from padded reference sequence. It describes the alignment of inserted sequences.
  • "S" stands for "soft-clipping". In Smith-Waterman alignment, a sequence may not be aligned from the first residue to the last one. Subsequences at the ends may be clipped off. These groups are usually at the extremity of the read, but some aligners can also produce them in the core of the sequence.
  • "H" : Hard-clipping. When a sequence that is aligned with soft-clipping has secondary matches elsewhere, the soft-clipping is not reported anymore. The clipped nucleotides are removed from the sequence in the secondary alignments, pure and simple. To tell bioinformaticians that these sequences were removed, the programs usually add this "Hard-clipping" in the CIGAR. It is often considers as a bad practice / premature optimization, and leads to unpleasant difficulties when interpreting the alignment.

Bioinformaticians may want to perform many tasks based on the CIGAR string :

  • Compute a custom percentage of identity (PID) between the read and the reference
  • Filter the reads with a number of gaps, insertions, substitutions or deletions that is superior to a given threshold
  • Extract only the reads that map on at least one intro of a size superior to a given threshold
  • ...

In general, they would like to iterate over the groups and perform more or less complex operations, on (litteraly) billions of lines.

State of the art

Today there are several ways to do so :

  1. Tricky combinations of awk, grep, sort, cat, regex, along with samtools view

  2. Custom Python/Perl/R scripts or Java programs, often along with samtools view

  3. Use an API to read the .bam files programmatically :

Problematic

The limits/flaws of existing tools

  1. Bash, awk-based or regex-based solutions should be banned because they are :

    • Just copy-pasted from stack-overflow or biostars
    • Difficult to read (and therefore to adapt/modify)
    • Error-prone
    • Slow
    • Often not cross-platform
  2. Python/R/Perl scripts are more versatile but can be too slow for complex analysis on big datasets (>100Gb).

  3. Some interesting tools already in java, but my impression (perhaps wrong) is that most bioinformaticians don't code in Java.

  4. C/C++ API like HTSlib are not well documented, and not easy to use/distribute.

What about PySam ?

During their quest of filtering a .bam file by CIGAR string, most bioinformaticians will eventually give PySam a try.

At first, PySam seems like a great idea at first to read a .bam file. Because it is mostly "C code called by python", everyone expects it to run fast. But it turns out to be disapointing in practice. Why ?

I can think of two reasons :

  1. Too much python code wraps the C/C++ calls. I don't actually think that it is the main reason
  2. Reading a file from the hard drive is IO-bonded, not CPU bonded.
    • Most of CPU time actually consists in waiting for the hard drive.
    • If you want to improve the reading performances, you must rather use huge buffers and not flush it too often, to read several hundreds of lines at the time. That is, from my understanding, the exact opposite of how pySam is designed.
  3. While reading a .bam file is not "that" CPU-intensive, parse and filter a CIGAR string does require intensive computations, something that python cannot handle very fast. If any part of a pipeline must be coded in C, it should not be the reading of the .bam file but rather the filtering of the CIGAR string.

What other (faster) solution can we find ?

In fact, it is (to date) nearly impossible to have anything faster than samtools view to read a .bam file. If you want to speed up samtools view, most of times you'd just have to buy a faster hard drive. Unsing an API for this task would be just overkill (especially since parsing a CIGAR is quite easy to implement).

The only thing you can do to speed things would be to use C/C++ (or another fast language like Rust) for the filtering part. Very few people "speak" Rust in the bioinformatics community. Many, however, can code basic functions in C/C++.

Using plain C/C++ would be optimal, but it means that bioinformaticians must invest much time and energy in a C++ code they can even build any relevant filter. Unfortunately, most of us face hard deadlines, and hungry bosses or collaborators that want the job done by the end of the day (or if you are lucky; by the end of the week). We don't have time to lose with setting up a building chain, distribute our code on a cluster, etc

The main (only ?) purpose of cigarfilter is to lower the threshold of difficulty so that even [very] occasional C/C++ programmers can code super efficient filters for their .bam files in no time.