README

SToRM - Seed-based Read Mapping tool for SOLiD or Illumina sequencing data

[ABOUT]

 SToRM is a read mapping tool designed first for the SOLiD technology, but
 also available now for the classical Illumina technology. First, SToRM is 
 based on seeding techniques adapted to the statistical characteristics of
 reads: the seeds and the alignment algorithms are designed to comply with
 the properties of the SOLiD color encoding or Illumina substitution model 
 as well as the observed reading error distribution along the read.

 More information at  http://bioinfo.cristal.univ-lille.fr/storm/storm.php


[CONTENTS OF THIS ARCHIVE]

    README
    Makefile
    *.c, *.h - source code
    seeds/   - several seed families adapted for SOLiD/Illumina reads


[COMPILING THE SOURCES]

 Requirements: 
    gcc (>=4.2), OpenMP support, Linux or MacOS

 Command line: 
    "make" / "make storm-color" / "make storm-nucleotide" (multithreaded)


[RUNNING SToRM]

  Command line: 
    "./storm -M 1 -g <path_2_reference.fas> -r <path_2_reads.fastq> [opts]"
    "./storm -h" for full parameter list

   A classic command line:
    "./storm-color      -M 1 -g reference.fas -r reads.cs -o out.sam"
    "./storm-nucleotide -M 1 -g reference.fas -r reads.fq -o out.sam"

  Advanced command line: 

    for "faster" /or/ "more sensitive" mapping :

     [0] obviously, use -M <int> to have all mapable reads being mapped !!

         Otherwise,  only the best read is kept per reference position,
         which implies that not all the reads (duplicate, lower quality
         ones at the same reference position) are mapped.

     [1] use "larger" /or/ "smaller" scoring thresholds
      ...
      -z 150 -t 150
      -z 120 -t 120
      -z 100 -t 100
      ...
      By default, setting the same value for both -z/-t is a good idea,
      unless you have ">15 indels" or "SOLiD color correction" reasons.
      ...
      -z 105 -t 120
      -z 85 -t 100
      -z 75 -t 90
      ...
     
      note that E-value "may be somehow" checked with the ALP tool:
         http://www.ncbi.nlm.nih.gov/CBBresearch/Spouge/html_ncbi/html/software/program.html?uid=6


     [2] "increase" /or/ "decrease" the number of indels :
      ...
      -i 1
      -i 3 (default value)
      -i 7
      -i 15 
         for full SIMD support.
      ...
      -i 20
      ...
         for final alignment support only : in that case, the SIMD code
         generates a warning explaining that it will compute at most 15
         diagonals : you can thus change the SIMD threshold -z <int> to
         a smaller value than the final alignment threshold -t <int>.
         (see [1] before)

     [3] use "less" /or/ "more" seeds of "large" /or/ "small" weight:  
      ...
      -s "####-##-#-####;###-#--##-##--###;####-#---#----#--####"
      -s "##-##-###-###;###--#-#--#--####;####--#--#---#---###
      ...

      note first that iedera can help you in the seed design :
         http://bioinfo.cristal.univ-lille.fr/yass/iedera.php
             ./iedera -spaced -n 3 -w 11,11 -s 11,22 -r 10000 -k -l 40
             ./iedera -spaced -n 3 -w 10,10 -s 10,20 -r 10000 -k -l 40
      => you can send requests to Laurent Noé for specific seeding

      note also that "chain processing" can be now applied in SToRM :
        unmmaped reads from the first step can be mapped in extra steps
        ...
        "./storm-nucleotide -M 1 -s <...   big seed(s) ...> -g ref.fas -r reads.fq -o out.sam -O reads2.fq"
        "./storm-nucleotide -M 1 -s <... small seed(s) ...> -g ref.fas -r reads2.fq -o out2.sam -O reads3.fq"
        ...

        e.g. an example for single seed:
          "###-##--#-##-#-###" (weight 12)
               -then-> 
          "####-#-#--##-###"   (weight 11)
               -then->
          "#####-##-###"       (weight 10)
            
        theses 3 seeds are generated with :

          ./iedera -spaced -l 50 -w 12,12 -s 12,24 -r 10000 -k   
             ->  "###-##--#-##-#-###" 
          ./iedera -spaced -l 45 -w 11,11 -s 11,22 -r 10000 -k -mx "###-##--#-##-#-###"
             ->  "####-#-#--##-###"
          ./iedera -spaced -l 40 -w 10,10 -s 10,20 -r 10000 -k -mx "###-##--#-##-#-###,####-#-#--##-###"
             ->  "#####-##-###"

     [4] "use" the -l option with smaller value (e.g. -l 10) /or/ "use"
          with bigger value (e.g. -l 1000)/"disable" with "<= 0" value

     [5] check your quality and change the -b 33 to -b 0 to disable the 
         score scaling algorithm, or to -b 64 (for Illumina 1.5 to 1.7) 
         accordingly ...


[READING THE RESULT]

 SToRM generates its output in the SAM format. To visualise the result, we 
 recommend use of the SAMtools package (http://samtools.sourceforge.net/ - http://www.htslib.org/).

 For both a SToRM output file "out.sam" and a reference file "ref.fas", the 
 samtools command lines usually are:


     samtools faidx ref.fas

     samtools view -bT ref.fas out.sam > out.bam

     ##
     ## Only when needed, merge several bam files into one single : 
     samtools merge -f out.bam   out{1}.bam out{2}.bam ... out{42}.bam

     ##
     ## Only if the "-M <int>" option is set: the "out.sam"/"out.bam"
     ## are "read sorted" and not "ref sorted" in that case, so : 
     samtools sort out.bam -o out_sorted.bam

     samtools index out_sorted.bam

     samtools tview out_sorted.bam ref.fas


 Please refer to the SAMtools documentation for more usage details.