README

RIBOSOME PROFILING ANALYSIS FRAMEWORK; de Klerk E., Fokkema I.F.A.C et al (2015).
FOOTPRINT ALIGNMENT, TRIPLET PERIODICITY ANALYSIS, PEAK CALLING.
================================================================================

  This work is licensed under the Creative Commons
  Attribution-NonCommercial-ShareAlike 4.0 International License. To view a
  copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/
  or send a letter to:
  Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

  The latest version of the scripts and this README file are available at:
  http://lumc.github.io/ribosome-profiling-analysis-framework/

================================================================================
WORKFLOW
================================================================================

This framework has been developed on Linux and tested on various Linux systems.
It may run on Windows, but this is not supported. Several provided examples for
running the scripts on a large number of files are written in Bash and will
therefore not run on Windows without additional software, like Cygwin.
This framework requires the installation of PHP-CLI (Command Line Interface),
available from http://php.net/.
This framework has been developed and tested using PHP 5.4 and up, but may work
with earlier versions.


OVERVIEW OF THE FRAMEWORK
--------------------------------------------------------------------------------
- Alignment of ribosome footprint reads
- Analysis of read length distribution of ribosome footprints
- Creating wiggle files with genomic coordinates from transcriptome alignment
- Merging wiggle files from transcriptome and genome alignments
- Retrieving coordinates relative to the annotated open reading frames (ORFs)
  using Mutalyzer
- Analysis of triplet periodicity
- Merging biological replicates prior to peak calling (identification of
  translation initiation sites (TISs))
- Peak calling: identification of primary ORFs, alternative ORFs and upstream
  ORFs
- Merging result files
- Analysis of switches in TIS usage
- TIS location and motif analysis


SAMPLE NAMING
--------------------------------------------------------------------------------
Throughout the documentation of this framework, we provide example commands.
The examples provided here are from the study of de Klerk E., Fokkema I.F.A.C.
et al. (2015), in which two different treatments were performed (harringtonin
and cycloheximide) in two different timepoints, with three biological
replicates. Different treatments or timepoints are named A to Z, with replicates
named in numbers (1 to 9). We recommend using similar, or equal, sample naming.
This framework will read the sample names from the beginning of the file names,
if separated with an underscore or period.
Examples of possible FASTQ file names:
A1_pool7213_TATAT_L002_R1_001.fastq (sample name: A1)
D3.fastq (sample name: D3)
I2_ACACAC_L008_R1_001.fastq (sample name: I2)


REFERENCE SEQUENCES
--------------------------------------------------------------------------------
This framework is based on the Mus Musculus mm10 genome assembly but it can also
be adapted for the Homo Sapiens hg19 and hg38 genome assemblies. Other
assemblies are currently not supported due to their absence in the Mutalyzer
tool, which is used to translate genomic coordinates into transcriptomic
coordinates and vice versa. See https://v2.mutalyzer.nl for more information.

The transcriptome reference sequence was downloaded from
ftp://ftp.ncbi.nlm.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.rna.fna.gz
Last modified date: 2013-05-08.

If mouse.rna.fna.gz is not available, you can download the separate files (named
mouse.1.rna.fna.gz, mouse.2.rna.fna.gz and so on) and combine them using bash
with:

for file in mouse.?.rna.fna.gz; do cat $file >> mouse.rna.fna.gz; done

The genome reference sequence was downloaded from
ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/chromosomes/
Last modified date: 2012-02-09.


ALIGNMENT OF RIBOSOME FOOTPRINT READS
--------------------------------------------------------------------------------
To prevent loss of reads or misalignment of short reads (30 nt) adjacent or
spanning an exon-exon junction, reads are first mapped to the transcriptome
reference, and unaligned reads are subsequently mapped to the genome reference.
For more information about the alignment method, please see de Klerk E., Fokkema
I.F.A.C. et al (2015).

1) The SAM file obtained from the transcriptome alignment reports positions of
mapped reads relative to the start of the transcript. To be able to merge the
results of the genome and the transcriptome alignment these positions need to be
converted to genomic positions. Transcriptome sequences do not always properly
align to the genome, for instance due to the presence of insertions, deletions,
chimeric alignments, or they align to random chromosomes. Because the
coordinates of the reads mapped to the transcriptome reference cannot be
correctly converted to genomic coordinates for transcripts that do not properly
align, first the transcriptome reference needs to be filtered for these
transcripts. For the remaining transcripts, the
mm10_transcript_positions_create.php script is used to create a file that
indicates the location of the exons of the transcripts on the genome.
For more information on this script and the alignment, please see the help file:
help/mm10_transcript_positions_create.txt.

Usage:
./mm10_transcript_positions_create.php TRANSCRIPTOME_ALIGNMENT_SAM_FILE

Produces:
- mm10_transcript_positions.txt
  This file contains the genomic coordinates of all exons for transcripts that
  align to the genome, excluding random chromosomes, without insertions or
  deletions or chimeric alignment. These can therefore be used for the
  transcriptome alignment.
- transcriptome_alignment_unsupported_transcripts.txt
  Contains all unsupported transcripts and the reasons for rejection.


2) The unsupported transcripts are then removed from the transcriptome
reference, using the following bash code:

cut -f 1 transcriptome_alignment_unsupported_transcripts.txt | grep -v '^#' \
    > transcriptome_alignment_unsupported_transcripts_list.txt
tr '\n' '+' < mouse.rna.fna | sed 's/+>/\n>/g' \
    | grep -vFf transcriptome_alignment_unsupported_transcripts_list.txt \
    | tr '+' '\n' > mouse.rna.fna.no_unsupported.fa

The resulting file (mouse.rna.fna.no_unsupported.fa) is then used to create a
Bowtie index file suitable for alignment using bowtie.


3) Before alignment, the adapter sequences need to be removed from the reads.
There are various tools available which do this, such as cutadapt.
Make sure that after this step, your files are named *.trunc.


4) The following set of commands runs Bowtie to align the reads to the
transcriptome reference, store the unaligned reads, and finally align the
unaligned reads to the genome reference.
For the alignments on the transcriptome, only the forward mapped reads are
selected, and then the SAM files are filtered, requiring a minimum of 25
nucleotides per read. SAM files obtained from the alignment on the genome are
also filtered, requiring a minimum of 25 nucleotides per read. SAM files from
the genome alignment are converted to BAM files, which are then sorted and
converted into mpileup files. The mpileup files are then converted to 5' end
wiggle files, one for each strand. For the creation of 5' end wiggle files from
the genome alignment, use the SageWiggle.py script included in this framework.

In the commands below, replace "mouse.rna.fa.no_unsupported.fa.bowtieindex" with
the correct location of the bowtie index file of your transcriptome reference
sequence file, filtered for unsupported transcripts (see step 2).
Replace "mm10.reference.fa.bowtieindex" with the correct location of the
bowtie index file of your genome reference sequence file.


for file in *.trunc; do bowtie -k 1 -m 20 -n 1 --best --strata -p 8 -S --chunkmbs 2000 --norc --un "${file}.transcriptome_unaligned" mouse.rna.fa.no_unsupported.fa.bowtieindex "$file" "${file}.transcriptome_aligned.sam" 2>> results.transcriptome_alignment.txt; done;
for file in *.transcriptome_unaligned; do bowtie -k 1 -m 2 -n 1 -p 8 -S --best --strata --chunkmbs 2000 mm10.reference.fa.bowtieindex "$file" "${file}.genome_aligned.sam" 2>> results.transcriptome_unaligned.genome_alignment_mm10.txt; done;
for file in *.transcriptome_unaligned.genome_aligned.sam; do echo $file; head -n 100 "$file" | grep "^@" > "${file}.M25.sam"; awk '($6 > 25M)' "$file" >> "${file}.M25.sam"; done;
for file in *.transcriptome_unaligned.genome_aligned.sam.M25.sam; do echo $file; samtools view -bS "$file" > "${file}.bam"; done;
for file in *.transcriptome_unaligned.genome_aligned.sam.M25.sam.bam; do echo $file; samtools sort "$file" "${file}.sorted"; done;
for file in *.transcriptome_unaligned.genome_aligned.sam.M25.sam.bam.sorted.bam; do echo $file; samtools index "$file"; done;
for file in *.transcriptome_unaligned.genome_aligned.sam.M25.sam.bam.sorted.bam; do echo $file; samtools mpileup -d 10000000 "$file" > "${file}.mpileup"; done;
for file in *.transcriptome_unaligned.genome_aligned.sam.M25.sam.bam.sorted.bam.mpileup; do echo $file; python sageWiggle.py -i "$file" -o "${file}.F.wig5" "${file}.R.wig5"; done;
for file in *.transcriptome_aligned.sam; do echo $file; awk '($2 == 0)' "$file" > "${file}.mappedforward.sam"; done;
for file in *.transcriptome_aligned.sam.mappedforward.sam; do echo $file; awk '($6 > 25M)' "$file" > "${file}.M25.sam"; done;
for file in *.transcriptome_aligned.sam.mappedforward.sam.M25.sam; do echo $file; cut -f 1,3,4 "$file" > "${file}.3col.sam"; done


5) Multiple scripts have transcript names but need either strand or gene
information, or vice versa. To provide this, download a file from the UCSC table
browser, listing the transcripts, the strand and the genes.

To build this list:
- Open the UCSC table browser: http://genome.ucsc.edu/cgi-bin/hgTables
- Select genome, assembly, track: "RefSeq Genes".
- Table: "refGene", output format: "selected fields from primary and related
  tables".
- Click "get output".
- On this next page, select "name", "strand" and "name2" in the field list, and
  click "get output".
- Save in a file and sort. On linux, this can be done with the sort command.
- In this README, this file is referred to as mm10_gene_list.txt.


ANALYSIS OF READ LENGTH DISTRIBUTION OF RIBOSOME FOOTPRINTS
--------------------------------------------------------------------------------
6) The SAM files generated by the alignment of the ribosome footprint reads can
be used to retreive information on the read length distribution. A simple bash
script that calculates the read length distribution for reads aligning to the
transcriptome (NM and NR separately) and reads aligning to the genome, is
described in help/read_length_distribution.txt.

The read length distribution can also be calculated per gene. For that, use the
get_read_length_per_gene.php script. The script takes gene IDs in an input file
and determines the read length distribution for its transcripts per sample, from
both transcriptome and genome alignment SAM files. It requires the
mm10_gene_list.txt file to see which transcripts belongs to which gene, the
mm10_transcript_positions.txt file to check the strand and the positions of the
transcripts, and the input SAM files.

Usage:
./get_read_length_per_gene.php FILE_WITH_GENES_TO_ANALYZE mm10_gene_list \
    mm10_transcript_positions.txt SAM_FILE [SAM_FILE [SAM_FILE [...]]]

Example:
./get_read_length_per_gene.php genes_to_analyze.txt mm10_gene_list.txt \
    mm10_transcript_positions.txt *.M25.sam

Produces:
- For each transcript of the given genes, one
  gene_transcript_read_length_distribution.txt file, with the read lengths in
  the first column, and the number of reads with this length per sample in the
  next columns.
- One summary file, ALL_ALL_read_length_distribution.txt, with all read lengths
  summed up.


This README will continue describing the steps to follow for triplet periodicity
analysis and ORF analysis (peak calling), regardless of whether the read length
distribution has been analyzed or not.


GROUPING READS FROM SAM FILES OBTAINED FROM TRANSCRIPTOME ALIGNMENT
AND BASIC STATISTICS
--------------------------------------------------------------------------------
7) Mapped reads in the SAM files are first grouped together to generate a file
that sums the coverage per transcript per position.
For this, use the pack_sam_files.php file.

Usage:
./pack_sam_files.php SAM_FILE [SAM_FILE [SAM_FILE [...]]]

Example:
./pack_sam_files.php *.transcriptome_aligned.*3col.sam

Produces:
- For each given SAM file, a SAM.packed file.
  Contains the transcript ID, the position relative to the start of the tran-
  script, and the total coverage for that position, i.e. the number of reads
  aligned starting at this position.


8) The packed SAM files can be used to run some statistics, using
packed_sam2coverage_per_gene.php.

Usage:
./packed_sam2coverage_per_gene.php mm10_gene_list.txt \
    PACKED_SAM_FILE [PACKED_SAM_FILE [...]]

Example:
./packed_sam2coverage_per_gene.php mm10_gene_list.txt *.packed

Produces:
- One file which name depends on the given SAM file(s), ending in the suffix
  .coverage_per_gene.txt, containing (per SAM file) which transcripts can not
  be matched to a gene, listing the top 5 unknown transcripts based on their
  coverage, and a table containing coverage per gene per sample. The
  genes are sorted on total coverage in all samples together.

Other statistics can be run directly on the packed SAM files, for instance how
many reads are mapped to non-coding reads.


CREATING WIGGLE FILES WITH GENOMIC COORDINATES FROM TRANSCRIPTOME ALIGNMENT
--------------------------------------------------------------------------------
9) The packed SAM files are used to generate wiggle files. For this, use the
packed_sam2wiggle.php script, which needs the transcript positions file created
in 1).

Usage:
./packed_sam2wiggle.php mm10_transcript_positions.txt \
    PACKED_SAM_FILE [PACKED_SAM_FILE [PACKED_SAM_FILE [...]]]

Example:
./packed_sam2wiggle.php mm10_transcript_positions.txt *.packed

Produces:
- 4 wiggle files per SAM file;
  + F unfiltered,
  + F filtered (NR and XR removed),
  + R unfiltered,
  + R filtered (NR and XR removed).

Unfiltered 5' wiggle files are used for visualization purposes only, whereas
filtered 5' wiggle files, containing positions of reads mapping only to coding
transcripts, are used for triplet periodicity analysis and peak calling.
Positions mapping to non-coding transcripts won't be processed with this
analysis framework.


MERGING WIGGLE FILES FROM TRANSCRIPTOME AND GENOME ALIGNMENTS
--------------------------------------------------------------------------------
10) The merging of the wiggle files from the alignment on the transcriptome and
the alignment on the genome is done using the "wiggelen" package written by
Martijn Vermaat and Jeroen Laros. The merge_wigglefiles.sh script is a wrapper
around the wiggelen package.
For detailed information, see the help file help/merge_wigglefiles.txt.

Usage:
./merge_wigglefiles.sh
(it should be run in the directory where all the wiggle files reside)

Produces:
- 4 wiggle files per sample, using the transcriptome and genome alignment wiggle
  files for each created file:
  + F unfiltered,
  + F filtered (NR and XR removed),
  + R unfiltered,
  + R filtered (NR and XR removed).

Unfiltered 5' wiggle files are used for visualization purposes only, whereas
filtered 5' wiggle files, containing positions of reads mapping only to coding
transcripts, are used for triplet periodicity analysis and peak calling.
Positions mapping to non-coding transcripts won't be processed with this
analysis framework.


RETRIEVING COORDINATES RELATIVE TO THE ANNOTATED OPEN READING FRAMES (ORFs)
USING MUTALYZER
--------------------------------------------------------------------------------
11) To translate the genomic coordinates to coordinates relative to the
annotated Translation Initiation Site (TIS), we use the Mutalyzer service.
Mutalyzer's position converter has a batch-feature that allows for the upload of
large files containing genomic postions. Through email you will be notified when
your batch job is done, after which you can download the result file.
Wiggle files are converted to Mutalyzer position converter batch files using the
wig2batchfile.php script. Please note that since Mutalyzer is a service aimed
at genetic variation, all positions are converted into genetic variants.
Batch files are created for the filtered 5' wiggle files only.

Usage:
./wig2batchfile.php WIGGLE_FILE

Example:
for file in *.filtered.wig5; do ./wig2batchfile.php "$file"; done

Produces:
- One Mutalyzer batch input file.


12) The position converter batch files are manually loaded into Mutalyzer:
https://v2.mutalyzer.nl/batchPositionConverter

Do not run more than a few files at a time. It will considerably slow down the
process, and will increase the chance of failures, after which you might need to
re-upload your file. Wait until Mutalyzer reports that the file has been
uploaded before uploading another file.

The test installation is slightly faster but there is no guarantee that it is
always online.


13) When done, download the result files from Mutalyzer. If you're not sure
anymore which results file belongs to which batchfile, you can match these files
by using the match_mutalyzer_output_to_batchfile.sh script.

Usage:
./match_mutalyzer_output_to_batchfile.sh \
    MUTALYZER_RESULTS_FILE [MUTALYZER_RESULTS_FILE [...]] \
    BATCH_FILE [BATCH FILE [...]]

Example:
./match_mutalyzer_output_to_batchfile.sh batch-job-* *mutalyzer_batchfile.txt

This script will rename the Mutalyzer results files according to the batch input
files.


ANALYSIS OF TRIPLET PERIODICITY
--------------------------------------------------------------------------------
14) The analyze_triplet_periodicity.php script reports the number of reads
relative to the annotated start codon. A standard feature of ribosome profiling
data is the presence of a major peak at position -12 relative to the annotated
start codon (harringtonin-treated samples), and an higher amount of reads whose
5' ends map to the first nucleotide of a codon (cycloheximide-treated samples).
In the output file of the analyze_triplet_periodicity.php script, numbers in
positions %1, %2 and %3 represent the sum of reads whose 5' end mapped to the
first, second and third nucleotide of the annotated codons, respectively.

Usage:
./analyze_triplet_periodicity.php MUTALYZER_RESULTS_FILE WIGGLE_FILE \
    mm10_gene_list.txt STRAND
(Valid values for STRAND: +, -, F and R)

Example:
./analyze_triplet_periodicity.php \
    A1.merged_wiggle.F.filtered.wig5_mutalyzer_batchfile_results.txt \
    A1.merged_wiggle.F.filtered.wig5 mm10_gene_list.txt F \
    > A1.merged_wiggle.F.filtered.wig5.periodicity.txt

Produces:
- Direct output to the screen (which should then be saved to a file if needed)
  with detailed statistics about the mapping of the positions to the
  transcriptome by Mutalyzer, followed by the list of positions relative to the
  TIS, and the total coverage for each position. Of the positions in the coding
  region, only the first 6 are shown individually. All positions in the coding
  region, including the positions up to 15 bases upstream of the coding region,
  are grouped by their location in the annotated codons (first, second and third
  base), and displayed as positions %1, %2 and %3 respectively, with the summed
  coverages to assess the overall triplet periodicity.
  Positions starting with an asterisk (*) are positions in the 3' UTR. The given
  number is the relative distance to the stop codon: *1 is the first base after
  the stop codon.

The script does the following things:
+ Read out the coverage information from the wiggle file.
+ Read the gene list to find out which genes (transcripts, actually) are on
  which strand.
+ Read the Mutalyzer result file to see which locations map to which
  transcripts, determining the position within the transcript.
  * Filter low coverage positions (< 3).
  * Mappings on non-coding transcripts are ignored.
  * Mappings on transcripts on the other strand are ignored.
  * Transcripts not in the gene list are ignored and reported.
  * Filter positions without transcript mapping.
  * Filter positions with only intronic mapping.
  * Filter positions too far from translation start or stop sites (> 500bp).
  * The remaining possible mappings are processed.
    Please note that the "coding region" is defined as the region starting at
    -15 nucleotides from the ATG to the end of the open reading frame.
    Please note that this script does not see the difference between the UTR and
    the intergenic sequence. Both are referred to as "UTR mappings".
    - 5' UTR or 3' UTR mappings, while mappings to the coding region are also
      available, are discarded.
    - If there are 5' UTR and 3' UTR mappings, but none in the coding
      region, we assume an intergenic situation and if one of the two
      positions is clearly closer to the coding region than the
      other (> 100bp closer), then the closest position is picked.
      If not, the position is discarded, and reported to the user.
    - After this, all mappings are counted by their coverage and
      reported.


MERGING BIOLOGICAL REPLICATES PRIOR TO PEAK CALLING
(IDENTIFICATION OF TRANSLATION INITIATION SITES (TISs))
--------------------------------------------------------------------------------
15) To increase the statistical power and to lower background signal, merge the
biological replicates of the harringtonin-treated samples. Both the wiggle files
as well as the Mutalyzer result files, need to be merged.
Examples, assuming A and C are harringtonin samples:

Merging the wiggle files:

for sample in A C;
do
  for strand in F R;
  do
    wiggelen merge ${sample}?.merged_wiggle.${strand}.filtered.wig5 \
      | sed 's/^\([0-9]\+\) \([0-9]\+\)/\1\t\2/' \
      > merged_${sample}.merged_wiggle.${strand}.filtered.wig5;
    rm ${sample}?.merged_wiggle.${strand}.filtered.wig5.idx;
  done
done

Merging the Mutalyzer files:

for sample in A C;
do
  for strand in F R;
  do
    head -n 1 ${sample}1.merged_wiggle.${strand}.filtered.wig5_mutalyzer_batchfile_results.txt \
      > merged_${sample}.merged_wiggle.${strand}.filtered.wig5_mutalyzer_batchfile_results.txt
    grep -h "^chr" ${sample}?.merged_wiggle.${strand}.filtered.wig5_mutalyzer_batchfile_results.txt \
      | sort -g | uniq \
      >> merged_${sample}.merged_wiggle.${strand}.filtered.wig5_mutalyzer_batchfile_results.txt
  done
done

The order of the variants in the Mutalyzer results file is sorted differently,
but this has no effect on the following analysis results.


PEAK CALLING: IDENTIFICATION OF PRIMARY ORFs, ALTERNATIVE ORFs and UPSTREAM ORFs
--------------------------------------------------------------------------------
16) The find_ORFs.php script is based on a local dynamic peak calling algorithm,
which tries to identify peaks (TISs) relative to the coverage of the surrounding
region.
NOTE: Run this script on the merged replicates of harringtonin samples, only.

Usage:
./find_ORFs.php MUTALYZER_RESULTS WIGGLE_FILE mm10_gene_list.txt STRAND
(Valid values for STRAND: +, -, F and R)

Example:
./find_ORFs.php \
  merged_A.merged_wiggle.F.filtered.wig5_mutalyzer_batchfile_results.txt \
  merged_A.merged_wiggle.F.filtered.wig5 mm10_gene_list.txt F

Produces:
- *.ORF_analysis_results_stats.txt
  Statistics on the ORF finding run; information on the loaded gene list file
  and wiggle file, mentions the number of positions left out on each filtering
  step, and mentions the number of genes left with found translation initiation
  sites.
- *.ORF_analysis_results.txt
  The results file shows, per gene, the number of positions found, the number
  actually analyzed as candidate peaks, and the number of TISs found. These are
  then reported with the genomic position, the coverage, and per transcript, the
  position relative to the annotated TIS on the transcript.
- *.ORF_analysis_results_after_cutoff.txt
  This file has the same format of the results file, but shows only TISs found
  in the annotated coding region, at least 5000 bases from the annotated TIS.
  (the value of 5000 can be configured in the script)
  For more information on this cutoff, please see de Klerk E., Fokkema I.F.A.C.
  et al (2015).

The script does the following things:
+ Read out the coverage information from the wiggle file
+ Read the gene list to find out which genes (transcripts, actually) are on
  which strand
+ Read the Mutalyzer result file to see which locations map to which
  transcripts, determining the position within the transcript
  * Filter low coverage positions (< 3)
  * Mappings on non-coding transcripts are ignored
  * Mappings on transcripts on the other strand are ignored
  * Transcripts not in the gene list are ignored and reported (so they could be
    fixed == added to the gene info file)
  * Filter positions without transcript mapping
  * Filter positions with only intronic mapping
  * Filter positions too far from translation start or stop sites (> 500bp)
  * The remaining possible mappings are processed.
    Please note that the "coding region" is defined as the region starting at
    -15 nucleotides from the ATG to the end of the open reading frame.
    Please note that this script does not see the difference between the UTR and
    the intergenic sequence. Both are referred to as "UTR mappings".
    - 3' UTR mappings, while mappings to the coding region or in the 5' UTR are
      also available, are discarded.
    - After this, all positions are stored, per gene, for the next step.
+ Peaks are searched for by walking through the positions, starting upstream.
  Each position, with at least a coverage of 20, is analyzed.
  (Please note, that 20 has been chosen for analysis of merged biological
    replicates. Otherwise, 10 was used.)
  * Its coverage should be higher than that of the positions located 3, 6, 9, 12
    and 15 nucleotides upstream of it.
  * Its coverage should be at least as high as that of the positions located 1
    and 2 nucleotides downstream of it.
  * It should show a triplet periodicity and a clear "harringtonin pattern":
    - The first nucleotide of the analyzed codon should have a coverage of at
      least 60% of the total coverage of that codon (configurable setting).
    - The following 5 codons should also not have a higher maximum coverage than
      this position's coverage.
    - If any of the following 5 codons have a maximum coverage higher than 10%
      of the coverage of the position we're analyzing, that codon must not show
      a conflicting triplet periodicity pattern (see first two points above).
  * If all these rules apply, the position is stored as a possible TIS.
+ Per gene, from all its found possible TISs, we will take the one with the
  highest coverage as a reference, and discard any other candidate TISs that do
  not have at least a coverage (on that position) of 10% of the reference
  (highest candidate).
+ All remaining candidate TISs will be reported.
+ Genes that have no candidate peaks left, are ignored.
+ For all genes remaining, the positions of the candidate TISs are reported
  relative to all known transcripts of the gene in question. The original number
  of positions with high coverage (>10) is mentioned along with the number of
  remaining candidate TISs.
  Results are displayed as: genomic position, coverage, position on transcript.
  The last column may be repeated, depending on the number of known transcripts.
+ Additionally, a separate file reports all candidate peaks found at a position
  higher than 5Kb from a known TIS, so they can easily be checked more
  thoroughly if they are false positives.


MERGING RESULT FILES
--------------------------------------------------------------------------------
17) To be able to statistically compare the possible switches in TIS usage, the
results of the ORF analyses need to be combined into one file, per strand (F and
R). The merge_ORF_files.php script merges the files into one summary file, but
does not take care of the strands, therefore make sure you don't mix forward and
reverse files with each other.

Usage:
./merge_ORF_files.php ORF_FILE [ORF_FILE [ORF_FILE [...]]]

Example:
./merge_ORF_files.php *.F.*.ORF_analysis_results.txt

Produces:
- One file which name depends on the given input files, ending in the suffix
  .merged_ORF_analyses.txt. The format is described below.

The script analyzes the file names to isolate the sample IDs. It assumes that
the sample IDs are the only differences between the input file names. The sample
IDs are also used in the headers of the output.

While grouping, it ignores the positions of the peaks on the transcripts. It
reports the chromosomal position, the coverage per sample, and the gene name.
Every peak that has been called in at least one sample, is displayed.

NOTE: If the original wiggle files of the individual biological replicates are
present in the same directory, these are all read out while merging the ORF
analysis result files, and the resulting file will report the individual
coverages of all biological replicates, per TIS peak.
If the wiggle files are not found, the coverages are taken from the ORF analysis
result files, which means when a coverage of 0 is reported, it simply means that
the position was not recognized as a TIS in that sample. The actual measured
coverage in that sample may not be 0.

Example output:
# Chromosome    Position    A1    A2    A3    C1    C2    C3    Gene
chr1            4857920     32    14    27     6     3    10    Tcea1
chr1            4857929     24     6    24     2     4     6    Tcea1


ANALYSIS OF SWITCHES IN TIS USAGE
--------------------------------------------------------------------------------
The output file of the merge_ORF_files.php script is used to test switches in
TIS usage. For this, the lme4.0 R package is used.

The input file for the following R code is a tab delimited file. The first line
containins a list of all sample names (in the below example, 6), separated by a
tab. All following lines contain as first column the gene symbols and then, per
sample, the actual individual coverages. The file is imported into R as a
dataframe, containing 6 columns and n rows (n=number of called peaks).

Example input for R:
A1    A2    A3    C1    C2    C3
Tcea1	32    14    27     6     3    10
Tcea1	24     6    24     2     4     6

Below is the R code used for the analysis.

library(lme4.0)
load("TIS_analysis.Rdata")
data_samples<-data_small[,2:7]
Myotubes <- grepl("C", names(data_samples))
coeff_summary_fit_col3_col4 <- list()
for(i in 1: length(list_data.multinom))
{
print(i)
mydata <- list_data.multinom [[i]]
loc <- mydata[,1]
N <- nrow( mydata)
M <- 6
weight <- colSums(mydata[,2:7])
if ((!weight[1] && !weight[2] && !weight[3]) || (!weight[4] && !weight[5] && !weight[6])) {
    print("No coverage available in at least one condition")
    next
}
dat <- as.vector(as.matrix(mydata[,2:7]))
loc_all <- rep(loc, M)
Myotubes_all <- rep(Myotubes, each=N)
weight_all <- rep(weight, each=N)
subj_all <- factor(rep(colnames(mydata[,2:7]), each=N))
fit<- lmer( dat/weight_all ~ 0 + Myotubes_all*loc_all - Myotubes_all + (loc_all|subj_all), family=binomial(), weight=weight_all)
fit0 <-  lmer( dat/weight_all ~ 0+loc_all + (loc_all|subj_all), family=binomial(), weight=weight_all)
coeff_summary_fit_col3_col4[[i]] <- coef(summary(fit))[(1+nrow(coef(summary(fit)))/2):nrow(coef(summary(fit))),3:4]
}
sink("coeff_summary_fit_col3_col4.csv")
coeff_summary_fit_col3_col4
sink()


TIS LOCATION AND MOTIF ANALYSIS
--------------------------------------------------------------------------------
18) The generate_stats_peaks_per_location.php script is used to (i) categorize
TISs based on their location relative to different gene regions, (ii) report
codon motifs, (iii) report the sequence of potential uORFs (sequence between
each TIS located in the 5'UTR and the annotated TIS. The script uses the ORF
analysis results files of the merged replicates. It takes all ORF analysis
results files, both the results after the cutoff as the ones before, and sums up
the number of peaks found per location: 5' UTR, annotated TIS, coding, 3' UTR,
or multiple.
"Multiple" means that the TIS location was mapping to multiple transcripts, and
in at least two of these, the location was different. An exception here is when
one of the locations where the TIS maps to, is "annotated_TIS". Then, it is
assumed that the other transcripts with other positions are not the translated
transcripts.

Note that to determine the locations of the reads, the positions that are given
by the find_ORFs.php script have to be increased by 12 bases, since these are
the 5' ends of the reads. This means that positions -12 through -10 are counted
as annotated TIS. Lower numbers are counted as 5'UTR, while higher numbers
(> -10) are counted as coding. Positions starting with an asterisk (*) are
counted as 3'UTR.

This also means that read start positions at the end of the coding region, less
than 12 bp away from the 3'UTR, are counted as coding while in fact in reality
they represent a read that lies in the 3'UTR. This can not be detected however,
because we don't know the length of the coding region of the transcripts.

To be able to report the TIS codon motifs and the sequence between each 5'UTR
TIS and the annotated TIS, this script downloads the RefSeq sequences of the
transcripts that the read was aligned to, automatically from the NCBI website
using the URL format:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NM_007386.2.gb&rettype=gb

To prevent repeated downloads on successive runs of the script, it requires an
"NM cache" directory where all downloaded sequences are stored and can be re-
used by successive runs. The name of the directory is currently configured in
the $_SETT settings array within the script. If the directory is not found or
not writable, the script will complain and refuse execution. When that happens,
check if the directory can be made writable. Otherwise create an empty directory
and set its name in the settings in the script source code.

Usage:
./generate_stats_peaks_per_location.php ORF_FILE [ORF_FILE [ORF_FILE [...]]]

Example:
./generate_stats_peaks_per_location.php *ORF_analysis_results*

Produces:
- *.ORF_analysis_results.stats_peaks_per_location.txt (one per sample)
  Reports the total number of TISs found as well as the total coverage of found
  TISs per gene region (5'UTR, annotated TIS, coding, 3'UTR, multiple). Both are
  reported since a high number of identified TISs may not correlate to also
  having high coverage in this region.
  The statistics are reported separately for all peaks before the cutoff and all
  peaks including the ones after the cutoff.
- *.ORF_analysis_results.peaks_classification.txt (one per sample)
  Reports all found TISs before the set cutoff, sorted on category, strand and
  genomic position. Other fields reported are the "real" genomic position of the
  TIS, calculated by adding (or substracting, for the negative strand) an offset
  of 12 nucleotides to the 5' read genomic positions, the gene, the summed
  coverage of the replicates in that position, the transcript RefSeq ID (only
  one reported), the position on that transcript including the position with the
  offset of 12 applied, optional status messages, and the sequence of the codon
  at the reported TIS.
  For TISs reported in the category "multiple", the information in the last four
  columns is missing.
  In case the RefSeq file has no CDS annotated, instead of the motif sequence
  the error "could_not_parse_CDS" is given.
  In case the RefSeq file has no 5'UTR annotated (CDS starts at position 1), or
  in case the RefSeq file has not enough 5'UTR annotated to fetch the full motif
  sequence (CDS starts at a position smaller than the distance of the TIS to the
  annotated TIS), the status is set to "no_5UTR" or "unannotated_5UTR",
  respectively, and a "slice" of the genomic sequence is downloaded. This slice
  starts from the corrected genomic position, and 75000 bases downstream. From
  this file, this script attempts to find the correct mRNA and CDS tags, takes
  the sequence until the annotated CDS, removing any annotated introns before
  the CDS. When the mRNA tag can not be found, the status also mentions the code
  "no_mRNA_definition". The motif is still reported, but the output file
  described below will not contain the sequence until the annotated TIS.
- *.ORF_analysis_results.peaks_classification_5UTR.txt (one per sample)
  Reports all TISs found in the 5'UTR, sorted on strand and genomic position.
  The fields reported are very similar to the fields mentioned in the previous
  file, with the following exceptions: the motif field is missing, and the
  following two fields are added: the DNA sequence starting at the TIS until the
  annotated TIS, and the translation of this sequence to protein sequence. If a
  found TIS can be mapped to different transcripts at different distances from
  the annotated TIS, all these transcripts are reported on a separate line.
  The protein sequence can contain a *, which indicates a stop-codon. A ?
  indicates an incomplete or unrecognizable codon. This should happen only if
  the sequence is in a different frame compared with the annotated TIS; in this
  case the DNA sequence given will end with an incomplete codon and the trans-
  lated sequence will end with a ?.
  Any messages that might be reported in the status column, are explained above
  in the description of the previous file.

The script is checking the names of input files to automatically ignore its own
result files, the statistics files generated by the find_ORFs.php script and
merged ORF files generated by the merge_ORF_files.php script. Input file names
need to contain the strand information ('.F.' or '.R.') and must end with either
'ORF_analysis_results.txt' or 'ORF_analysis_results_after_cutoff.txt'. Files are
grouped by their file name's prefix (until the strand information). As such, per
sample the script can find four files; two forward strand and two reverse
strand; each a file with peaks before cutoff, and one with peaks after the
cutoff.
If files are passed for which the file name is not recognized, they are reported
and the script will stop processing.


19) The result files of the generate_stats_peaks_per_location.php script contain
the motif sequences of every found TIS. The code to analyze the motif sequences
used is below. If needed, change the first line to use the sample names you wish
to analyze.

for sample in A C;
do
  echo "Analyzing sample ${sample}:" > merged_${sample}.motif_analysis_results.txt;
  for category in 5UTR unannotated_5UTR annotated_TIS coding 3UTR multiple;
  do
    if [[ $category == '5UTR' ]];
    then
      TOTAL=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | grep -vE "(unannotated|no)_5UTR" | cut -f 11 | grep -v '^#' | grep -v '^$' | wc -l`;
      TOTAL_COVERAGE=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | grep -vE "(unannotated|no)_5UTR" | cut -f 6,11 | grep -E "^[0-9]+\s...$" | grep -v '^#' | cut -f 1 | paste -sd+ | bc`;
    elif [[ $category == 'unannotated_5UTR' ]];
    then
      TOTAL=`grep -E "(unannotated|no)_5UTR" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep -v '^#' | grep -v '^$' | wc -l`;
      TOTAL_COVERAGE=`grep -E "(unannotated|no)_5UTR" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 6,11 | grep -E "^[0-9]+\s...$" | grep -v '^#' | cut -f 1 | paste -sd+ | bc`;
    else
      TOTAL=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep -v '^#' | grep -v '^$' | wc -l`;
      TOTAL_COVERAGE=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 6,11 | grep -E "^[0-9]+\s...$" | grep -v '^#' | cut -f 1 | paste -sd+ | bc`;
    fi
    echo -e "${category}\tTotal TISs:\t${TOTAL}\tTotal coverage:\t${TOTAL_COVERAGE}";
    echo -e "Motif\tTISs\tCoverage\tPercTISs\tPercCoverage";
    if [[ $category == '5UTR' ]];
    then
      MOTIFS=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | grep -vE "(unannotated|no)_5UTR" | cut -f 11 | grep -v '^#' | grep -v '^$' | sort | uniq`;
    elif [[ $category == 'unannotated_5UTR' ]];
    then
      MOTIFS=`grep -E "(unannotated|no)_5UTR" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep -v '^#' | grep -v '^$' | sort | uniq`;
    else
      MOTIFS=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep -v '^#' | grep -v '^$' | sort | uniq`;
    fi
    for motif in $MOTIFS;
    do
      if [[ $category == '5UTR' ]];
      then
        TISs=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | grep -vE "(unannotated|no)_5UTR" | cut -f 11 | grep "^${motif}$" | wc -l`;
        COVERAGE=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | grep -vE "(unannotated|no)_5UTR" | cut -f 6,11 | grep -E "^[0-9]+\s${motif}$" | cut -f 1 | paste -sd+ | bc`;
      elif [[ $category == 'unannotated_5UTR' ]];
      then
        TISs=`grep -E "(unannotated|no)_5UTR" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep "^${motif}$" | wc -l`;
        COVERAGE=`grep -E "(unannotated|no)_5UTR" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 6,11 | grep -E "^[0-9]+\s${motif}$" | cut -f 1 | paste -sd+ | bc`;
      else
        TISs=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 11 | grep "^${motif}$" | wc -l`;
        COVERAGE=`grep "${category}" merged_${sample}.merged_wiggle.ORF_analysis_results.peaks_classification.txt | cut -f 6,11 | grep -E "^[0-9]+\s${motif}$" | cut -f 1 | paste -sd+ | bc`;
      fi
      PERC_TISs=`echo "scale=2; $TISs * 100 / $TOTAL" | bc`;
      PERC_COVERAGE=`echo "scale=2; $COVERAGE * 100 / $TOTAL_COVERAGE" | bc`;
      echo -e "${motif}\t${TISs}\t${COVERAGE}\t${PERC_TISs}%\t${PERC_COVERAGE}%";
    done
  done >> merged_${sample}.motif_analysis_results.txt;
done

Produces:
- Per sample, one file containing per TIS category (5UTR, unannotated_5UTR,
  annotated_TIS, coding, 3UTR, multiple): the analyzed motif, the number of TISs
  with this motif, the total coverage of TISs with this motif, the percentage of
  TISs with this motif, and the percentage of reads on TISs with this motif.