This repository provides tools for improving the sensitivity and specificity of pairwise genome alignments [1,2,3]. These tools make use of lastz and the alignment chain and net concept [4].
These tools integrate into the standard lastz/chain/net workflow of genome alignment as follows:
- genome-wide local alignments with lastz
- building alignment chains
- NEW: highly-sensitive local alignments to improve alignment chains by detecting remote homologies
- NEW: chainCleaner to improve the specificity in alignment chains
- NEW: chainNet with parameter -rescore to compute exact scores of all nets
- NEW: non-nested net filtering to keep nets that likely represent orthologous alignments
- install lastz
- install the Kent (UCSC) source code by doing
# go to whereever you want to have the Kent source code
cd /path/to/
# git clone and make the libraries
git clone https://github.com/ucscGenomeBrowser/kent.git kent
cd kent
export USE_SAMTABIX=0
export USE_SSL=1
cd src && make libs && cd ../..
These binaries should be located in a directory that is added to your $PATH environment variable.
The perl and python scripts just need to be copied to a directory that is contained in your $PATH environment variable. Only chainCleaner and the modified chainNet (with the -rescore option) need to be compiled. Since both tools need the Kent (UCSC) source code, you need to set the KENTSRC_DIR variable to the path to the kent source code (should end with kent/src/).
export KENTSRC_DIR=/path/to/kent/src/
Afterwards, compile chainCleaner, chainNet and scoreChain
git clone https://github.com/hillerlab/GenomeAlignmentTools.git
cd GenomeAlignmentTools/src
export MACHTYPE=x86_64
make
The subdirectory precompiledBinary_x86_64/ already contains Linux x86_64 compiled binaries.
RepeatFiller [5] is a tool to incorporate newly-detected repeat-overlapping alignments into pairwise alignment chains [4]. Its runtime adds little to the computationally more expensive step of generating chains in pairwise whole-genome alignments. RepeatFiller circumvents the problem that considering all repeat-overlapping alignment seeds during whole genome alignment is computationally not feasible. Therefore, RepeatFiller only aligns local genomic regions that are bounded by colinear aligning blocks, as provided in the chains, which makes it feasible to consider all seeds including those that overlap repetive regions. RepeatFiller application to mammalian genome alignment chains can add between 22 and 84 Mb of previously-undetected alignments that mostly originate from transposable elements [5]. This helps to comprehensively align repetitive regions and improves the annotation of conserved non-coding elements.
Usage:
RepeatFiller.py -c in.chain -T2 target.2bit -Q2 query.2bit [options]
in.chain alignment file in chain format
target.2bit full path to a .2bit file of reference species
query.2bit full path to a .2bit file of query species
[options] call RepeatFiller.py --help to see the full list of parameters controlling
size and score thresholds of chains, and lastz and chaining options
example: RepeatFiller.py -c hg38.speTri2.all.chain -T2 hg38.2bit -Q2 speTri2.2bit
patchChain.perl performs a highly sensitive local pairwise alignment for loci flanked by aligning blocks [1,3]. Given an alignment chain [4], it considers all chains that pass the score and span filters (optional parameters), extracts all the unaligning loci and creates local alignment jobs. After executing these alignment jobs, the newly found and the original local alignments are combined and used to produce a new set of improved chains.
This procedure is recommended for comparisons between species that are separated by >0.75 substitutions per neutral site [1].
Usage:
patchChain.perl chain.infile 2bit_target_fullPath 2bit_query_fullPath \
chrom.sizes_target_fullPath chrom.sizes_query_fullPath [options]
Call patchChain.perl without any parameters to see the full parameter list.
Example: Here, we use the parameters for highly sensitive local alignments from [1] and the HoxD55 scoring matrix (provided by the kent source code (subdirectory src/blatz/HoxD55.q); also contained in the example directory).
patchChain.perl example/hg38.danRer10.chain example/hg38.2bit example/danRer10.2bit example/hg38.chrom.sizes example/danRer10.chrom.sizes \
-chainMinScore 5000 -gapMaxSizeT 500000 -gapMaxSizeQ 500000 -gapMinSizeT 30 -gapMinSizeQ 30 \
-numJobs 10 -jobDir jobs -jobList jobList -outputDir pslOutput \
-minEntropy 1.8 -windowSize 30 -minIdentity 60 \
-lastzParameters "--format=axt K=1500 L=2500 M=0 T=0 W=5 Q=example/HoxD55.q"
# this results in 10 alignment jobs that are located in jobs/ and listed in 'jobList'
# before executing, download the human and zebrafish genome
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit
wget http://hgdownload.cse.ucsc.edu/goldenPath/danRer10/bigZips/danRer10.2bit
mv danRer10.2bit hg38.2bit example/
# now execute these jobs on a compute cluster or run them sequentially by doing 'chmod +x jobList; ./jobList'
# concatenate all new results
find pslOutput -name "*.psl" | xargs -i cat {} > newAlignments.psl
# combine the genome-wide lastz results (the combined psl file that was used to create the input chains) and the newly found psl alignments
cat genomeWide.lastz.psl newAlignments.psl > all.psl
# use axtChain from the Kent source to compute alignment chains that include the new alignments
chainCleaner improves the specificity in genome alignment chains by detecting and removing local alignments that obscure the evolutionary history of genomic rearrangements [2]. The input is a chain file, ideally after adding alignments found with highly sensitive parameters if distal species are compared. The output is a chain file that contains re-scored and score-sorted chains after removing the local alignments from the parent chains and adding them as individual chains. The resulting output file can be used to get alignment nets by running chainNet [4].
Usage:
chainCleaner in.chain tNibDir qNibDir out.chain out.bed -net=in.net
OR
chainCleaner in.chain tNibDir qNibDir out.chain out.bed -tSizes=/dir/to/target/chrom.sizes -qSizes=/dir/to/query/chrom.sizes
First option: you have netted the chains and specify the net file via -net=netFile
Second option: you have not netted the chains. Then chainCleaner will net them. In this case, you must specify the chrom.sizes file for the target and query with -tSizes/-qSizes
tNibDir/qNibDir are either directories with nib files, or the name of a .2bit file
output:
out.chain output file in chain format containing the untouched chains, the original broken chain and the modified breaking chains. NOTE: this file is chainSort-ed.
out.bed output file in bed format containing the coords and information about the removed chain-breaking alignments.
Call chainCleaner without any parameters to see the full parameter list.
Example run of chainCleaner: Before executing, download the human and mouse genome.
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit
wget http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/mm10.2bit
mv hg38.2bit mm10.2bit example/
Run chainCleaner on the chains provided in example/
chainCleaner example/hg38.mm10.chr1.chain.gz -tSizes=example/hg38.chrom.sizes \
-qSizes=example/mm10.chrom.sizes example/hg38.2bit example/mm10.2bit \
example/hg38.mm10.chr1.cleaned.chain example/removedSuspects.bed -linearGap=loose
Verbosity level: 1
foldThreshold: 0.000000 LRfoldThreshold: 2.500000 maxSuspectBases: 2147483647 maxSuspectScore: 100000 minBrokenChainScore: 50000 minLRGapSize: 0
0. need to net the input chains example/hg38.mm10.chr1.chain.gz (no net file given) ...
tempfile for netting: tmp.chainCleaner.XLJ2Vxs.net
Got 455 chroms in example/hg38.chrom.sizes, 66 in example/mm10.chrom.sizes
Finishing nets
writing stdout
writing /dev/null
DONE (nets in tmp.chainCleaner.XLJ2Vxs.net)
1. parsing fills/gaps from tmp.chainCleaner.XLJ2Vxs.net and getting valid breaks ...
1.1 read net file tmp.chainCleaner.XLJ2Vxs.net into memory ...
DONE
1.2 get fills/gaps from tmp.chainCleaner.XLJ2Vxs.net ...
DONE
1.3 get aligning regions from tmp.chainCleaner.XLJ2Vxs.net ...
DONE
1.4 get valid breaks ...
DONE
Remove temporary netfile tmp.chainCleaner.XLJ2Vxs.net
DONE (parsing fills/gaps and getting valid breaks)
2. reading breaking and broken chains from example/hg38.mm10.chr1.chain.gz and write irrelevant chains to example/hg38.mm10.chr1.cleaned.chain.unsorted ...
DONE
3. reading target and query DNA sequences for breaking and broken chains ...
DONE
4. loop over all breaks. Remove suspects if they pass our filters and write out deleted suspects to example/removedSuspects.bed ...
DONE
5. write the (new) breaking and the broken chains to example/hg38.mm10.chr1.cleaned.chain.unsorted ...
DONE
6. chainSort example/hg38.mm10.chr1.cleaned.chain.unsorted example/hg38.mm10.chr1.cleaned.chain ...
DONE
7. free memory ...
DONE
memory usage 5562015744, utime 1058 s/100, stime 256
ALL DONE. New chains are in example/hg38.mm10.chr1.cleaned.chain. Deleted suspects in example/removedSuspects.bed
Given a set of alignment chains, chainNet produces alignment nets, which is a hierarchical collection of chains or parts of chains that attempt to capture only orthologous alignments [4]. The original chainNet implementation approximates the score of "sub-nets" (nets that come from a part of a chain and fill a gap in a higher-level net) by the fraction of aligning bases. This can lead to a bias in case the aligning blocks of a chain are not equally distributed. We implemented a new parameter "-rescore" in chainNet that computes the real score of each subnet [2].
Usage:
chainNet - Make alignment nets out of chains
usage:
chainNet in.chain target.sizes query.sizes target.net query.net
where:
in.chain is the chain file sorted by score
target.sizes contains the size of the target sequences
query.sizes contains the size of the query sequences
target.net is the output over the target genome
query.net is the output over the query genome
options:
-rescore compute the real score of the sub-net (instead of approximating it based on the fraction of aligning bases in the subnet)
The real score will be much more precise especially for imbalanced chains where most aligning blocks are on one side.
This flag will set minScore=0. Each subnet with a negative score gets score 1. Afterwards, run a non-nested score filter.
Note: Rescoring is only implemented for the target species net.
With this flag, you need to give the target and query genome sequence (-tNibDir and -qNibDir) and specify -linearGap
-tNibDir=fileName target genome file (2bit or nib format)
-qNibDir=fileName query genome file (2bit or nib format)
Call chainNet without any parameters to see the full parameter list.
Before building a multiple alignment from the pairwise alignment nets, it is recommended to remove low-scoring alignment nets that are unlikely to represent real homologies. While the netFilter program [4] removes nested nets in case their parent net does not satisfy the specified score and size criteria, NetFilterNonNested.perl applies a non-nested filtering procedure that considers and filters each net individually [1,3]. This avoids removing nested nets that would satisfy the specified criteria, even if a parent net is removed.
Usage: In [1], we applied the following filter criteria:
- UCSC "syntenic net" criteria (thresholds: minTopScore=300000, minSynScore=200000, minSynSize=20000, minSynAli=10000, maxFar=200000) in addition to keeping all nested nets that align
to the same locus (inversions or local translocations) if they score higher than 5000 were applied to all placental mammal alignments that have well-assembled genomes
NetFilterNonNested.perl -doUCSCSynFilter -keepSynNetsWithScore 5000 -keepInvNetsWithScore 5000 ref.query.net.gz > ref.query.filtered.net
- keeping nets that score higher than 100000 and keeping all nested nets that align to the same locus if they score higher than 5000 for placental mammals with less-well assembled genomes
NetFilterNonNested.perl -doScoreFilter -keepSynNetsWithScore 5000 -keepInvNetsWithScore 5000 -minScore1 100000 ref.query.net.gz > ref.query.filtered.net
- keeping nets that score higher than 10000 and keeping all nested nets that align to the same locus if they score higher than 3000 for non-placental mammals
NetFilterNonNested.perl -doScoreFilter -keepSynNetsWithScore 3000 -keepInvNetsWithScore 3000 -minScore1 10000 ref.query.net.gz > ref.query.filtered.net
Call NetFilterNonNested.perl without any parameters to see all filtering options.
[1] Sharma V, Hiller M. Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation. Nucleic Acids Res., 45(14), 8369–8377, 2017
[2] Suarez H, Langer BE, Ladde P, Hiller M. chainCleaner improves genome alignment specificity and sensitivity. Bioinformatics, 33(11):1596-1603, 2017
[3] Hiller M, Agarwal S, Notwell JH, Parikh R, Guturu H, Wenger AM, Bejerano G. Computational methods to detect conserved non-genic elements in phylogenetically isolated genomes: application to zebrafish. Nucleic Acids Res, 41(15):e151.
[4] Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. PNAS, 100(20):11484-9, 2003
[5] Osipova E, Hecker N, Hiller M. RepeatFiller newly identifies megabases of aligning repetitive sequences and improves annotations of conserved non-exonic elements, submitted