Running the orthology search

felixlangschied edited this page Feb 16, 2023 · 2 revisions

Overview

Covariance models

  • Path to a directory that contains one covariance model for each reference miRNA
  • The filename of each model (without the ".cm" extension) must exactly match the identifier in the tab-separated miRNA input file
  • Each covariance model file must have the extension ".cm"
  • ncOrtho ignores all other files in the directory
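
The naming rule above can be checked before starting a run. The following is a minimal sketch, not part of ncOrtho: the function name `missing_models` is hypothetical, and it assumes that "exactly match" means the filename without its ".cm" extension equals the miRNA id.

```python
from pathlib import Path


def missing_models(model_dir, mirna_ids):
    """Return the miRNA ids that have no '<id>.cm' file in model_dir.

    Hypothetical helper for illustration; not part of ncOrtho.
    """
    # Only regular files ending in ".cm" count as models; other files are ignored.
    available = {
        p.stem
        for p in Path(model_dir).iterdir()
        if p.is_file() and p.suffix == ".cm"
    }
    # Keep the input order so the report is easy to compare with the table.
    return [mirna_id for mirna_id in mirna_ids if mirna_id not in available]
```

An empty return value means every id has a matching model file; the comparison is case-sensitive.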

Reference miRNAs

Information about all reference miRNAs has to be supplied in a single tab-separated file with 7 columns and no header line:

  1. Unique miRNA id
  2. Contig/Chromosome id (needs to match the one in the reference GFF file!)
  3. Start
  4. Stop
  5. Strand (+ or -)
  6. pre-miRNA sequence
  7. Mature miRNA sequence (no features that use the mature sequence are implemented yet, so this column can be filled with a placeholder such as "NA" or "None")

Example:

hsa-mir-552	NC_000001.11	34669599	34669694	-	AACCAUUCAAAUAUACCACAGUUUGUUUAACCUUUUGCCUGUUGGUUGAAGAUGCCUUUCAACAGGUGACUGGUUAGACAAACUGUGGUAUAUACA	NA
hsa-mir-30e	NC_000001.11	40754355	40754446	+	GGGCAGUCUUUGCUACUGUAAACAUCCUUGACUGGAAGCUGUAAGGUGUUCAGAGGAGCUUUCAGUCGGAUGUUUACAGCGGCAGGCUGCCA	NA
hsa-mir-30c-1	NC_000001.11	40757284	40757372	+	ACCAUGCUGUAGUGUGUGUAAACAUCCUACACUCUCAGCUGUGAGCUCAAGGUGGCUGGGAGAGGGUUGUUUACUCCUUCUGCCAUGGA	NA
hsa-mir-6733	NC_000001.11	43171652	43171712	-	GUGCUUGGGAAAGACAAACUCAGAGUUCCCUUCUUGUGAGCUCAGUGUCUGGAUUUCCUAG	NA

You can retrieve this information from popular databases like miRBase.
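
Because ncOrtho expects exactly seven tab-separated columns, it can help to validate the file before a long run. The sketch below is an illustration under stated assumptions, not ncOrtho's own parser: the names `MirnaRecord` and `read_mirna_table` are hypothetical, and the checks (column count, unique ids, strand symbol, integer coordinates) are derived only from the column list above.

```python
import csv
from typing import NamedTuple


class MirnaRecord(NamedTuple):
    # Field names are illustrative; they mirror the seven columns described above.
    mirna_id: str
    contig: str
    start: int
    stop: int
    strand: str
    pre_seq: str
    mature_seq: str


def read_mirna_table(path):
    """Read the 7-column miRNA table and raise ValueError on malformed rows."""
    records = []
    seen_ids = set()
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for lineno, row in enumerate(reader, start=1):
            if not row:
                continue  # skip empty lines
            if len(row) != 7:
                raise ValueError(f"line {lineno}: expected 7 columns, found {len(row)}")
            mirna_id, contig, start, stop, strand, pre_seq, mature_seq = row
            if mirna_id in seen_ids:
                raise ValueError(f"line {lineno}: duplicate miRNA id {mirna_id!r}")
            if strand not in ("+", "-"):
                raise ValueError(f"line {lineno}: strand must be '+' or '-'")
            seen_ids.add(mirna_id)
            # int() raises ValueError itself if a coordinate is not a number.
            records.append(
                MirnaRecord(mirna_id, contig, int(start), int(stop), strand, pre_seq, mature_seq)
            )
    return records
```

The returned ids can then be compared against the covariance model directory, for example with `[r.mirna_id for r in read_mirna_table("mirnas.tsv")]`, where `mirnas.tsv` is a placeholder filename.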

Fasta data

  • Genomic sequence in FASTA format (e.g., "genomic.fna" from RefSeq or "dna.toplevel.fa" from Ensembl)
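
When combining files from different sources, it can be useful to see which sequence identifiers a FASTA file actually contains, for example to compare them with the contig/chromosome ids in the miRNA table. The sketch below is a plain-Python illustration, not an ncOrtho function: `fasta_ids` is a hypothetical name, it assumes the identifier is the first whitespace-delimited token of each header line, and it does not handle compressed files.

```python
def fasta_ids(path):
    """Return the sequence identifiers of an uncompressed FASTA file, in file order.

    Hypothetical helper; assumes the id is the first token after '>'.
    """
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:  # ignore headers that consist of '>' only
                    ids.append(fields[0])
    return ids
```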

Full help text

#########################################################
###                                                   ###
###   ncOrtho - ortholog search for non-coding RNAs   ###
###                                                   ###
#########################################################

usage: ncSearch [-h] -m <path> -n <path> -o <path> -q <.fa> -r <.fa> [--queryname [str]] [--cpu [int]] [--cm_cutoff [float]] [--minlength [float]] [--heuristic [True/False]]
                [--heur_blast_evalue [float]] [--heur_blast_length [float]] [--cleanup [True/False]] [--refblast [<path>]] [--queryblast [<path>]] [--maxcmhits [int]] [--dust [yes/no]]
                [--checkCoorthologsRef [True/False]]

Find orthologs of reference miRNAs in the genome of a query species.

Required Arguments:
  -m <path>, --models <path>
                        Path to directory containing covariance models (.cm)
  -n <path>, --ncrna <path>
                        Path to Tab separated file with information about the reference miRNAs
  -o <path>, --output <path>
                        Path to the output directory
  -q <.fa>, --query <.fa>
                        Path to query genome in FASTA format
  -r <.fa>, --reference <.fa>
                        Path to reference genome in FASTA format

Optional Arguments:
  --queryname [str]     Name for the output directory (RECOMMENDED)
  --cpu [int]           Number of CPU cores to use (Default: all available)
  --cm_cutoff [float]   CMsearch bit score cutoff, given as ratio of the CMsearch bitscore of the CM against the refernce species (Default: 0.5)
  --minlength [float]   CMsearch hit in the query species must have at least the length of this value times the length of the refernce pre-miRNA (Default: 0.7)
  --heuristic [True/False]
                        Perform a BLAST search of the reference miRNA in the query genome to identify candidate regions for the CMsearch. Majorly improves speed. (Default: True)
  --heur_blast_evalue [float]
                        Evalue filter for the BLASTn search that determines candidate regions for the CMsearch when running ncOrtho in heuristic mode. (Default: 0.5) (Set to 10 to turn off)
  --heur_blast_length [float]
                        Length cutoff for BLASTN search with which candidate regions for the CMsearch are identified.Cutoff is given as ratio of the reference pre-miRNA length (Default: 0.5) (Set to 0
                        to turn off)
  --cleanup [True/False]
                        Cleanup temporary files (Default: True)
  --refblast [<path>]   Path to BLASTdb of the reference species
  --queryblast [<path>]
                        Path to BLASTdb of the query species
  --maxcmhits [int]     Maximum number of cmsearch hits to examine. Decreases runtime significantly if reference miRNA in genomic repeat region. Set to empty variable to disable (i.e. --maxcmhits=None,
                        default)
  --dust [yes/no]       Use BLASTn dust filter during re-BLAST. Greatly decreases runtime if reference miRNA(s) are located in repeat regions. However ncOrtho will also not identify orthologs for these
                        miRNAs
  --checkCoorthologsRef [True/False]
                        If the re-blast does not identify the original reference miRNA sequence as best hit,ncOrtho will check whether the best blast hit is likely a co-ortholog of the reference miRNA
                        relative to the search taxon. NOTE: Setting this flag will substantially increasethe sensitivity of HaMStR but most likely affect also the specificity, especially when the
                        search taxon is evolutionarily only verydistantly related to the reference taxon (Default: False)
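
As an illustration of how the required arguments fit together, the sketch below assembles an `ncSearch` command line using only flags that appear in the help text above. The function name `build_ncsearch_command` and all paths are placeholders. The command is only printed, not executed.

```python
import shlex


def build_ncsearch_command(models, ncrna, output, query, reference,
                           queryname=None, cpu=None):
    """Assemble an ncSearch argument list (hypothetical helper, paths are placeholders)."""
    cmd = [
        "ncSearch",
        "-m", str(models),      # directory with the .cm covariance models
        "-n", str(ncrna),       # tab-separated reference miRNA table
        "-o", str(output),      # output directory
        "-q", str(query),       # query genome (FASTA)
        "-r", str(reference),   # reference genome (FASTA)
    ]
    if queryname is not None:
        cmd += ["--queryname", str(queryname)]
    if cpu is not None:
        cmd += ["--cpu", str(cpu)]
    return cmd


cmd = build_ncsearch_command(
    "models/", "mirnas.tsv", "out/", "query.fna", "reference.fna",
    queryname="query_species", cpu=4,
)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files. The remaining optional flags are left at their defaults here.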
