Commit

Merge pull request #3 from dmitrymyl/module_development
Module development
dmitrymyl authored Jun 30, 2021
2 parents 820ac69 + daec9e2 commit 940504b
Showing 90 changed files with 3,685 additions and 24,793 deletions.
58 changes: 3 additions & 55 deletions README.md
@@ -2,61 +2,9 @@
Sequence alignment tool based on syntenic protein neighbourhood derived from OrthoDB.

# What is it?
`ortho2align` is a package for aligning nucleotide sequences from one species against another species' genome. It uses OrthoDB data on orthologous proteins to construct regions of quasi-synteny and uses them as guides for alignment.

# For what tasks this package can be used?
* general alignment of any sequences from the query genome to the target genome;
* alignment of low-conservative sequences to the target genome (ncRNAs etc.).
# What can it be used for?

# Dependencies
The package was tested with the following dependencies:
* python 3.6
* pandas 0.24?
* blat v. 35
* bedtools 2.25?
* OrthoDB v10
# Installation

# How to run
TODO: add sample code to run:
1. Getting orthodb map file.
2. Running ortho2align.py

# How does it work?
Input files are:
* genome of query species
* genome of target species
* protein annotation of query species
* genome annotation of target species
* OrthoDB map file
* coordinates of query sequences in the genome of query species

First, the package infers the protein neighbourhood of each query sequence in the query genome within the given radius using `bedtools window -w radius`. Next, it retrieves orthologs of the neighbouring proteins in the target genome and constructs quasi-syntenic regions. Orthologous proteins in the target genome that lie within the merge distance of each other are merged. Once no further merges are possible, the resulting syntenic ranges can be extended by flanks of a given length. This can help when the protein neighbourhood of a query sequence contains only one protein. Because of paralogues, one query sequence may have more than one quasi-syntenic region.
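
The merging and flanking steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the package's own code: the function name `merge_and_flank`, the `(chromosome, start, end)` tuple layout and the parameter names are assumptions made for the example, and strand and chromosome sizes are ignored.

```python
from typing import List, Tuple

# Hypothetical record layout for this sketch: (chromosome, start, end).
Interval = Tuple[str, int, int]


def merge_and_flank(intervals: List[Interval],
                    merge_dist: int,
                    flank_dist: int) -> List[Interval]:
    """Merge same-chromosome intervals that lie within merge_dist of each
    other, then extend every merged range by flank_dist on both sides."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= merge_dist:
            # Close enough to the previous range: grow it instead of adding a new one.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    # Flank each range; the start is clipped at 0, the end is NOT clipped
    # to the chromosome size in this sketch.
    return [(c, max(0, s - flank_dist), e + flank_dist) for c, s, e in merged]


orthologs = [("chr1", 100, 200), ("chr1", 250, 300),
             ("chr1", 1000, 1100), ("chr2", 50, 60)]
regions = merge_and_flank(orthologs, merge_dist=100, flank_dist=10)
```

Sorting first means a single pass is enough: once a range cannot absorb the next interval, no later interval on that chromosome can reach it either.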

TODO: add image of synteny estimation.

Query sequences and quasi-syntenic regions are extracted from the corresponding genomes with `bedtools getfasta`, one fasta file per sequence, with names following the pattern `chromosome:start-end(strand)`, where chromosome corresponds to the fasta headers in the genome files, start and end are integer genomic positions, and strand is either "+", "-" or "." (without quotes).
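
The naming pattern above can be parsed back into its parts with a regular expression. The sketch below is illustrative only: `NAME_PATTERN` and `parse_name` are hypothetical names introduced for this example and are not part of the package.

```python
import re

# chromosome:start-end(strand), strand being "+", "-" or "."
NAME_PATTERN = re.compile(
    r"^(?P<chrom>.+):(?P<start>\d+)-(?P<end>\d+)\((?P<strand>[+\-.])\)$"
)


def parse_name(name):
    """Split a sequence name into (chromosome, start, end, strand)."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"name does not follow chromosome:start-end(strand): {name!r}")
    return (match.group("chrom"),
            int(match.group("start")),
            int(match.group("end")),
            match.group("strand"))
```

For instance, `parse_name("chr1:100-200(+)")` yields the chromosome, both coordinates as integers, and the strand symbol.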

Then query sequences are aligned against their syntenic regions with BLAT. The user can define the tile size and the minimal sequence identity to report. BLAT was chosen for its convenient psl3 alignment format, which provides an exon-intron-like structure of aligned regions. Aligning many sequences can take a long time, so the user can specify how many cores to use for the alignment process.

The main output file is a `json` file containing an array of dictionary records, one query sequence per record. Each record contains information about the query sequence, its protein neighbourhood, syntenic regions and found alignments. Intermediate files are created in the working directory. The main working format is json due to its convenient interoperability with Python data structures.

# Structure of the package
All scripts produce json files.
* `get_neighbourhood.py` retrieves protein neighbourhood of query sequences in query genome based on supplied protein annotation in gtf|gff format.
* `extract_mapping.py` extracts orthodb mapping data from bulk file for one species.
* `chromsizes_fasta.py` gets chromosome sizes of given genome in fasta format and returns them in json format.
* `annotation2json.py` translates the given genome annotation of the target species into json format as a list of dictionaries.
* `map_synteny.py` maps the protein neighbourhood to orthologs in the target species and composes syntenic ranges.
* `get_fasta.py` retrieves query and syntenic target sequences, one per file, from the given query and target genomes.
* `grid_alignment.py` performs alignment of query sequences to target syntenies.
* `ortho2align.py` is the master script that runs the scripts listed above sequentially. All output files produced by the scripts have fixed names, so the user can run each step separately as long as the naming conventions are followed.

TODO: add image of data flow within scripts.

# TODO
* complete README.md
* add synteny map example
* add orthodb file processing
* add examples folder
* correct software versions
# Usage
9 changes: 9 additions & 0 deletions configs/annotate_orthologs.config
@@ -0,0 +1,9 @@
annotate_orthologs
-subject_orthologs
SUBJECT_ORTHOLOGS
-subject_annotation
SUBJECT_ANNOTATION
-subject_name_regex
SUBJECT_NAME_REGEX
-output
OUTPUT
11 changes: 11 additions & 0 deletions configs/bg_from_inter_ranges.config
@@ -0,0 +1,11 @@
bg_from_inter_ranges
-genes
GENES_FILENAME
-name_regex
None
-sample_size
SAMPLE_SIZE
-seed
0
-output
OUTPUT_FILENAME
13 changes: 13 additions & 0 deletions configs/bg_from_shuffled_ranges.config
@@ -0,0 +1,13 @@
bg_from_shuffled_ranges
-genes
GENES_FILENAME
-genome
GENOME_FILENAME
-name_regex
None
-sample_size
SAMPLE_SIZE
-seed
123
-output
OUTPUT_FILENAME
17 changes: 17 additions & 0 deletions configs/build_orthologs.config
@@ -0,0 +1,17 @@
build_orthologs
-alignments
ALIGNMENTS
-background
BACKGROUND
-fitting
kde
-threshold
0.05
--fdr
-outdir
OUTDIR
-cores
1
-timeout
None
--silent
24 changes: 24 additions & 0 deletions configs/estimate_background.config
@@ -0,0 +1,24 @@
estimate_background
-query_genes
QUERY_GENES_FILENAME
-bg_ranges
BG_RANGES_FILENAME
-query_genome
QUERY_GENOME_FILENAME
-subject_genome
SUBJECT_GENOME_FILENAME
-query_name_regex
None
-bg_name_regex
None
-word_size
6
-observations
1000
-outdir
OUTDIR
-cores
1
-seed
123
--silent
24 changes: 24 additions & 0 deletions configs/get_alignments.config
@@ -0,0 +1,24 @@
get_alignments
-query_genes
QUERY_GENES
-query_genome
QUERY_GENOME
-subject_genome
SUBJECT_GENOME
-query_name_regex
None
-liftover_chains
LIFTOVER_CHAINS
-outdir
OUTDIR
-min_ratio
0.05
-word_size
6
-merge_dist
2000000
-flank_dist
50000
-cores
1
--silent
15 changes: 15 additions & 0 deletions configs/get_best_orthologs.config
@@ -0,0 +1,15 @@
get_best_orthologs
-query_orthologs
QUERY_ORTHOLOGS
-subject_orthologs
SUBJECT_ORTHOLOGS
-value
block_length
-function
max
-outfile_query
OUTFILE_QUERY
-outfile_subject
OUTFILE_SUBJECT
-outfile_map
OUTFILE_MAP
46 changes: 46 additions & 0 deletions configs/run_pipeline.config
@@ -0,0 +1,46 @@
run_pipeline
-query_genes
QUERY_GENES
-query_genome
QUERY_GENOME
-subject_annotation
SUBJECT_ANNOTATION
-subject_genome
SUBJECT_GENOME
-query_name_regex
None
-subject_name_regex
None
-liftover_chains
LIFTOVER_CHAINS
-outdir
OUTDIR
-cores
1
-word_size
6
-seed
0
--silent
--annotate
-sample_size
200
-observations
1000
-min_ratio
0.05
-merge_dist
2000000
-flank_dist
50000
-fitting
kde
-threshold
0.05
--fdr
-timeout
None
-value
block_length
-function
max
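
The config files above list one token per line: the subcommand name first, then flags and their values. That layout matches what Python's `argparse` accepts through `fromfile_prefix_chars`. Whether ortho2align loads its configs this way is an assumption. The sketch below builds a stand-in parser (`build_parser` and `parse_config_text` are hypothetical names) for the `annotate_orthologs` flags only and shows how such a file would be read.

```python
import argparse
import os
import tempfile


def build_parser():
    """Stand-in parser, NOT the package's real CLI: it only mirrors the
    flags listed in configs/annotate_orthologs.config."""
    parser = argparse.ArgumentParser(prog="demo", fromfile_prefix_chars="@")
    subparsers = parser.add_subparsers(dest="command")
    sub = subparsers.add_parser("annotate_orthologs")
    for flag in ("-subject_orthologs", "-subject_annotation",
                 "-subject_name_regex", "-output"):
        sub.add_argument(flag)
    return parser


def parse_config_text(text):
    """Write the config text to a temporary file and parse it as @file."""
    with tempfile.NamedTemporaryFile("w", suffix=".config", delete=False) as handle:
        handle.write(text)
        path = handle.name
    try:
        # argparse replaces "@path" with the file's lines, one argument per line.
        return build_parser().parse_args(["@" + path])
    finally:
        os.remove(path)


config_text = "\n".join([
    "annotate_orthologs",
    "-subject_orthologs", "SUBJECT_ORTHOLOGS",
    "-subject_annotation", "SUBJECT_ANNOTATION",
    "-subject_name_regex", "SUBJECT_NAME_REGEX",
    "-output", "OUTPUT",
]) + "\n"
args = parse_config_text(config_text)
```

Note that with this mechanism every value arrives as a string, so a line such as `None` would be read as the text `"None"` unless the parser converts it explicitly.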
20 changes: 0 additions & 20 deletions docs/Makefile

This file was deleted.

Binary file removed docs/build/doctrees/environment.pickle
Binary file not shown.
Binary file removed docs/build/doctrees/index.doctree
Binary file not shown.
Binary file removed docs/build/doctrees/modules.doctree
Binary file not shown.
Binary file removed docs/build/doctrees/ortho2align.doctree
Binary file not shown.
4 changes: 0 additions & 4 deletions docs/build/html/.buildinfo

This file was deleted.

20 changes: 0 additions & 20 deletions docs/build/html/_sources/index.rst.txt

This file was deleted.

7 changes: 0 additions & 7 deletions docs/build/html/_sources/modules.rst.txt

This file was deleted.

62 changes: 0 additions & 62 deletions docs/build/html/_sources/ortho2align.rst.txt

This file was deleted.
