forked from bioprojects/orderedPainting
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README
executable file
·217 lines (137 loc) · 7.04 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
This code is written by Koji Yahara, and
implements the algorithm described in the following paper:
Koji Yahara, Xavier Didelot, M. Azim Ansari, Samuel K. Sheppard and Daniel Falush
Efficient inference of recombination hot regions in bacterial genomes
Molecular Biology and Evolution (2014)
orderedPainting is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by
the Free Software Foundation.
orderedPainting is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
#---------------------------------------------------------
# 1. Setting of your grid engine
#---------------------------------------------------------
orderedPainting is implemented based on a grid engine
(UGE/SGE or LSF) which can execute array jobs.
The default setting is UGE/SGE. If you would like to use
LSF instead, please edit "lib/env_func.bashrc" as below
#QUEUE_TYPE="SGE" # comment out
#QSUB_COMMON="qsub -cwd " # comment out
#MAX_MEMORY="-l s_vmem=8G -l mem_req=8" # comment out
QUEUE_TYPE="LSF" # comment in instead
QSUB_COMMON="bsub " # comment in instead
MAX_MEMORY="-M 8000000" # comment in instead
#---------------------------------------------------------
# 2. Setup
#---------------------------------------------------------
Please execute "setup.sh" as follows.
It will set programs used in orderedPainting.sh
(e.g., download and compile ChromoPainter and GNU sort).
$ /bin/bash setup.sh
Compilation of one of the programs (GNU sort) requires a few minutes.
Please wait during that time.
#---------------------------------------------------------
# 3. Basic usage
#---------------------------------------------------------
The following is the most basic example which uses files in the "example1" directory.
$ qsub -cwd -o example1.log -e example1.log <<< "/bin/bash ./orderedPainting.sh -g example1/simulatedData_N50.hap -l example1/strainName_N50.txt"
(If you use LSF, please use "bsub" instead of "qsub -cwd")
The format of the ".hap" file is the same as ChromoPainter.
The ".list" is a 1-column file which indicates name of each strain in the ".hap" file.
#---------------------------------------------------------
# 4. Output
#---------------------------------------------------------
Please check the log file you specified above.
$ tail example1.log
If the program finished running normally, the following message
will be written in the log file
Done. Please look at output files in simulatedData_N50_orderedS1_rnd_1_10_dirs_results
Please look at the output files there.
$ ls simulatedData_N50_orderedS1_rnd_1_10_dirs_results/
The following files and directories will be seen.
|
|-- results_siteStats.txt.gz
|-- results_siteStats_along_seq.png
|-- results_siteStats_hist.png
|-- results_siteStats_summary.pos.txt
|
|-- (results_siteStats_missingCnt.png
, which will be produced only if imputed data are analyzed)
If "-x" was specified, the following files and directories are also created.
|
|-- sum_site_minus_average.summary.range.txt
|-- sum_site_minus_average.summary.txt.gz
|
|-- visualize_bottom/
|-- visualize_bottom.list
|-- visualize_middle/
|-- visualize_middle.list
|-- visualize_top/
|-- visualize_top.list
"results_siteStats.txt.gz" contains values of the distance statistic and bootstrap support of each site.
"results_siteStats_along_seq.png" shows intensity of recombination (distribution of the distance statistic)
along the genome. Two recombination hot regions will be seen.
"results_siteStats_hist.png" shows a histogram of the distance statistic.
"results_siteStats_summary.pos.txt" contains information of representative sites picked up from
the top/middle/bottom of the distribution.
Other log files are produced temporarily in a current directory, and then moved
to the "log" directory. If the output files are successfully obtained,
there is no need to check these additional log files.
#---------------------------------------------------------
# 5. Usage for imputed data
#---------------------------------------------------------
If you would like to use imputed data, please prepare a file which indicates
position[tab]name of missing strain at the position
For example,
64 strain1
64 strain5
100 strain1
indicates strain1 and strain5 are missing at position 64, and
strain 1 is missing at position 100.
Please specify the file by "-m" option in orderedPainting.sh
The file is used to calculate site-by-site and average copying probability matrices
by masking columns and rows of missing individuals at a site.
#---------------------------------------------------------
# 6. Re-execution
#---------------------------------------------------------
If orderedPainting stopped during its computation, for example
by an occasional problem of your grid engine (such as incomplete
execution of a part of array jobs), please re-execute the command
as previously after fixing the problem.
It should restart computation at an appropriate step and
proceed to the end. If there are incomplete temporary files
in output directories, please remove them before re-execution.
#---------------------------------------------------------
# 7. Disk and memory usage
#---------------------------------------------------------
The disk usage of orderedPainting depends on the number of
SNPs and *squared* number of individuals, and it can produce
a large amount of temporary data from a large input dataset.
For example, 200 genomes with 100,000 SNPs require about
20GB per ordering (forward or reverse).
The temporary disk space is required to store and format
uncompressed output files of ChromoPainter without using
a large memory. The files are compressed during post-processing.
The number of orderings processed simultaneously in the step
can be specified by "-n" option (default is 20).
Depending on limitation of disk in your environment,
please use the "-n" option to specify a smaller number,
although it takes longer time to process all of the orderings.
Meanwhile, the memory usage of orderedPainting depends on
only the number of SNPs but not on that of individuals.
For example, about 360,000 SNPs require about 5.5GB RAM,
which is used to combine results of all orderings and
examine each SNP in terms of intensity of recombination
(the distance statistic and bootstrap support).
The upper limit of memory usage for your grid engine can
be speficied by "MAX_MEMORY" in "lib/env_func.bashrc" above.
#---------------------------------------------------------
# 8. Further information
#---------------------------------------------------------
If orderedPainting doesn't work in your computational cluster,
please send your input files to [email protected]
Koji Yahara will do the computation, remove the input files, and
send outputs to you.
If you find any other problem, please inform to [email protected]