-
Notifications
You must be signed in to change notification settings - Fork 65
/
example.txt
131 lines (83 loc) · 6.36 KB
/
example.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
This example uses real public data from three Shigella sonnei genomes, two of which were
sequenced twice and so allow for reproducibility testing.
As Shigella sonnei is a sublineage of E. coli, we must use the E. coli MLST scheme.
We will also use the resistance genes database provided with srst2, resistance.fasta
(modified from ResFinder).
1. Download the files for the E. coli MLST scheme, using the script provided:
getmlst.py --species "Escherichia coli#1"
For SRST2, remember to check what separator is being used in this allele database
Looks like --mlst_delimiter '-'
>adk-1 --> --> ('adk', '-', '1')
Suggested srst2 command for use with this MLST database:
srst2 --output test --input_pe *.fastq.gz --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --mlst_delimiter '-'
Note, this is correctly guessing that we should use the default --mlst_delimiter '-' with
this database. The log file will tell you exactly what files were downloaded.
Check that the allele sequences have been downloaded and compiled into a single fasta file:
Escherichia_coli#1.fasta
Check that the ST definitions have been downloaded:
ecoli.txt
2. Download Illumina paired end read sets from the ENA. Note if you have limited download
capacity or time to spare, you could download the first two files only (total ~30MB) for
basic testing of srst2.
# strain 20081885 from ERP000182 (14M, 13M; 51M, 44M)
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR024/ERR024070/ERR024070_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR024/ERR024070/ERR024070_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR028/ERR028678/ERR028678_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR028/ERR028678/ERR028678_2.fastq.gz
# strain 20031275 from ERP000182 (89M, 94M; 88M, 74M)
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR024/ERR024082/ERR024082_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR024/ERR024082/ERR024082_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR028/ERR028690/ERR028690_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR028/ERR028690/ERR028690_2.fastq.gz
# strain IB694 from ERP000182 (80M, 64M)
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR024/ERR024619/ERR024619_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR024/ERR024619/ERR024619_2.fastq.gz
3. Try running MLST and gene detection against a tiny read set with low coverage (~3x):
srst2 --input_pe ERR024070*.fastq.gz --output shigella1 --log --save_scores --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta
Note that, as the databases have not been indexed for bowtie2 yet, that srst2 has to run
bowtie2-build which spits out a lot of messages to stdout. This is not a problem!
Bowtie2 also reports its mapping stats, it should indicate that there were only
400,882 reads in this data set, of which 0.02% mapped to our MLST/resistance loci.
This probably won't be enough to get good quality information.
Check the log file (shigella1.log) to see what happened.
Outputs are printed to:
shigella1__mlst__Escherichia_coli#1__results.txt - mlst results only
shigella1__fullgenes__ARGannot__results.txt - resistance results, one line per gene
shigella1__genes__ARGannot__results.txt - resistance results only, tabulated
shigella1__compiledResults.txt - MLST and resistance genes, tabulated
shigella1__ERR024070.Escherichia_coli#1.scores - full score & alignment info for all MLST alleles
shigella1__ERR024070.ARGannot.scores - full score & alignment info for resistance genes with >90% coverage
Because we ran with the --save_scores flag, we have also got a scores file for each database,
which details the score and mapping information for every allele in that database.
The ST was not called correctly because depth was too low (average read depth 2.7; this
is printed in the MLST result table.
2 resistance genes were detected with >90% coverage (the default threshold), which look
like variants of ampC and strB (marked with * to indicate they were not precise matches,
and ? to indicate uncertainty because some bases weren't covered).
This shows us that this level of data (which was considered a failed sequencing run) isn't
enough to get good results. Luckily this genome was resequenced; see what happens when you analyse
that data set.
4. Try running MLST and gene detection against a proper read set, and compile together with
the results from the poor read set for comparison.
srst2 --input_pe ERR028678*.fastq.gz --output shigella2 --log --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta --prev_output shigella1__compiledResults.txt
Check the log file to see what happened.
Outputs are printed to:
shigella2__mlst__Escherichia_coli#1__results.txt - mlst results only
shigella2__fullgenes__ARGannot__results.txt - resistance results, one line per gene
shigella2__genes__ARGannot__results.txt - resistance results only, tabulated
shigella2__compiledResults.txt - MLST and resistance genes, tabulated
This time there is an ST called, which should look like this:
Sample ST adk fumC gyrB icd mdh purA recA mismatches uncertainty depth maxMAF
ERR028678 152? 11? 63 7? 1 14 7 7 0 adk-11/edge2.0;gyrB-7/edge2.0 13.9595714286 0.125
Note that the ST has a question mark (?) after it, indicating that we have some uncertainty about the call.
The table shows this uncertainty comes from two loci, adk and gyrB, which are themselves labelled with question marks.
The 'uncertainty' column indicates that the problem is a drop-off in depth at the edge of these loci, down to 2x, which is the default minimum to report a potential uncertainty (--min_edge_depth 2).
There are also lots of resistance genes identified.
Because we supplied a prior results file, shigella2__compiledResults.txt contains the new
results as well as the old results.
5. Now try running on the other three read sets:
srst2 --input_pe ERR024082*.fastq.gz ERR028690*.fastq.gz ERR024619*.fastq.gz --output shigella3 --log --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta
6. Now compile the results from all 5 read sets:
srst2 --output all --prev_output shigella2__compiledResults.txt shigella3__compiledResults.txt
This outputs a file, all__compiledResults.txt, containing the compilation of all results
for MLST and resistance genes across the 5 strains.