-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Evaluate other assemblers #665
Comments
Here are some more interesting samples to experiment with:
|
At first glance, a good candidate to substitute IVA seemed to be Savage (especially combined with Virus-VG). It was created specifically with the challenges of viral assembly in mind, and takes into account that a virus typically consists of several strains or quasispecies. Savage performs a de novo assembly to create contigs and Virus-VG assembles the contigs into complete genomes, estimating the ratio of the different variants. |
Here's another problem sample: 2929-P14-Y02944-10D-NFLHIVDNA_S8 from the 19-Feb-2021.M05995 run. I ran it twice and got different contigs. Both times assembled an unknown contig with low coverage plus some HIV contigs. The first time produced a single HIV with a huge deletion from the start of gag to partway through GP41, and the second time produced two contigs with an overlapping region that combine to match the first time. Unrepeatable results could be a big problem, so we should try to avoid them. I think we had noticed problems when we first started using IVA, but I added an option to set the random number seed, and I reduced the number of threads to two. It's possible we might have to reduce the threads to one. Please try rerunning the analysis on this sample a few times to see if you can reproduce the problem. |
@CBeelen reran sample 2929-P14-Y02944-10D-NFLHIVDNA_S8 541 times and it assembled two contigs every time. I'm going to review the results in Kive to make sure they really came from the exact same version of the pipeline. If so, I'll rerun the singularity container a bunch of times to see if I can reproduce the problem. |
We have decided to give SPAdes a try as de novo assembler, as it performs well in benchmark tests and is commonly used. The Github branch of MiCall with SPAdes can be found here. I experimented with a few different options for SPAdes ( SPAdes has two output files: contigs.fasta and scaffolds.fasta. They recommend to use the scaffolds as output, so that is what I did (they are occasionally longer and fewer than the contigs). Fragmented assemblies turned out to be a large problem. The output frequently consisted of several contigs that overlapped with identical sequences, but were not merged by SPAdes. Thus, I decided to add a few parts of the IVA pipeline to attempt to merge the contigs that SPAdes outputs. Specifically, I used Interestingly, many of the contigs that SPAdes does not merge but that do get merged by the post-processing have an overlap of exactly 127 bases, the largest kmer size that SPAdes works with. Specifically, the "nicely assembling" samples I investigated were:
And the "difficult assembly" samples were:
In conclusion, it looks like IVA is still performing better than SPAdes in the current configuration. There are two options worth exploring next, apart from SPAdes:
|
The next assembler I have given a try is Abyss. Abyss works very well on the HIV samples I have tested (see above), with comparable results to IVA: long contigs, mostly one or two, spanning the whole genome. However, for the HCV samples, it had more trouble. It assembled a large amount of contigs (>10,000) in its standard mode, but fewer in bloom filter mode. The higher k (kmer size) is set, the fewer contigs were assembled, and the longer they got. I got down to a few hundreds to tens of contigs with k as high as 235. However, this also led to parts of the genome getting lost, especially if they have lower coverage. Next, I will pursue two different strategies in parallel, see To Do list in the first comment: using ntJoin and creating a pessimistic mode for IVA.
|
While creating a pessimistic mode for IVA, I happened to create a microtest example that showcases or triggers IVA's non-deterministic behaviour (this is 2240-HIVHCV-mixtures_S32 on the Abyss branch of Micall). I wanted to create a sample that would cause IVA to struggle - specifically, the sample consists a mixture of reads from two HIV sequences and one HCV sequence (the HIV sequences don't overlap). Half of the reads for each sequence have a mutation at a specific site. IVA then struggles to assemble the contigs beyond this mutation, because the two variants are present on equal levels. When testing the IVA output for this sample, I noticed that it is non-deterministic: when re-running the sample, different numbers of contigs get assembled. The contigs may break at the mutation site, or there can be a misassembly, or a contig may be (partly) missing. |
In a strategy discussion today, @CBeelen and I agreed on trying out a technique that will assemble one contig at a time, use BLAST search to decide if the contig matches any of the reference sequences, and discard any unrecognized contigs, along with any reads that map to them. The hope is that really noisy samples like random primers will stop spending so much effort trying to extend the contigs we don't care about. It's possible we could then put all the contigs back into the mix when we finally map all the reads onto the contigs. As a separate idea, I looked for other de novo assemblers and found Velvet. It sounds like some of the other tools we tried, based on de Bruijn graphs. I added an item for trying Velvet to this issue's task list. |
I just discovered a fix in IVA that isn't in our fork. It now supports samtools versions 1.10 and above. |
Since the last update, I have tried out a few different things: First, I tried to scaffold the assembly results of Abyss using ntJoin. However, this was not very successful: ntJoin was not able to improve the assembly by much. Abyss often outputs multiple contigs covering the same genome region, probably because there was no clear resolution for mixtures / variants. ntJoin was unable to resolve most of these issues, probably because it relies on the same method of assembly (de Bruijn graphs) as Abyss itself. Next, I created a pessimistic mode for IVA which only tries to extend one contain at a time. The idea was that this would prevent IVA from trying to extend the old contigs again and again, often without success in case it hit a mixture of approximately 50:50 ratio. This does work, but it does not save a lot of time. I found out (using a runtime analysis) that IVA spends most of its time mapping all the reads to the existing contig(s) - it uses a filtered set of reads that don't map to any of the old contigs most of the time, but maps all of the reads to the contig(s) every once in a while. Especially for the random primer samples, this mapping step takes a long time. In the mean time, however, I also tried out Velvet, which happily surprised us by being very successful at assembling our samples. The main advantages are that it assembles very quickly (on the order of a few seconds to minutes) and is well-documented, with good explanations for the parameters and an extensive manual. |
I saw mention of another assembler that sounds interesting: Haploflow. |
I am currently experimenting with Haploflow, which is similarly fast to Velvet and produces promising results. Here are some interesting resources for further reading on Haploflow: the publication and the Github repo containing the scripts for the supplementary material. The latter also contains a script for scaffolding the assembly given a reference sequence (which could also be the consensus sequence of the assembly), which I will try with our data as well. |
Trying to automate Velvet's input parameters was not very successful: the idea was to attempt an initial assembly and determine the coverage cutoff as well as the expected coverage based on the coverage histogram of the resulting contigs, weighted by their length. However, the coverage histograms did not really show a clear dip in the coverage between short, low-coverage and long, high-coverage contigs (which is where the cutoff should be placed). Therefore, it was impossible to determine the cutoff automatically from coverage histograms. |
Also, for future reference: another potentially interesting assembler is Megahit, which appears to be very fast and does not require much memory. It might also struggle with multiple strains though, as its strategy is to terminate a contig at an ambiguous point. |
Using Haploflow in combination with IVA's functions for merging contigs post-assembly yields some pretty good results. For most non-RP samples, the results look similarly good as IVA's results, except for being sometimes distributed in multiple contigs. I tried scaffolding the assembly and picked RagTag. It has different modes for scaffolding (i.e. ordering the contigs and filling gaps with "N") and patching (i.e. ordering the contigs and filling gaps with reference genome). However, it was not straightforward to get it to actually improve the assemblies, and it seems like it would require some more careful setting of the different parameters to get it to do something reliably. I am leaving it on the To Do list for now, because I still think that it would be interesting to invest some more time in. For the random primer samples, I implemented a mode which runs a first attempt at assembly and then separates the reads according to their usefulness. Reads that mapped to contigs that are a blast match to the references are considered useful, while reads that map to contigs that were no blast matches to the references are discarded. The reads that did not map to anything are kept as useful as well, because they might be necessary to fill gaps between contigs. Then, a second attempt at assembly is performed with only the useful reads. Other than that, we now want to do a more quantitative comparison of IVA and Haploflow. For that, I will do a more quantitative runtime analysis, and a quantitative comparison of the assembly results. I will look into using Quast to compare the quality of the assemblies. |
We've had reasonably good results with IVA, but it has a few problems:
A brief search for alternatives turned up SPAdes, as mentioned in an article by Sutton, as well as tadpole, mentioned in a discussion forum.
An interesting sample for experimenting on is D62201-HCV_S3 from the 26 Feb 2016.M04401 run. It looks like a mixed HCV infection with two or possibly three strains to assemble.
Edit: To Do List
The text was updated successfully, but these errors were encountered: