-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unusual left shoulder bump in kmer spectra #143
Comments
Hello @adeoliveira86, It looks like there is a 1-copy peak hiding around ~30x in your combined read kmer histogram, and around ~5x in your first plot. |
Hello @arangrhie, Thanks for your answer. Sorry, I should have given more context to the problem. To summarise everything, I work with a tiny non-model marine organism and my goal is to generate the most contiguous haploid genome reference possible. The genome, as you could see from the Genomescope2 plot, is relatively big and it is full of repeats (~70%). The main issue is we cannot get a lot of HMW-DNA from it and previous attempts to use ultra-low input PacBio libraries failed badly. The only choice was to use low-input HiFi libraries and avoid the PCR amplification step. However, since we get ~400ng of HMW-DNA per species, I barely reach 13x coverage for assembling. Because all these limitations, I decided to sequence five distinct individuals from the same species (not siblings) freshly collected from the field, and use hifiasm for assembling. All the five hifiasm drafts have good BUSCO completeness (~95% identified BUSCOS) and QV scores measured by mequry. However, because the high repeat content they are really fragmented (between 2K contigs and N50 ~800kb), so to increase contiguity, I decided to use RagTag for scaffolding and merging. This is the back story that proceeds my thread. I am happy with the results with the scaffolded merged assembly, but I am stuck at this left should bump in the plot, which is confusing me and making me wonder if there is anything wrong. |
I see. It looks like the weird bump on the left shoulder is coming from the 1-copy region picked up from one of the samples (your '1st' assembly, for example). It's having a non-typical distribution because some of the kmers are also found in other samples in uneven coverage. However, to be honest with you, I'd encourage you to make one assembly with all the reads - A genome of size ~1.3G is not that big after all :) Best, |
Thanks for the quick feedback. I will implement your suggestions asap. I will close this thread now, and re-open when I have new results to share and hopefully better results. I really appreciate your time and help. I get back to you soon! |
Hello @arangrhie, Based on this specific suggestion:
I ran hifiasm with the combined 5 HiFi libraries obtained from the same species, however, different individuals. The results are not great. The primary genome size is ~3.5x bigger than the expected length due the different haplotig accumulation. After the assembly, I ran multiple purging rounds with purge_dups, even though this procedure seems not to be advisable (see here dfguan/purge_dups#50), but employed in similar publication (https://pmc.ncbi.nlm.nih.gov/articles/PMC10928254/). After three purging rounds, I could not get comparable results with either the hifiasm ran using a single library or the post-scaffolding step that I did (many duplicates BUSCO still and more missing). I guess the best strategy would be still assembly the libraries separately and then scaffold them in the end. Or maybe you have other ideas? One think that I did not systematically is to assess the structural accuracy my final assembly, as you suggested:
I will perform this step now to make sure that the scaffolded-merging step did not join unrelated contigs. Do you think it is advisable to do this step as well with the individual assemblies before the scaffolding step? I know that what is being merged during this step is mainly non-coding repetitive regions, since the BUSCO scores of the individual and merge assembly are quite comparable. I do not have any Hi-C data so doing a self alignment of the genome using mummer and mapping the reads and checking for abnormal coverage islands are my idea of how to further assess the assembly integrity. Would you have any other suggestion? But still, even after all of this, there is the weird left bump on the merqury plot, that based on your input is a result of non-typical kmer distribution due uneven coverage in the different libraries. I am guessing this is not going away and is a reflection of my assembly pipeline. Is there another way to look into this matter? This is it for the week and thanks again for your help. |
Hello @adeoliveira86,
This is expected, as you'd have a bag of genomes :) Don't panic! If the new one wins, I'd try a self-alignment of the assembly or align it to a known reference (the one you used in RagTag?) using something like MashMap. This allows you to determine which segments were assembled multiple times. (I'd probably use some more stringent filtering options.. like For structural validation, you need the purged contigs, especially the larger pieces (>50k-100k?). Hifiasm has some known issues that it creates contigs as small as a hifi read, which could be problematic. "Discard" the tiny contigs (as aopposed to purge them). Put the remaining purged contigs in a dummy (such as alt.fa), so you can distinguish them from the primary. Try run Merqury on the pri.fa and alt.fa, just as you'd run it for diploid assemblies. A more fine-grained alignment between the pri vs alt, could be done here with Mummer. The pri.fa spectra-cn plot should look similar to your previous spectra-cn plots (Merqury produces spectra-cn plots for diploid and each fa assembly). Confirm that everything looks right - then I'd proceed with RagTag or whatever scaffolding on the primary assembly.
Read-alignment based validation comes after the scaffolding. Include both the pri + alt in the reference before aligning the reads, so your coverage signals don't get mislead. I'd recommend using my coverage analysis script, from winnowmap alignments of the hifi reads. Feel free to share your spectra-cn plots and other asm-asm alignment plots in case you need more feedback.
Yes, the shoulder will still show up. I wouldn't worry too much about it. It is a reflection of your 'bag of genomes'. Using the coverage analysis results, you could actually distinguish which contigs / segments are shared in all your 4 genomes vs. in 1, 2, or 3 genomes. Best, |
Hello @arangrhie, Thanks a lot for the really detailed answer. Concerning the first part of your suggestions:
Yes, I panicked a bit to be fair :) However, below are the results. I ran merqury (meryl set to k=31) with the new pooled HiFi libraries and the my current genomic draft, below are the stats: Current RagTag (completeness)/QV: 87.9287/54.8676 As you can notice the completeness is way better in the Pooled HiFi, but not the overall quality. However, as you suggested I looked in the individual contig qv of the pooled and my current assembly, below you can find density distribution of the qv for the two assemblies: By looking to these plots I could see that the quality of the pooled is slightly better than the draft I have right now (despite the overall QV being lower (why?). The variability is greater (even with contigs tagged as 0 qv), but I reckon this is expected since this is a unpurged "bag of genomes" which was not processed in any way. I looked into the spectra cs plot from the pooled assembly (below) and it clearly shows me two distinct peaks: one composed of low-multiplicity kmers and a second one around 50-60 which can also be found in my current draft genome. The first peak, based on my interpretation, correspond to unique erroneous sequences that need to be removed from the assembly. Considering the results I just showed, I would say that the new pooled assembly has more completeness and, based on the QV plots I showed, has similar QVs as the current draft. Maybe it is worth further processing the data to check if I can see any improvements, which brought me to the second part of your answer:
I performed the self-alignment of the pooled assembly against itself with the parameters you suggested using MashMap, however, for the next step of the pipeline (manual purging of the data based on the results), I got a bit stuck. I initially wrote a script that looked into the .paf to identify aligned subjects against a query and remove them. However, I was encountering complex scenarios that was making the task quite difficult to deal with (e.g., aligned contig regions within the contigs and not in the ends, partially aligned subjects against queries). Additionally, since the organism I am working has a repeat content of ~70% I am not confident of which aligned blocks (i.e., contigs) to remove and which ones to keep. Based on your answer, using this pooled assembly straight into the purge_dups would break many assumptions of the software and I could already see that even after multiple purging rounds I did not get to the expected genome size. That was one of the main reasons I decided to assemble the HiFi libraries separately and then scaffold then later. Any tips of how to proceed from here? I will apply your read-alignment based validation method to the current draft that I have and check the results. Thanks a lot for sharing it with me. Meanwhile, I will keep trying to work with the pooled assembly since the initial completeness and QV results look promising. Best, |
Hello,
I work with small marine animals and I have generated five single HiFi libraries from the same species. To avoid issues with the assemblies (heterozigosity, different haplotypes), I decided not to pool the data for sequence reconstruction, but rather, assemble then individually and scaffold and merge the purged primary assemblies later on with RagTag scaffold and merge options (https://github.com/malonge/RagTag). My rationale for doing this was to preserve most of the partially phased blocks in the primary assembly and have a more contiguous assembly. This approach was successful, since the BUSCOs, N50, and merqury assessment (QV, completeness) for the final scaffolded and merged assembly, are better any individual primary draft. The only issue I have been facing is an unexpected left bump on my merqury plot on the final assembly. Following, there are three spectra cn.plots:
Best individual assembly unpurged (using the same HiFi library used in the assembling stage for meryl)
Best individual assembly purged (using the same HiFi library used in the assembling stage for meryl)
Scaffolded and merged purged assembly (using all the five HiFi libraries for meryl)
As can be noted, after the scaffolding step and using all five HiFi libraries in the meryl step, a left shoulder bump with low kmer multiplicity and count was introduced. So ask myself if this is a reflection of the scaffolding step? Or the combination of all the HiFi libraries in the meryl step for merqury assessment?
I tested the first hypothesis by generating merqury plots for every individual scaffolding step before merging:
and they all have this similar kmer spectra with the bump. After, I tested the hypothesis that the combined HiFi libraries used in meryl was the cause of this issue (maybe it is not an issue?). So I ran merqury with the best individual assembly I have using all the five HiFi libraries for meryl. To my surprise, the merqury plot using the non-scaffolded-merged assembly is quite similar to the scaffolded-merged ones.
This shows me that the bump on the left is probably caused by the combined HiFi libraries used in the meryl stage (maybe contamination?). I can see that the amount of unique kmers in the assembly only is quite low, so this is a good thing, indicating there are no major issues with the scaffolded merged version of the genome. The only reason I am sharing this, is because I have never encountered a kmer plot like this in github and publications, and I would like to double check if I am missing something here and properly understanding the results.
I am slowly trying to figure out the issue and updating the thread (maybe it will be useful for someone facing similar issues). Following the issue #90, I used the command lines proposed by @arangrhie:
and I generated GS and read only plots based on the histogram file (they are below):
read histogram:
genomescope2:
Based on the read and genomescope2 plots, it looks like I have plenty of erroneous kmers generated by the concatenation of all HiFi libraries into a single one for meryl. Most likely by being more agressive and filtering low-frequent kmer will be the solution and to mitigate this left should bump in the plot. I will try this next week and upload the results.
I (still) appreciate any input,
Best!
The text was updated successfully, but these errors were encountered: