Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unusual left shoulder bump in kmer spectra #143

Open
adeoliveira86 opened this issue Oct 10, 2024 · 7 comments
Open

Unusual left shoulder bump in kmer spectra #143

adeoliveira86 opened this issue Oct 10, 2024 · 7 comments

Comments

@adeoliveira86
Copy link

adeoliveira86 commented Oct 10, 2024

Hello,

I work with small marine animals and I have generated five single HiFi libraries from the same species. To avoid issues with the assemblies (heterozigosity, different haplotypes), I decided not to pool the data for sequence reconstruction, but rather, assemble then individually and scaffold and merge the purged primary assemblies later on with RagTag scaffold and merge options (https://github.com/malonge/RagTag). My rationale for doing this was to preserve most of the partially phased blocks in the primary assembly and have a more contiguous assembly. This approach was successful, since the BUSCOs, N50, and merqury assessment (QV, completeness) for the final scaffolded and merged assembly, are better any individual primary draft. The only issue I have been facing is an unexpected left bump on my merqury plot on the final assembly. Following, there are three spectra cn.plots:

  1. Best individual assembly unpurged (using the same HiFi library used in the assembling stage for meryl)
    image

  2. Best individual assembly purged (using the same HiFi library used in the assembling stage for meryl)
    image

  3. Scaffolded and merged purged assembly (using all the five HiFi libraries for meryl)
    image

As can be noted, after the scaffolding step and using all five HiFi libraries in the meryl step, a left shoulder bump with low kmer multiplicity and count was introduced. So ask myself if this is a reflection of the scaffolding step? Or the combination of all the HiFi libraries in the meryl step for merqury assessment?

I tested the first hypothesis by generating merqury plots for every individual scaffolding step before merging:

  1. Reference assembly (my best assembly) vs assembly#1 -> left bump present
  2. Reference assembly (my best assembly) vs assembly#2 -> left bump present
  3. Reference assembly (my best assembly) vs assembly#3 -> left bump present
  4. Reference assembly (my best assembly) vs assembly#4 -> left bump present

and they all have this similar kmer spectra with the bump. After, I tested the hypothesis that the combined HiFi libraries used in meryl was the cause of this issue (maybe it is not an issue?). So I ran merqury with the best individual assembly I have using all the five HiFi libraries for meryl. To my surprise, the merqury plot using the non-scaffolded-merged assembly is quite similar to the scaffolded-merged ones.

not_scaffolded

This shows me that the bump on the left is probably caused by the combined HiFi libraries used in the meryl stage (maybe contamination?). I can see that the amount of unique kmers in the assembly only is quite low, so this is a good thing, indicating there are no major issues with the scaffolded merged version of the genome. The only reason I am sharing this, is because I have never encountered a kmer plot like this in github and publications, and I would like to double check if I am missing something here and properly understanding the results.

I am slowly trying to figure out the issue and updating the thread (maybe it will be useful for someone facing similar issues). Following the issue #90, I used the command lines proposed by @arangrhie:

echo -e "Copies\tkmer_multiplicity\tCount" > test.hist
awk '{print "read\t"$0}' read.hist >> test.hist 
Rscript $MERQURY/plot/plot_spectra_cn.R -f test.hist -o test.hist -t line

and I generated GS and read only plots based on the histogram file (they are below):

read histogram:
test hist ln

genomescope2:
gs_linear_plot

Based on the read and genomescope2 plots, it looks like I have plenty of erroneous kmers generated by the concatenation of all HiFi libraries into a single one for meryl. Most likely by being more agressive and filtering low-frequent kmer will be the solution and to mitigate this left should bump in the plot. I will try this next week and upload the results.

I (still) appreciate any input,
Best!

@adeoliveira86 adeoliveira86 changed the title Bump in kmer spectra after scaffolding step Unusual left shoulder bump in kmer spectra Oct 11, 2024
@arangrhie
Copy link
Contributor

Hello @adeoliveira86,

It looks like there is a 1-copy peak hiding around ~30x in your combined read kmer histogram, and around ~5x in your first plot.
Trying to understand the background - you have 5 different individuals, sequenced on each SMRT cell? Are they genetically related (e.g. siblings)?
And what is the purpose of your assembly? To create a linear reference of this species?

@adeoliveira86
Copy link
Author

Hello @arangrhie,

Thanks for your answer. Sorry, I should have given more context to the problem. To summarise everything, I work with a tiny non-model marine organism and my goal is to generate the most contiguous haploid genome reference possible. The genome, as you could see from the Genomescope2 plot, is relatively big and it is full of repeats (~70%). The main issue is we cannot get a lot of HMW-DNA from it and previous attempts to use ultra-low input PacBio libraries failed badly. The only choice was to use low-input HiFi libraries and avoid the PCR amplification step. However, since we get ~400ng of HMW-DNA per species, I barely reach 13x coverage for assembling.

Because all these limitations, I decided to sequence five distinct individuals from the same species (not siblings) freshly collected from the field, and use hifiasm for assembling. All the five hifiasm drafts have good BUSCO completeness (~95% identified BUSCOS) and QV scores measured by mequry. However, because the high repeat content they are really fragmented (between 2K contigs and N50 ~800kb), so to increase contiguity, I decided to use RagTag for scaffolding and merging. This is the back story that proceeds my thread.

I am happy with the results with the scaffolded merged assembly, but I am stuck at this left should bump in the plot, which is confusing me and making me wonder if there is anything wrong.

@arangrhie
Copy link
Contributor

I see. It looks like the weird bump on the left shoulder is coming from the 1-copy region picked up from one of the samples (your '1st' assembly, for example). It's having a non-typical distribution because some of the kmers are also found in other samples in uneven coverage.
I'd highly recommend to do a read coverage based analysis to ensure the structural accuracy. You could lower your expectation and set to have at least 5x of reads supporting the assembly in the heterozygous region.

However, to be honest with you, I'd encourage you to make one assembly with all the reads - A genome of size ~1.3G is not that big after all :)
Given the kmer profile, it looks like it has a clear 2-copy peak, which means you'll get some extra coverage boost in that region, increasing the chance to make a more contiguous assembly. Your current assembly is lacking a fraction of this region (read-only around 60x), possibly due to the coverage lost, resulting in fragmented contigs.
In addition, you may get the most commonly shared 1-copy region assembled along with the 2-copy region, increasing the chance to get a more contiguous assembly. I believe your QV (and BUSCO) will be improved.

Best,
Arang

@adeoliveira86
Copy link
Author

Thanks for the quick feedback. I will implement your suggestions asap. I will close this thread now, and re-open when I have new results to share and hopefully better results. I really appreciate your time and help. I get back to you soon!

@adeoliveira86
Copy link
Author

Hello @arangrhie,

Based on this specific suggestion:

However, to be honest with you, I'd encourage you to make one assembly with all the reads - A genome of size ~1.3G is not that big after all :)

I ran hifiasm with the combined 5 HiFi libraries obtained from the same species, however, different individuals. The results are not great. The primary genome size is ~3.5x bigger than the expected length due the different haplotig accumulation. After the assembly, I ran multiple purging rounds with purge_dups, even though this procedure seems not to be advisable (see here dfguan/purge_dups#50), but employed in similar publication (https://pmc.ncbi.nlm.nih.gov/articles/PMC10928254/). After three purging rounds, I could not get comparable results with either the hifiasm ran using a single library or the post-scaffolding step that I did (many duplicates BUSCO still and more missing). I guess the best strategy would be still assembly the libraries separately and then scaffold them in the end. Or maybe you have other ideas?

One think that I did not systematically is to assess the structural accuracy my final assembly, as you suggested:

I'd highly recommend to do a read coverage based analysis to ensure the structural accuracy. You could lower your expectation and set to have at least 5x of reads supporting the assembly in the heterozygous region.

I will perform this step now to make sure that the scaffolded-merging step did not join unrelated contigs. Do you think it is advisable to do this step as well with the individual assemblies before the scaffolding step? I know that what is being merged during this step is mainly non-coding repetitive regions, since the BUSCO scores of the individual and merge assembly are quite comparable. I do not have any Hi-C data so doing a self alignment of the genome using mummer and mapping the reads and checking for abnormal coverage islands are my idea of how to further assess the assembly integrity. Would you have any other suggestion?

But still, even after all of this, there is the weird left bump on the merqury plot, that based on your input is a result of non-typical kmer distribution due uneven coverage in the different libraries. I am guessing this is not going away and is a reflection of my assembly pipeline. Is there another way to look into this matter?

This is it for the week and thanks again for your help.

@adeoliveira86 adeoliveira86 reopened this Oct 18, 2024
@arangrhie
Copy link
Contributor

Hello @adeoliveira86,

The primary genome size is ~3.5x bigger than the expected length

This is expected, as you'd have a bag of genomes :) Don't panic!
How was the continuity and QV? I was expecting the continuity to increase compared to your 1-library assemblies and with an increase of QV in overall, especially for the longer contigs. One of the Merqury output has the QV for each contig, sort them by the length of the contig and compare that to your previous best assembly.

If the new one wins, I'd try a self-alignment of the assembly or align it to a known reference (the one you used in RagTag?) using something like MashMap. This allows you to determine which segments were assembled multiple times. (I'd probably use some more stringent filtering options.. like -pi 95 and increase the -s parameter depending on the contig size distribution.) This step can help doing a manual purge-dup of the larger pieces. Purge_dup will try this in an automated fashion, however your bag of genomes breaks a lot of the assumptions (even coverage, diploid, ...). If I remember correctly, one of the output contains the list of contigs that were purged. You could see how they align up to with the rest of the contigs, and decide if you agree / disagree. I'll rely on the remaining contigs after confirming that I'm not loosing much, and purged things that should be purged - and name it 'primary' assembly. Run BUSCO at this step on the primary assembly.

For structural validation, you need the purged contigs, especially the larger pieces (>50k-100k?). Hifiasm has some known issues that it creates contigs as small as a hifi read, which could be problematic. "Discard" the tiny contigs (as aopposed to purge them). Put the remaining purged contigs in a dummy (such as alt.fa), so you can distinguish them from the primary. Try run Merqury on the pri.fa and alt.fa, just as you'd run it for diploid assemblies. A more fine-grained alignment between the pri vs alt, could be done here with Mummer.

The pri.fa spectra-cn plot should look similar to your previous spectra-cn plots (Merqury produces spectra-cn plots for diploid and each fa assembly). Confirm that everything looks right - then I'd proceed with RagTag or whatever scaffolding on the primary assembly.

I will perform this step now to make sure that the scaffolded-merging step did not join unrelated contigs. Do you think it is advisable to do this step as well with the individual assemblies before the scaffolding step? I know that what is being merged during this step is mainly non-coding repetitive regions, since the BUSCO scores of the individual and merge assembly are quite comparable. I do not have any Hi-C data so doing a self alignment of the genome using mummer and mapping the reads and checking for abnormal coverage islands are my idea of how to further assess the assembly integrity. Would you have any other suggestion?

Read-alignment based validation comes after the scaffolding. Include both the pri + alt in the reference before aligning the reads, so your coverage signals don't get mislead. I'd recommend using my coverage analysis script, from winnowmap alignments of the hifi reads.

Feel free to share your spectra-cn plots and other asm-asm alignment plots in case you need more feedback.

But still, even after all of this, there is the weird left bump on the merqury plot, that based on your input is a result of non-typical kmer distribution due uneven coverage in the different libraries. I am guessing this is not going away and is a reflection of my assembly pipeline. Is there another way to look into this matter?

Yes, the shoulder will still show up. I wouldn't worry too much about it. It is a reflection of your 'bag of genomes'. Using the coverage analysis results, you could actually distinguish which contigs / segments are shared in all your 4 genomes vs. in 1, 2, or 3 genomes.

Best,
Arang

@adeoliveira86
Copy link
Author

Hello @arangrhie,

Thanks a lot for the really detailed answer. Concerning the first part of your suggestions:

This is expected, as you'd have a bag of genomes :) Don't panic! How was the continuity and QV? I was expecting the continuity to increase compared to your 1-library assemblies and with an increase of QV in overall, especially for the longer contigs. One of the Merqury output has the QV for each contig, sort them by the length of the contig and compare that to your previous best assembly.

Yes, I panicked a bit to be fair :) However, below are the results. I ran merqury (meryl set to k=31) with the new pooled HiFi libraries and the my current genomic draft, below are the stats:

Current RagTag (completeness)/QV: 87.9287/54.8676
Pooled (Completeness/QV): 98.6972/48.0304

As you can notice the completeness is way better in the Pooled HiFi, but not the overall quality. However, as you suggested I looked in the individual contig qv of the pooled and my current assembly, below you can find density distribution of the qv for the two assemblies:

qv_plot

By looking to these plots I could see that the quality of the pooled is slightly better than the draft I have right now (despite the overall QV being lower (why?). The variability is greater (even with contigs tagged as 0 qv), but I reckon this is expected since this is a unpurged "bag of genomes" which was not processed in any way. I looked into the spectra cs plot from the pooled assembly (below) and it clearly shows me two distinct peaks: one composed of low-multiplicity kmers and a second one around 50-60 which can also be found in my current draft genome.

pooled_li_merged spectra-asm fl

The first peak, based on my interpretation, correspond to unique erroneous sequences that need to be removed from the assembly. Considering the results I just showed, I would say that the new pooled assembly has more completeness and, based on the QV plots I showed, has similar QVs as the current draft. Maybe it is worth further processing the data to check if I can see any improvements, which brought me to the second part of your answer:

f the new one wins, I'd try a self-alignment of the assembly or align it to a known reference (the one you used in RagTag?) using something like MashMap. This allows you to determine which segments were assembled multiple times. (I'd probably use some more stringent filtering options.. like -pi 95 and increase the -s parameter depending on the contig size distribution.) This step can help doing a manual purge-dup of the larger pieces.

I performed the self-alignment of the pooled assembly against itself with the parameters you suggested using MashMap, however, for the next step of the pipeline (manual purging of the data based on the results), I got a bit stuck. I initially wrote a script that looked into the .paf to identify aligned subjects against a query and remove them. However, I was encountering complex scenarios that was making the task quite difficult to deal with (e.g., aligned contig regions within the contigs and not in the ends, partially aligned subjects against queries). Additionally, since the organism I am working has a repeat content of ~70% I am not confident of which aligned blocks (i.e., contigs) to remove and which ones to keep. Based on your answer, using this pooled assembly straight into the purge_dups would break many assumptions of the software and I could already see that even after multiple purging rounds I did not get to the expected genome size. That was one of the main reasons I decided to assemble the HiFi libraries separately and then scaffold then later. Any tips of how to proceed from here?

I will apply your read-alignment based validation method to the current draft that I have and check the results. Thanks a lot for sharing it with me. Meanwhile, I will keep trying to work with the pooled assembly since the initial completeness and QV results look promising.

Best,
André

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants