I have 20 de novo hybrid genome assemblies (ONT plus Illumina; Flye plus Pilon polishing) of different strains of the same species.
For an initial genome reference, we also have a high-quality (T2T, full-chromosome) assembly of a closely related sister species.
My analysis process at the moment is:
1. Ragout-refine each of my 20 assemblies against the T2T reference.
2. Rank my Ragout scaffolds from best to worst (including consideration of how much of the initial assembly was left unplaced).
3. Iteratively Ragout-refine each of the remaining 19 assemblies against the T2T reference PLUS my best assembly's Ragout-refined scaffolds (i.e. two reference FASTAs).
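As a rough sketch of the third step, the snippet below writes the text of a Ragout recipe file for one target strain with two references. It assumes the plain recipe layout from the Ragout documentation (`.references`, `.target`, and one `<name>.fasta` line per genome); the genome names (`t2t`, `best`, `strain07`) and all file paths are hypothetical placeholders, not anything from my actual setup.

```python
def make_recipe(target, target_fasta, references):
    """Build Ragout recipe text for one target and its reference genomes.

    target       -- hypothetical genome name for the strain being scaffolded
    target_fasta -- path to that strain's draft assembly
    references   -- dict mapping reference name -> FASTA path (order is kept)
    """
    lines = [
        ".references = " + ",".join(references),
        ".target = " + target,
        "",
    ]
    # One "<name>.fasta = <path>" line per reference, then the target itself.
    for name, fasta in references.items():
        lines.append(f"{name}.fasta = {fasta}")
    lines.append(f"{target}.fasta = {target_fasta}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder names and paths, for illustration only.
    refs = {"t2t": "ref/t2t.fasta", "best": "out/best_scaffolds.fasta"}
    print(make_recipe("strain07", "asm/strain07.fasta", refs))
```

Each recipe written this way would then be passed to Ragout with the `--refine` option, once per remaining strain. Dropping the `best` entry from `references` gives the single-reference recipe used in the first step.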
Does this sound reasonable, or might this result in the overfitting of the remaining 19 assemblies to that one, best assembly?
While I don't have reason to think that any of my 20 strains should have massive differences between them, I don't want to obscure any smaller differences by over-favoring the assembly that just happened to have the best quality/coverage.
I think the simpler strategy of just using the T2T reference for each strain may be sufficient. How much structural variation do you expect between the strains? Otherwise I think your approach makes sense, but it is definitely worth comparing it against the simple "baseline" approach with one reference!
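One way to make that baseline comparison concrete is to compare contig adjacencies between the two scaffoldings of the same strain. The sketch below is only an illustration: it assumes each scaffolding has already been parsed into a dict of scaffold name to an ordered list of `(contig, orientation)` pairs (a hypothetical in-memory format, not a Ragout output format), and it treats a scaffold and its reverse complement as equivalent.

```python
def adjacencies(scaffolds):
    """Return the set of contig-end adjacencies implied by a scaffolding.

    scaffolds -- dict: scaffold name -> list of (contig, "+" or "-")
    Each adjacency is an unordered pair of contig ends, so a scaffold and
    its reverse complement produce the same set.
    """
    adj = set()
    for order in scaffolds.values():
        for (a, sa), (b, sb) in zip(order, order[1:]):
            left = (a, "tail" if sa == "+" else "head")   # end where we leave a
            right = (b, "head" if sb == "+" else "tail")  # end where we enter b
            adj.add(frozenset((left, right)))
    return adj


def compare(baseline, two_ref):
    """Count adjacencies shared by, or unique to, each scaffolding."""
    base, two = adjacencies(baseline), adjacencies(two_ref)
    return {
        "shared": len(base & two),
        "only_baseline": len(base - two),
        "only_two_ref": len(two - base),
    }


if __name__ == "__main__":
    # Toy data: contig names and orders are made up.
    baseline = {"chr1": [("c1", "+"), ("c2", "+"), ("c3", "+")]}
    two_ref = {"chr1": [("c1", "+"), ("c2", "-"), ("c3", "+")]}
    print(compare(baseline, two_ref))
```

Adjacencies that appear only in the two-reference run are the ones worth inspecting by hand, since they are the candidates for having been inherited from the best assembly rather than supported by the strain's own data.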
The problem with the simplest strategy is that using only the T2T genome for each strain gives very mixed results. Some of the 20 strains scaffold nicely (i.e. I get 1-to-1 scaffolds for each of the T2T chromosomes) and others do not; I attribute this to the distance of 0.095 substitutions per site between the T2T sister species and each of the 20 strain assemblies. This can be remedied by adding the best of my scaffolded assemblies as a second reference (so the T2T reference plus the best "in-species" assembly).
So I've decided to go with this approach moving forward.
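For picking the "best" scaffolded assembly, a minimal ranking sketch is below. The criteria are my own assumptions for illustration: higher placed fraction first, then a scaffold count closest to the number of T2T chromosomes. The strain names and numbers in the example are made up.

```python
from dataclasses import dataclass


@dataclass
class ScaffoldStats:
    strain: str
    total_bp: int      # size of the input assembly
    unplaced_bp: int   # bases left out of the scaffolds
    n_scaffolds: int   # number of scaffolds produced

    @property
    def placed_fraction(self):
        return 1 - self.unplaced_bp / self.total_bp


def rank(stats, n_chromosomes):
    """Sort best-first: most sequence placed, then closest to 1-to-1 scaffolds."""
    return sorted(
        stats,
        key=lambda s: (-s.placed_fraction, abs(s.n_scaffolds - n_chromosomes)),
    )


if __name__ == "__main__":
    # Made-up values for three hypothetical strains and 8 chromosomes.
    runs = [
        ScaffoldStats("strainA", 40_000_000, 2_000_000, 8),
        ScaffoldStats("strainB", 40_000_000, 400_000, 11),
        ScaffoldStats("strainC", 40_000_000, 400_000, 8),
    ]
    for s in rank(runs, n_chromosomes=8):
        print(s.strain, round(s.placed_fraction, 3), s.n_scaffolds)
```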