Best practices question--is "overfitting" to the reference sequence a problem? #82

DaRinker · 2023-04-24T20:02:21Z

I have 20 de novo hybrid genome assemblies (ONT plus Illumina; Flye plus Pilon polishing) of different strains of the same species.
As an initial reference genome, we also have a high-quality (T2T, full chromosome) assembly of a closely related sister species.

My analysis process at the moment is:

1. Ragout refine each of my 20 assemblies vs the T2TReference.
2. Rank my ragout scaffolds from best to worst (including consideration of how much of the initial assembly was unplaced).
3. Iteratively ragout refine each of the remaining 19 assemblies vs the T2TReference PLUS my best assembly's ragout refined scaffolds (i.e. two reference fastas).
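For concreteness, the three steps above could be sketched roughly as follows. This is only an illustration, not a tested pipeline: the `ScaffoldStats` class, the strain names, and the FASTA paths are hypothetical, and the recipe keys (`.references`, `.target`, `<name>.fasta`) reflect my understanding of the Ragout recipe format and should be checked against the Ragout documentation.

```python
from dataclasses import dataclass


@dataclass
class ScaffoldStats:
    """Per-strain summary of a first-pass (T2T-only) Ragout run (hypothetical)."""
    strain: str
    total_bp: int      # total length of the input assembly
    unplaced_bp: int   # bases left outside the chromosome-scale scaffolds

    @property
    def unplaced_fraction(self) -> float:
        return self.unplaced_bp / self.total_bp


def rank_strains(stats):
    """Order strains best-first, i.e. by the smallest unplaced fraction."""
    return sorted(stats, key=lambda s: s.unplaced_fraction)


def make_recipe(target, references):
    """Build recipe text from (name, fasta_path) pairs.

    The key names are an assumption about the Ragout recipe format.
    """
    target_name, target_fasta = target
    lines = [
        ".references = " + ",".join(name for name, _ in references),
        ".target = " + target_name,
        "",
    ]
    lines += [f"{name}.fasta = {path}" for name, path in references]
    lines.append(f"{target_name}.fasta = {target_fasta}")
    return "\n".join(lines) + "\n"
```

A second-pass recipe for one of the remaining strains would then list two references, for example `make_recipe(("strainA", "strainA.fasta"), [("t2t", "t2t.fasta"), ("best", "best_scaffolds.fasta")])`, where all names and paths are placeholders.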

Does this sound reasonable, or might this result in the overfitting of the remaining 19 assemblies to that one, best assembly?

While I don't have reason to think that any of my 20 strains should have massive differences between them, I don't want to obscure any smaller differences by over-favoring the assembly that just happened to have the best quality/coverage.

mikolmogorov · 2023-05-02T16:02:27Z

Hi @DaRinker

I think the simpler strategy of just using the T2T reference for each strain may be sufficient. How much structural variation do you expect between the strains? Otherwise your approach makes sense, but it is definitely worth comparing it against the simple "baseline" approach with one reference!
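One way such a baseline comparison might look, as a minimal sketch: it assumes you have already extracted, for each run, a mapping from contig name to the scaffold it was placed in, and that scaffold names are comparable between runs (e.g. both named after the T2T chromosomes). The function name and the dictionary inputs are hypothetical and not part of Ragout.

```python
def compare_placements(baseline, augmented):
    """Compare contig -> scaffold assignments from two scaffolding runs.

    baseline:  placements from the T2T-only run
    augmented: placements from the run with T2T plus the best in-species assembly
    Returns (moved, only_baseline, only_augmented).
    """
    shared = baseline.keys() & augmented.keys()
    # Contigs placed in both runs but assigned to different scaffolds
    moved = {
        contig: (baseline[contig], augmented[contig])
        for contig in shared
        if baseline[contig] != augmented[contig]
    }
    # Contigs placed by only one of the two runs
    only_baseline = set(baseline) - set(augmented)
    only_augmented = set(augmented) - set(baseline)
    return moved, only_baseline, only_augmented
```

Contigs that change chromosome between the two runs are the ones to inspect for a possible bias toward the second reference, while contigs placed only in the augmented run show what the extra reference buys.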

DaRinker · 2023-05-02T17:10:34Z

Thanks for your input.

The problem with the simplest strategy is that using only the T2T genome for each strain gives very mixed results. Some of the 20 strains scaffold out nicely (i.e. I get 1-to-1 scaffolds for each of the T2T chromosomes) and others do not (I attribute this behavior to the 0.095 substitutions per site distance between the T2T sister species and each of the 20 strain assemblies). This issue can be remedied by adding back in the best of my scaffolded assemblies (so T2T reference plus best "in-species" assembly).

So I've decided to go with this approach moving forward.

mikolmogorov · 2023-05-16T15:11:48Z

I see, so some genomes are quite distant from each other - then I think it's a good strategy!

mikolmogorov added the question label May 2, 2023
