Best practices question--is "overfitting" to the reference sequence a problem? #82

Open
DaRinker opened this issue Apr 24, 2023 · 3 comments

@DaRinker

I have 20 de novo hybrid genome assemblies (ONT plus Illumina; Flye plus Pilon polishing) of different strains of the same species.
As an initial reference genome, we also have a high-quality (T2T, full-chromosome) assembly of a closely related sister species.

My analysis process at the moment is:

  1. Ragout refine each of my 20 assemblies vs the T2TReference
  2. Rank my ragout scaffolds from best to worst (including consideration of how much of the initial assembly was unplaced)
  3. Iteratively ragout refine each of the remaining 19 assemblies vs the T2TReference PLUS my best assembly's ragout refined scaffolds (i.e. two reference fastas).
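
The one-reference and two-reference setups in steps 1 and 3 differ only in the recipe file handed to Ragout. Below is a minimal Python sketch that writes both variants, assuming the standard recipe keys (`.references`, `.target`, `<name>.fasta`). The genome names (`t2t`, `best`, `strain07`) and all file paths are hypothetical placeholders, not taken from this thread.

```python
def build_recipe(target_name, target_fasta, references):
    """Return the text of a Ragout recipe file.

    references: dict mapping a reference genome name to its FASTA path.
    The names and paths are placeholders chosen by the caller.
    """
    lines = [
        ".references = " + ",".join(references),
        ".target = " + target_name,
        "",
    ]
    # One "<name>.fasta = <path>" entry per reference genome.
    for name, fasta in references.items():
        lines.append(f"{name}.fasta = {fasta}")
    # The assembly to be scaffolded is declared the same way.
    lines.append(f"{target_name}.fasta = {target_fasta}")
    return "\n".join(lines) + "\n"


# Step 1 (baseline): the T2T sister-species assembly is the only reference.
baseline = build_recipe(
    "strain07",
    "assemblies/strain07.fasta",
    {"t2t": "refs/t2t.fasta"},
)

# Step 3: T2T reference plus the scaffolds of the best in-species assembly.
two_ref = build_recipe(
    "strain07",
    "assemblies/strain07.fasta",
    {"t2t": "refs/t2t.fasta", "best": "refs/best_scaffolds.fasta"},
)

print(two_ref)
```

Each recipe would then be written to disk and run with something like `ragout --refine -o <outdir> <recipe>`. Keeping both variants per strain makes it straightforward to compare the two-reference result against the single-reference baseline.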

Does this sound reasonable, or might this result in the overfitting of the remaining 19 assemblies to that one, best assembly?

While I don't have reason to think that any of my 20 strains should have massive differences between them, I don't want to obscure any smaller differences by over-favoring the assembly that just happened to have the best quality/coverage.

@mikolmogorov
Owner

Hi @DaRinker

I think the simpler strategy of just using the T2T reference for each strain may be sufficient. How much structural variation do you expect between the strains? Otherwise, I think your plan makes sense, but it would definitely be worth comparing it against the simple "baseline" approach with one reference!

@DaRinker
Author

DaRinker commented May 2, 2023

Thanks for your input.

The problem with the simplest strategy is that using only the T2T genome for each strain gives very mixed results. Some of the 20 strains scaffold out nicely (i.e. I get 1-to-1 scaffolds for each of the T2T chromosomes) and others do not. I attribute this behavior to the distance of 0.095 substitutions per site between the T2T sister species and each of the 20 strain assemblies. The issue can be remedied by adding back in the best of my scaffolded assemblies (so T2T reference plus best "in-species" assembly).
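
One way to put numbers on "scaffolds out nicely" versus not, and to rank runs as in step 2 of the original plan, is to summarize each Ragout run by its scaffold count and the fraction of sequence left unplaced. The sketch below is only an illustration: the function names, the ranking rule, and the use of scaffold count as a proxy for "1-to-1 with the T2T chromosomes" are all assumptions, and scaffold lengths here include any gap `N`s.

```python
def fasta_lengths(path):
    """Return {sequence_name: length} for a FASTA file."""
    lengths = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def summarize_run(scaffold_lengths, unplaced_lengths, n_chromosomes):
    """Summarize one scaffolding run from two {name: length} dicts."""
    placed = sum(scaffold_lengths.values())
    unplaced = sum(unplaced_lengths.values())
    total = placed + unplaced
    return {
        "n_scaffolds": len(scaffold_lengths),
        # Crude proxy: as many scaffolds as reference chromosomes.
        "one_to_one": len(scaffold_lengths) == n_chromosomes,
        "unplaced_fraction": unplaced / total if total else 0.0,
    }


def rank_runs(summaries):
    """Return strain names best-first: 1-to-1 runs, then least unplaced."""
    return sorted(
        summaries,
        key=lambda s: (not summaries[s]["one_to_one"],
                       summaries[s]["unplaced_fraction"]),
    )


# Toy lengths standing in for fasta_lengths() output on real run files.
summaries = {
    "strainA": summarize_run({"chr1": 900, "chr2": 100}, {}, 2),
    "strainB": summarize_run({"s1": 500, "s2": 300, "s3": 100}, {"u1": 100}, 2),
}
print(rank_runs(summaries))
```

Running the same summary on both the one-reference and two-reference outputs for each strain gives a direct comparison against the baseline, and a chromosome-level alignment would still be needed to confirm that a scaffold truly corresponds to one T2T chromosome.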

So I've decided to go with this approach moving ahead.

@mikolmogorov
Owner

I see, so some genomes are quite distant from each other - then I think it's a good strategy!
