
Find heuristics to pre-filter insertions that we don't want spoa to spend too much time on #33

Open
wdecoster opened this issue Oct 13, 2021 · 4 comments

Comments

@wdecoster
Collaborator

For example: length and/or repetitiveness.

@wdecoster
Collaborator Author

Currently, if any of the insertions is longer than 7500 bp, the expansion is simply discarded.
That is not desirable. A less bad solution would be to pick just one of the insertions and set that as the sequence (which then is not polished into a consensus), but at least it is not lost.
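A minimal sketch of that fallback (hypothetical helper names; the 7500 bp cutoff and the choice of the median-length insertion as the representative are assumptions taken from this discussion, not the project's actual code):

```python
MAX_INSERTION_LEN = 7500  # assumed cutoff from the discussion above


def representative_insertion(insertions):
    """Return the set to hand to spoa, or a single fallback sequence.

    If every insertion is below the cutoff, the full set is returned for
    consensus polishing. Otherwise, instead of discarding the expansion,
    keep the median-length insertion as an unpolished representative.
    """
    if all(len(seq) <= MAX_INSERTION_LEN for seq in insertions):
        return insertions
    by_len = sorted(insertions, key=len)
    return [by_len[len(by_len) // 2]]
```

This loses the polishing step for the affected locus, but the expansion call itself survives.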

@PavelAvdeyev
Collaborator

PavelAvdeyev commented Oct 13, 2021

I agree that the current strategy is bad.

I think we should develop some sort of classic outlier detection algorithm. When I was debugging the script, I observed that many of the insertions have pretty similar lengths. That is, in some sense, expected, and it makes the picture easier than for short reads. So we could potentially calculate the mean length and then disregard reads whose insertions are much longer. We should never disregard shorter ones, since they can be produced from soft-clipped sequence.

Overall, it is a very interesting question, since we use MSA in a later step. If I understand everything correctly, the MSA would, in some sense, always report a consensus sequence of maximal length. So it is crucial to do the filtration based on insertion lengths here. From some perspective, we are doing genotyping at this step; later, we just find the sequence.
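The length-based outlier idea could be sketched like this (a hypothetical helper, not the project's code): compute the mean and standard deviation of the insertion lengths and drop only insertions far above the mean, never the shorter ones.

```python
from statistics import mean, stdev


def filter_long_outliers(insertions, k=3.0):
    """Drop insertions much longer than the typical length.

    Shorter insertions are always kept: they may come from soft-clipped
    sequence and can still carry part of the expansion.
    """
    if len(insertions) < 3:
        return insertions  # too few observations to call outliers
    lengths = [len(s) for s in insertions]
    cutoff = mean(lengths) + k * stdev(lengths)
    return [s for s in insertions if len(s) <= cutoff]
```

A mean/stdev cutoff is itself sensitive to extreme outliers; a median plus MAD (median absolute deviation) would be a more robust variant of the same idea.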

Some additional ideas:

1. Calculate the most common substring for the set of insertions. This gives us a rough estimate of the motif length. After that, we can use the interval [0, mean length + k * rough motif length) to filter out anything too long.

2. Calculate the most common substring for the set of insertions and the mean number of repeat units. Then the tool tries to find parameters (via maximum likelihood), assuming a nanopore error model (if any), that can generate something similar to the observations (the means, or some nice distribution). After that, we disregard everything else.
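The first idea could be sketched roughly as follows. As a simple stand-in for the "most common substring" motif estimate, this uses the minimal exact period of a representative insertion (real data would need an error-tolerant version); the helper names and the choice of the median-length insertion as representative are assumptions.

```python
from statistics import mean


def minimal_period(seq):
    """Smallest p such that seq repeats exactly with period p.

    Exact matching only; sequencing errors would call for a fuzzy variant.
    """
    for p in range(1, len(seq)):
        if all(seq[i] == seq[i - p] for i in range(p, len(seq))):
            return p
    return len(seq)


def length_window_filter(insertions, k=5):
    """Keep insertions with length in [0, mean + k * motif_length)."""
    # use the median-length insertion as the representative for the motif
    rep = sorted(insertions, key=len)[len(insertions) // 2]
    upper = mean(len(s) for s in insertions) + k * minimal_period(rep)
    return [s for s in insertions if len(s) < upper]
```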

@wdecoster
Collaborator Author

What I would like to add to this is that there could be long soft clips that are actually unrelated to the expansion but are just normal sequence, e.g. from a chimeric molecule. Those are rare, but removing them would be a good thing.

Those would probably show up as outliers.

@PavelAvdeyev
Collaborator

PavelAvdeyev commented Oct 13, 2021

@wdecoster I am also thinking that we should parse the MSA more carefully and evaluate the support for each letter. If, for example, some letters are supported by just one or two sequences from the alignment, they are good candidates to be removed.
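A sketch of that per-column support filter, assuming the MSA is available as gapped rows of equal length (as a multiple sequence alignment from spoa would provide); the function name and the support threshold are illustrative:

```python
from collections import Counter


def supported_consensus(msa_rows, min_support=3):
    """Build a consensus from gapped MSA rows, dropping weakly
    supported columns.

    A column contributes its most common non-gap base only when that
    base is seen in at least min_support sequences.
    """
    consensus = []
    for col in zip(*msa_rows):
        counts = Counter(b for b in col if b != "-")
        if not counts:
            continue  # all-gap column
        base, support = counts.most_common(1)[0]
        if support >= min_support:
            consensus.append(base)
    return "".join(consensus)
```

With min_support=1 this degenerates to a plain majority consensus that keeps every insertion column, which is exactly the maximal-length behavior discussed above.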
