Find heuristics to pre-filter insertions that we don't want spoa to spend too much time on #33
Comments
Currently, if any of the insertions is longer than 7500 bp, the expansion is simply discarded.
I agree that the current strategy is bad. I think we should develop some sort of classic outlier detection algorithm. When I was debugging the script, I observed that many of the insertions have pretty similar lengths. That is, in some sense, expected, and it makes the picture simpler than for short reads. So we could calculate the mean length and then disregard reads that have much longer insertions. We never disregard shorter ones, since they can be produced from soft-clipped sequence. Overall, it is a very interesting question, since we use MSA in the later step. If I understand everything correctly, it would, in some sense, always report a consensus sequence of maximal length. So it is crucial to filter on insertion length here. From some perspective, we are doing genotyping at this step; later, we just find a sequence. Some additional ideas: calculate the most common substring for the set of insertions and the mean of the repeat units. Then the tool tries to find parameters (via maximum likelihood), assuming a nanopore error model (if any), that would generate something similar to the observations (means or some nice distribution). After that, we disregard everything else.
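To make the length-based idea concrete, here is a minimal sketch of a one-sided outlier filter. It is only an illustration, not what the tool does today: it swaps the mean for the median plus the median absolute deviation (MAD), so that a single very long insertion cannot drag the threshold up. The function name `filter_long_outliers` and the parameters `n_mads` and `min_slack` are hypothetical and would need tuning on real data.

```python
from statistics import median


def filter_long_outliers(insertions, n_mads=3.0, min_slack=50):
    """Drop insertions that are much longer than the typical insertion.

    Shorter insertions are always kept, since they can come from
    soft-clipped reads that only partially span the expansion.
    n_mads and min_slack are illustrative defaults, not tuned values.
    """
    if not insertions:
        return []
    lengths = [len(seq) for seq in insertions]
    med = median(lengths)
    # Median absolute deviation: a robust estimate of the length spread.
    mad = median(abs(n - med) for n in lengths)
    # min_slack keeps the cutoff usable when almost all lengths are identical.
    cutoff = med + max(n_mads * mad, min_slack)
    return [seq for seq in insertions if len(seq) <= cutoff]


reads = ["A" * 300, "A" * 310, "A" * 290, "A" * 305, "A" * 100, "A" * 5000]
kept = filter_long_outliers(reads)
print([len(seq) for seq in kept])
```

In this toy input the 5000 bp insertion is dropped while the 100 bp one is kept, which matches the asymmetry described above.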
What I would like to add to this is that there could be long soft clips that are actually unrelated to the expansion and are just normal sequence, e.g. from a chimeric molecule. Those are rare, but removing them would be a good thing. They would probably show up as outliers.
@wdecoster I am also thinking that we should parse the MSA more carefully and evaluate the support for each letter. If, for example, some letters are supported by just one or two sequences in the alignment, they are good candidates for removal.
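A rough sketch of what that per-letter support check could look like is below. It assumes the MSA is available as a list of equal-length row strings with `-` for gaps, which is an assumption about the data layout and not a statement about spoa's output format. The function name `consensus_with_support` and the `min_support` threshold are hypothetical.

```python
from collections import Counter


def consensus_with_support(msa_rows, min_support=3, gap="-"):
    """Build a consensus, skipping columns with too little support.

    msa_rows: aligned sequences of equal length, gaps written as `gap`.
    A column contributes its most common base only if at least
    min_support rows carry that base.
    """
    if not msa_rows:
        return ""
    width = len(msa_rows[0])
    if any(len(row) != width for row in msa_rows):
        raise ValueError("all MSA rows must have the same length")
    consensus = []
    for column in zip(*msa_rows):
        counts = Counter(base for base in column if base != gap)
        if not counts:
            continue  # column contains only gaps
        base, support = counts.most_common(1)[0]
        if support >= min_support:
            consensus.append(base)
    return "".join(consensus)


rows = ["ACG-T", "ACG-T", "ACGAT", "AC--T"]
print(consensus_with_support(rows, min_support=2))
```

Here the fourth column is backed by a single read, so it is left out of the consensus. This would also keep one long read from inflating the consensus length.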
Like length and/or repetitiveness