
Deduper Pseudocode Feedback #1

Open
Farrisdt opened this issue Oct 16, 2023 · 0 comments


@Farrisdt

Using the UMI alone as the key for the banned dictionary means it can only track one duplicate per UMI. There will be millions of lines, so each UMI will appear many times. In addition, be sure you have enough RAM for a dictionary as large as you will need.
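One possible way to address this is to key on everything that defines a duplicate rather than on the UMI alone. The sketch below is only an illustration: the function name `is_duplicate` and the tuple layout `(umi, chrom, strand, five_prime)` are assumptions, not anything from your pseudocode.

```python
def is_duplicate(umi, chrom, strand, five_prime, seen):
    """Return True if this combination was seen before, else record it.

    `seen` is a set of tuples. The tuple layout is an assumption made
    for illustration; use whatever fields define uniqueness for you.
    """
    key = (umi, chrom, strand, five_prime)
    if key in seen:
        return True
    seen.add(key)
    return False


seen = set()
print(is_duplicate("AAAA", "1", "+", 100, seen))  # first sighting
print(is_duplicate("AAAA", "1", "+", 100, seen))  # same key again
print(is_duplicate("AAAA", "1", "+", 200, seen))  # same UMI, new position
```

A set of small tuples stores only the keys, not the records, so the same UMI can appear at any number of positions without overwriting an earlier entry.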

How will you find the leftmost 5' position of the read? How will you know if there is soft clipping? What about reverse strands?
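To make these questions concrete, here is a rough sketch of one way to answer them from the SAM fields. It assumes the strand comes from FLAG bit 0x10 and that the 5' end of a reverse-strand read is the rightmost reference base, counting any trailing soft clip. The names `is_reverse` and `five_prime_position` are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_reverse(flag: int) -> bool:
    """FLAG bit 0x10 is set when the read maps to the reverse strand."""
    return bool(flag & 0x10)


def five_prime_position(pos: int, cigar: str, reverse: bool) -> int:
    """Clip-adjusted 5' position (1-based) of a mapped read.

    Forward strand: POS minus any leading soft clip.
    Reverse strand: POS plus the reference-consuming operations
    (M, D, N, =, X) plus any trailing soft clip, minus 1.
    """
    # Hard clips (H) are dropped so a soft clip next to one is still found.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if not reverse:
        lead = ops[0][0] if ops and ops[0][1] == "S" else 0
        return pos - lead
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    trail = ops[-1][0] if ops and ops[-1][1] == "S" else 0
    return pos + ref_len + trail - 1


print(five_prime_position(100, "3S10M", False))
print(five_prime_position(100, "10M5S", True))
```

Whether to subtract 1 on the reverse strand is a convention choice. Either works as long as every reverse-strand read is treated the same way.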

What is the purpose of the start position dictionary? This will have an entry for almost every line, which will be very memory intensive.

It seems like you are only comparing UMIs after filtering out all reverse strands. You also need to compare start locations. You are using a dictionary again for the UMIs; remember that we are not supposed to be holding records in memory.
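One way to avoid holding records in memory is to stream the file and keep only small keys, clearing them whenever the chromosome changes. The sketch below assumes the SAM file is already sorted by chromosome and that the UMI is the last colon-separated field of QNAME. Both are assumptions, and `dedupe_sorted` and `example_key` are hypothetical names.

```python
def dedupe_sorted(lines, key_of):
    """Yield header lines and the first record seen for each key.

    Assumes records are grouped by chromosome (column 3, RNAME).
    Only keys for the current chromosome are held in memory.
    """
    current_chrom = None
    seen = set()
    for line in lines:
        if line.startswith("@"):
            yield line  # header lines pass straight through
            continue
        fields = line.rstrip("\n").split("\t")
        chrom = fields[2]
        if chrom != current_chrom:
            seen.clear()  # new chromosome: old keys can no longer match
            current_chrom = chrom
        key = key_of(fields)
        if key not in seen:
            seen.add(key)
            yield line


def example_key(fields):
    """Illustrative key: (UMI, strand, raw POS).

    A real key would use the clip-adjusted 5' position instead of
    the raw POS column.
    """
    umi = fields[0].rsplit(":", 1)[-1]
    reverse = bool(int(fields[1]) & 0x10)
    return (umi, reverse, int(fields[3]))


records = [
    "@HD\tVN:1.0\n",
    "r1:AAAA\t0\t1\t100\t36\t10M\n",
    "r2:AAAA\t0\t1\t100\t36\t10M\n",
    "r3:AAAA\t16\t1\t100\t36\t10M\n",
]
for kept in dedupe_sorted(records, example_key):
    print(kept, end="")
```

The set is reset per chromosome rather than per position because, once soft clipping is taken into account, reads with the same 5' position are not guaranteed to be adjacent in a position-sorted file.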

Overall, I think you need to rethink your data structure, and be sure you are accounting for all variables of uniqueness. Be sure to also account for reverse strands and not just keep the forward ones. Organising into functions may help. I also would add some more lines to your test file. Be sure you consider every possible way the program could error and put examples in the file.
