Deduper Pseudocode Feedback #1

Farrisdt · 2023-10-16T19:18:59Z

Using the UMI as the key for the banned dictionary means it can only track one duplicate per UMI. There will be millions of lines, so each UMI will appear many times. In addition, be sure you have enough RAM for a dictionary as large as the one you will need.
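To illustrate the point about keys, here is a minimal sketch of a composite key in place of the bare UMI. The function name `is_duplicate` and the exact fields in the key (UMI, chromosome, strand, position) are my assumptions about what defines a duplicate, not something taken from your pseudocode.

```python
def is_duplicate(seen: set, umi: str, chrom: str, strand: str, position: int) -> bool:
    """Return True if this combination was already seen; otherwise record it.

    Hypothetical helper: the key fields are an assumption about what
    makes two reads duplicates of each other.
    """
    key = (umi, chrom, strand, position)
    if key in seen:
        return True
    seen.add(key)
    return False


seen = set()
first = is_duplicate(seen, "AACGCCAT", "1", "+", 100)         # new combination
repeat = is_duplicate(seen, "AACGCCAT", "1", "+", 100)        # exact repeat
same_umi_new_pos = is_duplicate(seen, "AACGCCAT", "1", "+", 250)
```

Because the key is a tuple, the same UMI can be recorded at many positions, and only a small tuple per unique read is stored rather than the record itself.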

How will you find the 5' start (leftmost) position of a read? How do we know if there is soft clipping? What about reverse strands?
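One possible way to answer these three questions together is sketched below. It assumes the input is SAM, that strand comes from FLAG bit 16 (0x10), and that the 5' position is corrected using the CIGAR string. The function name `five_prime_position` is hypothetical, and the reverse-strand result here is the last reference base covered by the read (some implementations use one past that, which works equally well if applied consistently).

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def five_prime_position(pos: int, cigar: str, flag: int) -> int:
    """Return the soft-clip-corrected 5' position of a read (1-based).

    Hypothetical helper, not part of the reviewed pseudocode.
    """
    # Hard clips (H) only occur at the ends and are dropped here so that
    # the first/last remaining operation is the soft clip, if any.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    reverse = bool(flag & 16)

    if not reverse:
        # Forward strand: 5' end is on the left, so subtract a leading soft clip.
        leading_clip = ops[0][0] if ops and ops[0][1] == "S" else 0
        return pos - leading_clip

    # Reverse strand: 5' end is on the right. Add every operation that
    # consumes the reference (M, D, N, =, X) plus a trailing soft clip.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    trailing_clip = ops[-1][0] if ops and ops[-1][1] == "S" else 0
    return pos + ref_len + trailing_clip - 1


forward_clipped = five_prime_position(100, "2S8M", 0)
reverse_clipped = five_prime_position(100, "2S5M3D3M4S", 16)
```

Insertions (I) are skipped on the reverse strand because they do not consume reference bases, and a leading soft clip on a reverse read is ignored because it sits at the 3' end.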

What is the purpose of the start position dictionary? This will have an entry for almost every line, which will be very memory intensive.

It seems like you are only comparing UMIs after filtering out all reverse strands. You also need to compare start locations. You are using a dictionary again for the UMIs; remember that we are not supposed to be holding records in memory.
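As a rough sketch of what not holding records in memory could look like: read one line at a time, write unique lines immediately, and keep only small keys. This assumes the SAM file is sorted by chromosome so the set of keys can be cleared at each new chromosome; that sorting, the `dedupe_stream` name, and the caller-supplied `key_of` function are all my assumptions.

```python
from typing import Callable, Hashable, Iterable, Iterator


def dedupe_stream(lines: Iterable[str], key_of: Callable[[list], Hashable]) -> Iterator[str]:
    """Yield header lines and the first read seen for each key.

    Assumes input sorted by chromosome (RNAME, SAM column 3), so the
    key set only ever covers one chromosome at a time.
    """
    seen = set()
    current_chrom = None
    for line in lines:
        if line.startswith("@"):
            yield line  # pass header lines through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        chrom = fields[2]
        if chrom != current_chrom:
            seen.clear()  # keys from the previous chromosome are no longer needed
            current_chrom = chrom
        key = key_of(fields)
        if key not in seen:
            seen.add(key)
            yield line


# Placeholder key for illustration only: UMI taken from the end of QNAME,
# strand from FLAG bit 16, and the raw POS (a real key would use the
# soft-clip-corrected position instead).
def example_key(fields: list) -> tuple:
    return (fields[0].split(":")[-1], int(fields[1]) & 16, int(fields[3]))


sample = [
    "@HD\tVN:1.0\n",
    "r1:AACG\t0\t1\t100\n",
    "r2:AACG\t0\t1\t100\n",   # same UMI, strand, and position as r1
    "r3:AACG\t16\t1\t100\n",  # reverse strand, so not a duplicate of r1
    "r4:AACG\t0\t2\t100\n",   # different chromosome
]
kept = list(dedupe_stream(sample, example_key))
```

Reverse-strand reads stay in the output here; they are compared against other reverse-strand reads rather than being filtered out.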

Overall, I think you need to rethink your data structure, and be sure you are accounting for every variable that determines uniqueness. Be sure to also account for reverse strands and not just keep the forward ones. Organising into functions may help. I would also add some more lines to your test file. Be sure you consider every possible way the program could fail and put examples in the file.
