Most pseudogenes have no read alignment from Paraphase output #26
Comments
Hi Min, Paraphase is designed to consider PMS2 and PMS2CL jointly, because there is no misalignment between PMS2 and other pseudogenes. If you see misalignments to other pseudogenes, could you give an example? Thanks,
Hi Xiao, Thank you for responding to me. May I ask how you identify misalignments, please? Shown below are read alignments to the PMS2P2 and PMS2P5 genes. The alignments are sorted by mapping quality and shaded by mapping quality (high). In each figure, the top panel is the pbmm2 alignment and the bottom panel is the Paraphase alignment. We can see that pbmm2 mapped reads to the PMS2P2 and PMS2P5 genes with low mapping qualities. Reads mapping to the PMS2P2 gene have a lower mapping quality than those mapping to the PMS2P5 gene. Reads mapping downstream of the PMS2P5 gene, especially those spanning the deletion, have a mapping quality of 60. Would you consider reads with lower mapping qualities to be misaligned? Perhaps those reads could have been realigned to other PMS2 pseudogenes? Many thanks,
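For anyone who wants to put numbers on what the IGV shading shows, here is a minimal sketch that counts low- and high-MAPQ reads overlapping a region from SAM-formatted text (for example, the output of `samtools view`). The function names, the `low_mapq=20` cutoff, and the coordinates in the usage note are assumptions for illustration only; they are not taken from pbmm2 or Paraphase.

```python
import re
from collections import Counter

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def reference_span(pos, cigar):
    """Convert a 1-based SAM POS and CIGAR into a 0-based half-open [start, end)."""
    length = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)
    return pos - 1, pos - 1 + length


def mapq_summary(sam_lines, chrom, start, end, low_mapq=20):
    """Count reads overlapping chrom:[start, end) as 'low' or 'high' MAPQ.

    low_mapq is an arbitrary illustrative cutoff, not a recommended value.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag = int(fields[1])
        rname, pos, mapq, cigar = fields[2], int(fields[3]), int(fields[4]), fields[5]
        if flag & 0x4 or rname != chrom or cigar == "*":  # unmapped or other contig
            continue
        aln_start, aln_end = reference_span(pos, cigar)
        if aln_start < end and aln_end > start:  # intervals overlap
            counts["low" if mapq < low_mapq else "high"] += 1
    return counts
```

Calling `mapq_summary(lines, "chr7", region_start, region_end)` once on the pbmm2 records and once on the Paraphase records for the same region would give a side-by-side count, where `region_start` and `region_end` are placeholders for the pseudogene coordinates of interest.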
Hi Min, Yes, the low MAPQs reflect that there are mapping issues. This is because those pseudogenes (those named PMS2P#) have high sequence similarity to each other. This is a separate problem from PMS2-PMS2CL, as PMS2-PMS2CL and PMS2P# are very different in sequence. Paraphase is centered on genes so far, so we haven't included those pseudogene-only families. Are you interested in studying these pseudogenes even when they are not homologous to PMS2? Thanks,
Hi Xiao, Could you take a look at Table 2 in this publication, please? https://www.mdpi.com/1422-0067/24/2/1398 Many thanks,
Hi Min, A HiFi read is 100 times longer than a short read, so it provides much more information in alignment. Therefore, a region that has alignment problems due to sequence homology in short-read data may not have any alignment problem in long-read data. If you align PMS2 to the entire genome, aside from matches to PMS2CL, the remaining matches are all shorter than 4 kb at a sequence similarity of 91% or lower. These are different enough and short enough that they would not create any alignment issues for HiFi reads. If you do see any misalignment between PMS2 and PMS2P# genes, please share them here and I'd be happy to look into it. Thanks,
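The length and similarity criterion described above can be sketched as a simple filter over homology hits. This is only an illustration of the reasoning, not how Paraphase chooses its regions: the `HomologyHit` type, the function name, and the conservative rule of flagging a hit that fails either condition are assumptions, and the two default thresholds simply restate the 4 kb and 91% figures from the comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HomologyHit:
    name: str         # label for the matching region (hypothetical)
    length_bp: int    # length of the match in base pairs
    identity: float   # fraction of identical bases, between 0.0 and 1.0


def needs_joint_analysis(hit, max_safe_length_bp=4000, max_safe_identity=0.91):
    """Flag a hit unless it is both shorter than 4 kb and at most 91% identical.

    Assumed rule of thumb: a hit is treated as harmless for HiFi alignment
    only when it is short AND dissimilar; anything else is flagged for review.
    """
    is_short = hit.length_bp < max_safe_length_bp
    is_dissimilar = hit.identity <= max_safe_identity
    return not (is_short and is_dissimilar)


def flag_hits(hits):
    """Return the names of hits that may need to be considered jointly."""
    return [hit.name for hit in hits if needs_joint_analysis(hit)]
```

With made-up inputs such as `HomologyHit("hit_A", 3000, 0.90)` and `HomologyHit("hit_B", 20000, 0.98)`, only `hit_B` would be flagged. Real lengths and identities would have to come from an actual alignment of the gene against the genome.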
Hi Xiao, Thank you for developing Paraphase. It is a great tool, and I would love to see HiFi reads used in more applications. Text is all we have here, although it might not be the best way to communicate because tone and non-verbal cues are lost. If my words came across as picky or harsh, I apologise. I did not mean them to. The IGV screenshots that I showed 5 days ago are HiFi reads aligned to the PMS2P2 and PMS2P5 genes using the default settings of pbmm2. PMS2 and its pseudogenes are called challenging medically relevant genes because of their high sequence similarity to each other. In this case, HiFi reads also had difficulty finding a unique mapping location. I think it would be really helpful if Paraphase could expand its joint consideration from PMS2 and PMS2CL to PMS2 and all its pseudogenes. That is the only reason I submitted this ticket in the first place. Many thanks,
Hi Min, If your goal is to call variants in PMS2, you can use the current setup in Paraphase. It is only necessary to consider PMS2 and PMS2CL. Other pseudogenes (PMS2P#) will not cause alignment problems in PMS2, i.e. no PMS2P# reads will align to PMS2 and no PMS2 reads will align to PMS2P#. This is because the sequence similarity between PMS2 and PMS2P# genes is low enough for HiFi reads to align correctly. If your goal is instead to call variants in those PMS2P# genes, then we would need to add a new region definition in Paraphase so that PMS2P# genes can be considered together as a group (this group does not include PMS2/PMS2CL). This is because PMS2P# genes are highly similar in sequence to each other and there can be misalignments among PMS2P# genes, as seen in the examples you shared. Again, we do not expect misalignments between PMS2 and PMS2P# genes. Please note that all my statements above are based on HiFi data. It's a different story for short reads. Thanks,
Hello,
Thank you for developing the tool. :)
It is stated that:
Although long reads have trouble aligning to PMS2 pseudogenes other than PMS2CL, in the Paraphase output no reads align to most PMS2 pseudogenes, except PMS2CL.
Is this because Paraphase was designed to consider PMS2 and PMS2CL jointly, but not jointly with the other pseudogenes?
Many thanks,
Min