[IO] Parsing for genome alignment formats (MAF and HAL) #2903
Comments
Hi bkille, thank you for your kind words! :) Can you give us some links to the file specification and some example files to see what the key features are? Do you just need a parser or also a "writer"? Can you give us the link to the MSA Parser? Is there something that gets in your way of using the MSA Parser with seqan3?
It's the least I could do! Of course, sorry if I missed documentation on how to open a feature request/proposal.
Here is the official documentation for MAF, with additional docs at BioPython and more basic docs from PSU. The tldr is that these file formats represent a set of local multiple sequence alignments from the same set of sequences. Here is the github page for the tool that implements the hierarchical alignment format (HAL). The publication describing the format is here. FWIW, MAF is still more commonly used than HAL I believe.
They are used to represent multiple genome alignments. In global MSA, the two sequences are expected to be "co-linear", meaning that if there is an alignment between base pair Genome alignments are often used in comparative genomics studies. Here is a recent and very brief publication that describes the most recent method for genome alignment. Comparative genomics studies can identify patterns of evolution/selection and provide a better basis for constructing phylogenies, among many other uses. They are becoming more and more popular due to the increase in the quality and quantity of genome assemblies.
Frankly both would be ideal.
So I haven't been using an MSA parser from Seqan3, so much as I have been using the fasta reader to construct MSAs. A multiple sequence alignment is essentially a set of equivalence groups, so I read in each fasta file and assign the nucleotides to homology groups based on the column. Here is my file which parses two input MSA files and computes the accuracy. It's not necessarily the most efficient or clean, though 😛
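The column-based grouping described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the linked file: the rows are assumed to be gapped sequences already read from FASTA into plain strings, `-` is assumed to be the gap character, and the names `residue` and `column_groups` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper type: (sequence index, ungapped position in that sequence).
using residue = std::pair<std::size_t, std::size_t>;

// Assign every non-gap character to the group of its alignment column.
// Assumes all rows have equal length and '-' marks a gap.
// Columns that contain only gaps produce empty groups.
std::vector<std::vector<residue>> column_groups(std::vector<std::string> const & rows)
{
    if (rows.empty())
        return {};

    std::size_t const width = rows.front().size();
    std::vector<std::vector<residue>> groups(width);

    for (std::size_t seq = 0; seq < rows.size(); ++seq)
    {
        assert(rows[seq].size() == width); // all rows of an MSA share one length

        std::size_t pos = 0; // ungapped coordinate within this sequence
        for (std::size_t col = 0; col < width; ++col)
        {
            if (rows[seq][col] == '-')
                continue; // gaps belong to no homology group
            groups[col].emplace_back(seq, pos++);
        }
    }
    return groups;
}
```

Two alignments of the same sequences can then be compared by checking which residue pairs share a group in both.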
Frankly, I wasn't aware of an MSA parser in Seqan3 😬
Hi @bkille, thanks a lot for all your input!
To be clear: There is no MSA parser in SeqAn3 😢
From a quick sweep over the code (which I think looks quite clean) it looks like reading in gapped fasta serves your purpose well. In your application, what would be the ideal output of an MSA Parser? Unfortunately though, I don't know how soon we can work on this because our group got diminished quite a bit by people finishing their PhD and we need to see how we can distribute our resources.
Personally, I don't think Seqan3 really needs an MSA parser. The fasta reader w/ gapped alphabet should satisfy most needs. If anything, a tutorial entry about using type traits to allow reading gapped sequences would be most helpful. However, the representation of a genome alignment (MAF file) is a more complicated question. I think the Biopython representation is a reasonable start, and don't see any reason why it should be changed. tldr; two levels of iteration: (1) iterate over alignments in the MAF file, (2) iterate over gapped sequences within an alignment.
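To make the two levels of iteration concrete, here is a rough sketch using only the C++ standard library. It is not SeqAn3 API and not the Biopython implementation: `maf_row`, `maf_block` and `read_maf` are hypothetical names, and the sketch assumes the common MAF layout where an `a` line opens an alignment block and each `s` line carries `src start size strand srcSize text`. Other line types (`i`, `e`, `q`) and comments are simply skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record for one "s" line of a MAF block.
struct maf_row
{
    std::string src;        // source sequence name
    std::size_t start{};    // zero-based start in the source
    std::size_t size{};     // number of non-gap characters
    char strand{'+'};       // '+' or '-'
    std::size_t src_size{}; // total length of the source sequence
    std::string text;       // gapped alignment text
};

// Hypothetical record for one alignment block ("a" line plus its "s" lines).
struct maf_block
{
    std::string header; // raw "a" line, e.g. "a score=10.0"
    std::vector<maf_row> rows;
};

// Level 1: collect the alignment blocks. Level 2: collect the rows of each block.
std::vector<maf_block> read_maf(std::istream & in)
{
    std::vector<maf_block> blocks;
    bool in_block = false;
    std::string line;

    while (std::getline(in, line))
    {
        std::istringstream fields{line};
        std::string tag;
        fields >> tag;

        if (tag.empty()) // blank line terminates the current block
        {
            in_block = false;
        }
        else if (tag == "a") // start of a new alignment block
        {
            blocks.push_back(maf_block{line, {}});
            in_block = true;
        }
        else if (tag == "s" && in_block) // one gapped sequence of the block
        {
            assert(!blocks.empty());
            maf_row row;
            if (fields >> row.src >> row.start >> row.size >> row.strand >> row.src_size >> row.text)
                blocks.back().rows.push_back(std::move(row));
        }
        // everything else (comments starting with '#', "i"/"e"/"q" lines) is ignored
    }
    return blocks;
}
```

A caller would then write `for (auto const & block : read_maf(stream))` with a nested `for (auto const & row : block.rows)`. A real parser would more likely be lazy, yielding one block at a time instead of holding the whole file in memory.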
Ahh, I'm sorry to hear that 🙁. Considering that I will without a doubt be in need of a MAF parser for my PhD, I would be happy to help contribute! After all, if I don't get it from Seqan3, I will have to write it myself.
That's a most generous offer we'd love to take! I'll put it up for discussing a plan in our next team meeting. What would be your time frame with this project?
New year is definitely fine, not in a terrible rush. For some context wrt C++, I worked on implementing this C++ proposal w/ Vincent Reverdy (a member of France's C++ ISO committee). The requirements for the project required that each group of algorithms in
Big fan of the Seqan3 library, thanks for all of your hard work! I have been using the MSA parser recently and it has worked quite well, however, the majority of my work centers around genome alignments. It would be extremely useful if there were parsers for MAF and HAL alignment formats.