[IO] Parsing for genome alignment formats (MAF and HAL) #2903

bkille · 2021-11-25T18:49:41Z

Big fan of the Seqan3 library, thanks for all of your hard work! I have been using the MSA parser recently and it has worked quite well, however, the majority of my work centers around genome alignments. It would be extremely useful if there were parsers for MAF and HAL alignment formats.

marehr · 2021-11-26T14:02:56Z

Hi bkille, thank you for your kind words! :)

Can you give us some links to the file specification and some example files to see what the key features are?
Maybe also some context in which field they are generally used?

Do you just need a parser or also a "writer"?

Can you give us the link to the MSA Parser?

Is there something that gets in your way of using the MSA Parser with seqan3?
Do you have a code snippet / link to your repo that shows the usage of both together and your use case?

bkille · 2021-11-26T18:38:34Z

It's the least I could do!

Of course, sorry if I missed documentation on how to open a feature request/proposal.

Can you give us some links to the file specification and some example files to see what the key features are?

Here is the official documentation for MAF, with additional docs at BioPython and more basic docs from PSU. The tldr is that these file formats represent a set of local multiple sequence alignments from the same set of sequences.

Here is the github page for the tool that implements the hierarchical alignment format (HAL). The publication describing the format is here.

FWIW, MAF is still more commonly used than HAL I believe.

Maybe also some context in which field they are generally used?

They are used to represent multiple genome alignments. In global MSA, the two sequences are expected to be "co-linear", meaning that if there is an alignment between base pair i from sequence 1 and j from sequence 2, any alignment between n and m is not allowed if (n < i && m > j) || (n > i && m < j). Essentially all alignments must be non-crossing. Multiple genome alignment (MGA) do not have that restriction. This is because genomes are often rearranged, so in order to capture the full homology it is necessary to let any part of one genome be related to any part of another.

Genome alignments are often used in comparative genomics studies. Here is a recent and very brief publication that describes the most recent method for genome alignment. Comparative genomics studies can identify patterns of evolution/selection and provide a better basis for constructing phylogenies, among many other uses. They are becoming more and more popular due to the increase and quality and quantity of genome assemblies.

Do you just need a parser or also a "writer"?

Frankly both would be ideal.

Can you give us the link to the MSA Parser?

So I haven't been using an MSA parser from Seqan3, so much as I have been using the fasta reader to construct MSAs. A multiple sequence alignment is essentially a set of equivalence groups, so I read in each fasta file and assign the nucleotides to homology groups based on the column. Here is my file which parses two input MSA files and computes the accuracy. It's not necessarily the most efficient or clean, though 😛

is there something that gets in your way of using the MSA Parser with seqan3?

Frankly, I wasn't aware of an MSA parser in Seqan3 😬

smehringer · 2021-11-29T08:03:54Z

Hi @bkille,

thanks a lot for all your input!

Frankly, I wasn't aware of an MSA parser in Seqan3 😬

To be clear: There is no MSA parser in SeqAn3 😢
I guess @marehr thought you are already using another MSA parser from another library.

Here is my file which parses two input MSA files and computes the accuracy.

From a quick sweep over the code (which I think do looks quite clean) it looks like reading in gapped fasta serves your purpose well.

In your application, what would be the ideal output of an MSA Parser?
The MSA itself as a container of gapped sequnces that you can iterate over column wise?

Unfortunately though, I don't know how soon we can work on this because our group got diminished quite a bit by people finishing their PhD and we need to see how we can distribute our resources.

bkille · 2021-11-30T04:22:22Z

In your application, what would be the ideal output of an MSA Parser?
The MSA itself as a container of gapped sequnces that you can iterate over column wise?

Personally, I don't think Seqan3 really needs a MSA parser. The fasta reader w/ gapped alphabet should satisfy most needs. If anything, a tutorial entry about using type traits to allow reading gapped sequences would be most helpful.

However, the representation of a genome alignment (MAF file) is a more complicated question. I think the Biopython representation is a reasonable start, and don't see any reason why it should be changed. tldr; two levels of iteration: (1) iterate over alignments in the MAF file, (2) iterate over gapped sequences within an alignment.

Unfortunately though, I don't know how soon we can work on this because our group got diminished quite a bit by people finishing their PhD and we need to see how we can distribute our resources.

Ahh, I'm sorry to hear that 🙁. Considering that I will without a doubt be in need of a MAF parser for my PhD, I would be happy to help contribute! After all, if I don't get it from Seqan3, I will have to write it myself.

smehringer · 2021-12-01T07:25:35Z

Ahh, I'm sorry to hear that 🙁. Considering that I will without a doubt be in need of a MAF parser for my PhD, I would be happy to help contribute! After all, if I don't get it from Seqan3, I will have to write it myself.

That's a most generous offer we'd love to take! I'll put it up for discussing a plan in our next team meeting.

What would be your time frame with this project?
Would you want to tackle this immediately or would starting out in the new year be fine?

bkille · 2021-12-01T17:09:48Z

New year is definitely fine, not in a terrible rush.

For some context wrt C++, I worked on implementing this C++ proposal w/ Vincent Reverdy (a member of France's C++ ISO committee).

The requirements for the project required that each group of algorithms in <algorithms> be in a separate file. Some notable files are:
https://github.com/vreverdy/bit-algorithms/blob/master/include/bit_algorithm_details.hpp
https://github.com/bkille/bit-algorithms/blob/rai-overloads/include/shift.hpp
https://github.com/vreverdy/bit-algorithms/blob/master/include/reverse.hpp

bkille added the feature/proposal a new feature or an idea of label Nov 25, 2021

marehr changed the title ~~Parsing for genome alignment formats~~ Parsing for genome alignment formats (MAF and HAL) Nov 27, 2021

eseiler changed the title ~~Parsing for genome alignment formats (MAF and HAL)~~ [IO] Parsing for genome alignment formats (MAF and HAL) Oct 20, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[IO] Parsing for genome alignment formats (MAF and HAL) #2903

[IO] Parsing for genome alignment formats (MAF and HAL) #2903

bkille commented Nov 25, 2021

marehr commented Nov 26, 2021 •

edited

Loading

bkille commented Nov 26, 2021 •

edited

Loading

smehringer commented Nov 29, 2021

bkille commented Nov 30, 2021 •

edited

Loading

smehringer commented Dec 1, 2021

bkille commented Dec 1, 2021 •

edited

Loading

[IO] Parsing for genome alignment formats (MAF and HAL) #2903

[IO] Parsing for genome alignment formats (MAF and HAL) #2903

Comments

bkille commented Nov 25, 2021

marehr commented Nov 26, 2021 • edited Loading

bkille commented Nov 26, 2021 • edited Loading

smehringer commented Nov 29, 2021

bkille commented Nov 30, 2021 • edited Loading

smehringer commented Dec 1, 2021

bkille commented Dec 1, 2021 • edited Loading

marehr commented Nov 26, 2021 •

edited

Loading

bkille commented Nov 26, 2021 •

edited

Loading

bkille commented Nov 30, 2021 •

edited

Loading

bkille commented Dec 1, 2021 •

edited

Loading