Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Represent U in SAM rather than T for reads from RNA #801

Open
Psy-Fer opened this issue Oct 31, 2024 · 12 comments
Open

Represent U in SAM rather than T for reads from RNA #801

Psy-Fer opened this issue Oct 31, 2024 · 12 comments
Labels

Comments

@Psy-Fer
Copy link

Psy-Fer commented Oct 31, 2024

Before this change samtools/htslib#1854 U was changed to N when read by samtools

Now it will be changed to T

However, I think it would be "better" if we could preserve U in SAM, even when moving SAM->BAM->CRAM->SAM for example.

There is a problem, however, that there is no room in the 4bits BAM uses to represent all 16 IUPAC bases (where T is for T and U).

A solution to this raised by @jmarshall could be to allocate a FLAG bit to indicate an alignment record is RNA, which would then mean the T coming from a BAM, would be written as a U when viewed in SAM.

This would also mean most tools would still work, while building for the future of RNA sequencing methods to represent the base that is actually being measured.

Another solution (though more ad-hoc and less "good") would be to make yet another sam tag, to denote the read is from RNA. This saves using a FLAG bit, but adds more complexity to the solution.

Cheers,
James

@jkbonfield
Copy link
Contributor

This is possible, but it's something to raise at hts-specs. Htslib is just one tool implementing the widely agreed specification standards, and it won't help the community to fork this. If we add an extra FLAG bit then it needs adding to the specification and then all implementations need to agree and implement it. Hence this wouldn't be a quick fix.

A private space SAM tag wouldn't break the specification at least, but it does still introduce different behaviour in different tools so should really get discussed in the GA4GH meetings too. Can you please raise an issue there regarding how to indicate RNA vs DNA?

An alternative would simply be to add a header field so the user is aware that this is an RNA library and all Ts are Us. This feels like a safer option that doesn't risk breaking downstream processing tools. It could naturally fit into the @RG header with minimal impact beyond that. Software could then decide whether to apply the T->U translation based on that header.

@jkbonfield
Copy link
Contributor

Ah I see #800 already, which is related. I should have finished reading all my emails. :)

@jkbonfield
Copy link
Contributor

jkbonfield commented Oct 31, 2024

Good point - I forgot I can just transfer things. :)

Edit: now transferred.

@jkbonfield jkbonfield transferred this issue from samtools/samtools Oct 31, 2024
@jmarshall
Copy link
Member

The context for this issue is this thread. On brief consideration, it seems to me that a header field indicator would not work brilliantly when merging a U file and a T file — not that anyone should be merging DNA and RNA BAMs perhaps! Hence the suggestion to indicate this per-read.

Since @Psy-Fer mentioned that hybrid cases (reads containing both U and T) would usefully be representable, I'm perhaps leaning more towards a tagged field rather than a flag bit. This could be akin to the methylation tags, e.g. an array tag identifying the T bases that are actually U bases, along with some succinct special value indicating that all of this read's Ts are actually Us.

But the flag bit idea does have an attractive simplicity…

@jkbonfield
Copy link
Contributor

I should also raise CRAM here.

BAM we know only stores data in nibbles and it's simply impossible for it to represent U as U. It must be T and use a side-channel to flag it.

CRAM stores bases as characters, so technically it can store anything there. However, it's also (ideally) referenced based encoding for aligned data. It would be a tragic mistake if for RNA we started storing an edit for each and every U vs T in the genome! Given we need a side-channel mechanism for BAM, I would advocate the same mechanism is applied with CRAM. Ie we present T to the encoder and, if desired, have the decoder translate back on-the-fly.

Note that "if desired" is key. There are many tools out there that cannot accept U. Minimap2 may, but from what I can see bwa doesn't. I imagine this is true of a lot of aligners where they just do some hashing technique and may not have remembered to add U=T into the table. So automatically round tripping and giving the same data back may not always be the solution that is most useful to the user. This also leads us to the side-channel approach, even for SAM.

My personal preference would be for updating the @RG header as I don't personally see the need for a flag given it's a fundamental property of the entire library as far as I can tell. Hypothetically we could have sequencing instruments which can on-the-fly disambiguate U from T. Theoretically ONT could, but I assume you specify it front which DNA / RNA calibration you're using when base calling as it'd considerably improve the accuracy of calls and also remove impossibilities such as calling RNA with a smattering of DNA bases or vice versa.

@jkbonfield
Copy link
Contributor

Since @Psy-Fer mentioned that hybrid cases (reads containing both U and T) would usefully be representable, I'm perhaps leaning more towards a tagged field rather than a flag bit. This could be akin to the methylation tags, e.g. an array tag identifying the T bases that are actually U bases, along with some succinct special value indicating that all of this read's Ts are actually Us.

@Psy-Fer can you please give more information on the hybrid case. How does this arise? Is it biologically relevant, or just an artifact of a calling pipeline?

@Psy-Fer
Copy link
Author

Psy-Fer commented Oct 31, 2024

Hybrid cases are not really an issue, only thinking in terms of possibilities in the future.

Currently dRNA sequencing with ONT for example has a DNA adapter with RNA strand. But the basecaller doesn't basecall the adapter, so you never actually get those sequences. This doesn't mean that would always be the case though, and just something ONT has decided. Anyone that works with ONT will note that anything can change, so I was just pointing it out as something to keep in mind after it was raised by David Eccles on mastodon. There are also other companies that are coming out with tech that can sequence RNA directly, so I'm thinking getting this kind of thing sorted now is in everyone's best interest before weird ad-hoc solutions start getting created.

As for a not on the @rg tag, you can have multiple RGs, and when mergin, you will get the RGs from both files, and each read links to its own unique RG (if the names are unique, which they hopefully are). So yes, this could potentially be a nice way of doing it.

@jkbonfield
Copy link
Contributor

jkbonfield commented Oct 31, 2024

I'm not sure practically speaking it matters much about library construction here for SAM/BAM et al. The adapters would get trimmed away anyway pre alignment. If we consider an unaligned file then there needs to be aditional meta-data stored already to indicate the number of bases to trim etc, so that can be used too to indicate the adapter is DNA if needed.

Normally in Illumina world we move data from the sequence (eg barcodes) to aux tags prior to storing in unaligned BAM, so I would expect a similar thing to happen here, meaning the SAM field is only still the template that the user wanted sequencing. Pragmatically speaking, that's the way to go to leverage all the existing tools as aligners etc no longer need to be aware of platform specific adapters as SEQ is pre-trimmed for them.

Edit: there are some technologies where we circularise and do a long read across the junction, so SEQ -> adapter -> SEQ, with the adapter aware aligner then recognising this. I guess in theory such data could then go from RNA to DNA to RNA, but pragmatically I doubt it matters much as far as analysis goes given aligners treat both T and U the same anyway. Also, such library techniques are doing this because they don't know where the junction is and it's inferred later on, so they're not going to be writing out a mix of U and T as they don't know where one part finishes and the next starts.

@jmarshall
Copy link
Member

Yes, header-wise I was imagining something on the @HD line. But an attribute on the @RG lines makes more sense and side-steps the merging problem very nicely. 👍

@michaelmhoffman
Copy link
Contributor

Keep in mind that epigenetic sequencing assays often do things like convert certain DNA nucleosides to deoxyuridine. The best-known of these is bisulfite sequencing. Now bisulfite sequencing is usually followed by PCR which changes all the deoxyuracils into deoxythymines but with the single-molecule sequencing techniques around these days, I wouldn't want to assume that you're never going to want to distinguish between thymine and uracil in the same read.

@jkbonfield
Copy link
Contributor

That's a good point.

However fundamentally it breaks BAM. BAM simply cannot encode U. So much goes via BAM (including htslib's core in-memory format due to where it came from), that not treating U as T is tantamount to simply treating U as N. Allowing SAM to have greater specificity than BAM was, err, a "courageous decision". Or rather it was likely simply a sloppy translation from Apple Pages format to LaTeX. (I think @jmarshall commented on this during the last meeting.)

So my take on this is we treat U as T in the SEQ column, and if it matters for our particular assay then we add base modification tags to those Ts to describe them as U. That's about the only backwards compatible approach I can think of that captures the information. It's doable right now via the existing base modification tags and a CHEBI code, but maybe we should make an exception here and add a short code for it given it's a fundamentally well understood and accepted letter.

There is a secondary topic of what should decoders do. If they see a record with the PG field indicating it's RNA, then in theory they should decode all the T bases as U again. However that may well cause breakage in other software, so I think it should be permitted behaviour to not do that and let the user decide.

@michaelmhoffman
Copy link
Contributor

The base modification treatment is a good idea. I would also support a U short code.

In short, I don't think you need to make any effort to enable distinguishing U and T in the same read now, but you may want to during the life of these standards. A Lindy effect estimate might be ~15 years from now, or 2039. So I would avoid baking in further assumptions that this won't happen. (Unless doing so provides great benefit!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants