Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sam: Consider recommending using 0 for mapping quality when record is unmapped #727

Open
zaeleus opened this issue Jun 8, 2023 · 3 comments · May be fixed by #753
Open

sam: Consider recommending using 0 for mapping quality when record is unmapped #727

zaeleus opened this issue Jun 8, 2023 · 3 comments · May be fixed by #753
Assignees
Labels

Comments

@zaeleus
Copy link

zaeleus commented Jun 8, 2023

As suggested by @jkbonfield in #715 (comment):

as an aside, I note you're also using MAPQ of 255 for "unavailable". Commendable, but my experience is that everyone just uses 0 with unmapped data. I think this is because when FLAG 4 is set the specification states no assumption can be made about MAPQ, so it just feels cleaner to zero it out as all other fields have been.

@jmarshall jmarshall added the sam label Jun 20, 2023
@jmarshall
Copy link
Member

jmarshall commented Jun 20, 2023

I have always thought that the text about “no assumptions can be made about [these fields] in [these circumstances]” is a mistake that will prevent us from defining meanings for those fields in those currently unspecified circumstances, as discussed in e.g. this samtools-devel thread.

However over a decade later, I fear the ship may have sailed on changing this policy.

@jmarshall jmarshall changed the title sam: Recommend using 0 for mapping quality when record is unmapped sam: Consider recommending using 0 for mapping quality when record is unmapped Jun 20, 2023
@daviesrob daviesrob moved this to New items in GA4GH File Formats Jun 20, 2023
@d-cameron
Copy link
Contributor

Although "no assumptions can be made", if one were to be making assumptions, I would interpret a mapq defined over an unmapped read to be the mapping probability of the read being unmapped w.r.t the reference used. That is, reads for which the aligner had candidates and gave up (e.g. bwa when every seed has >500 occurrences in the reference) would be mapq0 (since it has no confidence that it's actually unmapped) and reads which don't match any reference (e.g. contamination, primers), would have non-zero mapq.

@jkbonfield
Copy link
Contributor

@d-cameron I agree that could have made sense, but it's too late now and I don't think it's how aligners work so I expect such values wouldn't be sensible calibrated anyway. Although some proxy based on complexity and length (ie an entropy estimation) could be used as an indicator that it is genuine data not found in this reference (but perhaps is part of this genome, eg from a long insertion).

@jmarshall - also agreed the language about "no meaning" isn't always helpful, but as you say the ship has sailed. However I think it is reasonable to make recommendations (not requirements) on the values when they are essentially NOPs. Eg CIGAR has no meaning for unmapped data, but sanity checkers may well gripe about something with CIGAR 151M and indeed it's been reported as bugs before in aligners, despite it being technically valid SAM. So a footnote recommending the values to use when they have no meaning would be useful and wouldn't be a breaking change. I'm unsure with MAPQ though: CIGAR * is "unavailable" and is used for unmapped data, so arguably MAPQ 255 is also correct here too, however I'm thinking more for the sake of conformity with existing practice.

@jkbonfield jkbonfield moved this from New items to Progressing in GA4GH File Formats Jul 17, 2023
@jkbonfield jkbonfield moved this from Progressing to New items in GA4GH File Formats Jul 17, 2023
@jkbonfield jkbonfield self-assigned this Jul 18, 2023
@jkbonfield jkbonfield moved this from New items to Progressing in GA4GH File Formats Jul 18, 2023
jkbonfield added a commit to jkbonfield/hts-specs that referenced this issue Jan 29, 2024
We already have a recommendation that MAPQ of 255 should not be used.
Expand on this to recommend zero is used when unmapped.

This is purely a recommendation, made for maximum compatibility, and
not a specification requirement.

Fixes samtools#727
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Progressing
Development

Successfully merging a pull request may close this issue.

4 participants