Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ValidateSamFile wrong NM tag computation #1963

Open
1 of 2 tasks
albertocasagrande opened this issue May 29, 2024 · 3 comments
Open
1 of 2 tasks

ValidateSamFile wrong NM tag computation #1963

albertocasagrande opened this issue May 29, 2024 · 3 comments
Assignees

Comments

@albertocasagrande
Copy link

albertocasagrande commented May 29, 2024

Bug Report

Affected tool(s)

picard ValidateSameFile

Affected version(s)

  • Latest public release version [3.1.1]
  • Latest development/master branch (not tested)

Description

Calling picard ValidateSamFile with a reference sometimes produces ERROR:INVALID_TAG_NM even though the read alignment on the reference and CIGAR code match.

Steps to reproduce

  1. Download and decompress the reference sequence
curl https://ftp.ensembl.org/pub/grch37/release-111/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.chromosome.22.fa.gz -o chr22.fasta.gz
gunzip chr22.fasta.gz
  1. Create the file test.sam containing the following lines
@HD	VN:1.6	SO:unknown
@SQ	SN:22	LN:51304566	AN:chromosome22,chr22,chromosome_22,chr_22
@RG	ID:S_1_1	SM:S_1_1	PL:ILLUMINA
r00000000001	16	22	42063771	33	55M2D95M	*	0	0	CTTCTAGTGTTCCTCGCTACCCTGCAATTTTAGCATGACCATTTATTTATTTATGTTTGTTTGTTTATTTATTTATTTATGACTACAAAGATCCAGAAGACAAACATGACCATTTCTTTCTTTTTTTTTTTTTTTCTGAGATGGAGTCTT	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	RG:Z:S_1_1	NM:i:2
r00000000002	16	22	50364657	33	121M29X	*	0	0	GTGACAGGGGAGGAGTCTGGAGCTGAGAGGCGAACGGAGAGCACAGTGGAGCACACGGGCCCTGCCCACCCGCCTGTCCTGTCCAAGGATGCTGGGGCCCCGACCAGCCGGTCACAGGCGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNN	!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!	RG:Z:S_1_1	NM:i:29
  1. Validate the sam file with Picard
java -jar picard.jar ValidateSamFile -I test.sam -R chr22.fasta

Expected behavior

The validation should succeed. The first read alignment and its CIGAR correspond. The second read perfectly matches the reference, but the final 29 'N's should be considered mismatches according to the NM specification given in "Sequence Alignment/Map Optional Fields Specification" (see at the bottom of p.3).

Actual behavior

Picard produces the following INVALID_TAG_NM errors:

ERROR::INVALID_TAG_NM:Record 1, Read name r00000000001, NM tag (nucleotide differences) in file [2] does not match reality [3]
ERROR::INVALID_TAG_NM:Record 2, Read name r00000000002, NM tag (nucleotide differences) in file [29] does not match reality [0]
@droazen
Copy link
Contributor

droazen commented Jun 11, 2024

The tool calls SequenceUtil.calculateSamNmTag() in HTSJDK to calculate the "expected" NM tag. That method internally calls a helper method SequenceUtil.countMismatches() with the matchAmbiguousRef argument set false. The matchAmbiguousRef toggle controls whether to do strict base matching, or match subsets of possible bases in the case of ambiguity codes, with N defined as an ambiguity code that matches any base.

It seems to me that this matchAmbiguousRef toggle in SequenceUtil.countMismatches() might be what we need here. We could expose it as an additional argument in a new overload of SequenceUtil.calculateSamNmTag() (and possibly as an argument to the ValidateSamFile tool itself).

Note that although the SAM spec does state that "ambiguity codes in both sequence and reference that match each other, such as ‘N’ in both, or compatible codes such as ‘A’ and ‘R’, are still counted as mismatches", the spec also acknowledges that "historically this has been ill-defined and both data and tools exist that disagree with this definition."

@kockan
Copy link
Contributor

kockan commented Jun 11, 2024

The problem seems to be that if both the reference and read base is N, no matter what matchAmbiguousRef is set to, htsjdk will accept this as a match. It obviously should if matchAmbiguousRef is set to true, but the default is false and the code jumps here: https://github.com/samtools/htsjdk/blob/127f3de750acb6a59fe1ceb5865ebbaee938847f/src/main/java/htsjdk/samtools/util/SequenceUtil.java#L132
where a simple equality check is performed, resulting in a match.

One potential idea (in case this is really the issue) might be to add something like https://github.com/samtools/htsjdk/blob/127f3de750acb6a59fe1ceb5865ebbaee938847f/src/main/java/htsjdk/samtools/util/SequenceUtil.java#L184 (but without the N) to only allow certain bases with matchAmbiguousRef set to false.

@kockan
Copy link
Contributor

kockan commented Jun 13, 2024

This was exactly the issue and I thought I had a fix for it, only to discover @michaelgatzen already has something in place from a while back: samtools/htsjdk#1536
Opinions @droazen @lbergelson ?

@droazen droazen changed the title VaridateSamFile wrong NM tag computation ValidateSamFile wrong NM tag computation Jul 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants