Fix CRAM embed_ref=2 with seqs overlapping ref end. #1848
Merged
If the sequences align off the end of the reference and we are creating consensus on the fly, then the consensus generated also steps beyond the reference length. Although this longer reference is embedded, it is trimmed back by the CRAM decoder, which validates against the declared reference length in SQ LN, leading to Ns appearing in the decoded output.
Therefore we now validate in the encoder too, which also needed refs_from_header updates to parse the LN tag so the encoder can trim. Note we already overload r->length==0 as an indication that we've not yet parsed the fa/fai file, so we can't just naively fill this in from the SQ LN header. We could hold this information elsewhere via a proper flag and modify all the places that utilise that knowledge, but the simplest (and safest) fix is to have a separate variable used for this one specific case.
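As a rough illustration of the idea (not the actual patch), the trimming could look something like the sketch below. The struct `ref_entry_sketch`, its field `LN_length` and the helper `trim_consensus_end` are hypothetical names made up for this example; only the `length==0` convention comes from the description above.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for a reference entry.
 * This is NOT the real htslib structure. */
typedef struct {
    int64_t length;     /* 0 means the fa/fai file has not been parsed yet */
    int64_t LN_length;  /* assumed separate field: length from @SQ LN, 0 if unknown */
} ref_entry_sketch;

/* Clamp the 1-based inclusive end of an on-the-fly consensus to the
 * length declared in the header, leaving `length` untouched so the
 * "not yet parsed" meaning of length==0 is preserved. */
static int64_t trim_consensus_end(const ref_entry_sketch *r, int64_t cons_end) {
    if (r->LN_length > 0 && cons_end > r->LN_length)
        return r->LN_length;  /* consensus ran past the declared end */
    return cons_end;          /* within bounds, or LN unknown: keep as is */
}
```

Keeping the declared length in its own field means existing code that tests `length==0` does not need to change, which is the trade-off argued for above.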
An example of failure could be seen in: