Ancestral reconstruction bug fixes #1975

Open
jameshadfield wants to merge 5 commits into master from james/ancestral-bug-fixes

jameshadfield commented Mar 24, 2026

Fixes reconstruction bugs in `augur ancestral`. Most notably, we used the hardcoded 'N' character as the ambiguous state for AA sequences, and thus reported erroneous mutations to N = Asn = Asparagine. See added tests for full details.
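
A minimal sketch of how this bug shows up, assuming a simplified mutation caller. The names `AMBIGUOUS`, `mask` and `mutations` are hypothetical and are not augur's actual code; the point is only that the fill character must depend on the alphabet.

```python
# Hypothetical illustration, not augur's implementation.
# The ambiguous character differs per alphabet: "N" (nuc) vs "X" (aa).
AMBIGUOUS = {"nuc": "N", "aa": "X"}


def mask(seq, positions, fill):
    """Replace the given 0-based positions in seq with the fill character."""
    chars = list(seq)
    for pos in positions:
        chars[pos] = fill
    return "".join(chars)


def mutations(parent, child, alphabet):
    """List substitutions such as 'K2N' (1-based), skipping ambiguous states."""
    ambiguous = AMBIGUOUS[alphabet]
    return [
        f"{a}{i}{b}"
        for i, (a, b) in enumerate(zip(parent, child), start=1)
        if a != b and ambiguous not in (a, b)
    ]


parent = "MKT"
# Masking an unknown AA site with a hardcoded "N" looks like a real change to Asn.
print(mutations(parent, mask(parent, [1], "N"), "aa"))  # ['K2N']
# Masking with the AA ambiguous state "X" reports nothing.
print(mutations(parent, mask(parent, [1], AMBIGUOUS["aa"]), "aa"))  # []
```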

These bugs were noticed during refactoring as part of #1958, but they are not a result of the refactor. I've cherry-picked them into a new PR as they are unrelated to that work.

I tested on a 5.5k H3N2 HA dataset and there were no occurrences of these bugs, implying that they may be rare. (The worst would need ambiguous ("X") states in the translated tip sequences, which nextclade may never produce?)

Relatedly, I've always wanted to check reconstructed translations against what the reconstructed nucleotide sequence would translate to. The final commit indicates mismatches here in an opt-in fashion.

In preparation for allowing protein-only reconstructions
Tests describe bugs relating to our handling of ambiguous states.

The nucleotide bugs are minor, relating to handling of 'X'.

The AA bugs are bigger, as we hardcode "N" as the ambiguous state;
however, N = Asn = Asparagine.
Previously we would reconstruct ~any sequence in TreeTime and then
correct the resulting sequences via the `character_map`. This was
error-prone, both because these corrections were applied inconsistently
and because alphabets were not distinguished correctly (see failing
tests in parent commit).

Here we invert the logic so that we fully correct alignments and
reference sequences before inference. This means all states in the
alignment / ref are valid; ambiguous states such as "R" (nuc) and "Z"
(aa) are valid.
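
A rough sketch of the "correct before inference" idea, under stated assumptions: the valid character sets below are standard IUPAC codes plus a gap (and a stop for AA), and the `sanitize` helper is hypothetical. The sets augur actually uses may differ.

```python
# Hypothetical sketch; the exact valid sets used in augur are an assumption here.
NUC_VALID = set("ACGTRYSWKMBDHVN-")            # IUPAC nucleotides incl. ambiguity codes + gap
AA_VALID = set("ACDEFGHIKLMNPQRSTVWYBZJX*-")   # amino acids, ambiguity codes, stop, gap


def sanitize(seq, alphabet):
    """Replace any character outside the alphabet with that alphabet's ambiguous state."""
    if alphabet == "nuc":
        valid, fill = NUC_VALID, "N"
    elif alphabet == "aa":
        valid, fill = AA_VALID, "X"
    else:
        raise ValueError(f"unknown alphabet: {alphabet!r}")
    return "".join(c if c in valid else fill for c in seq.upper())


# "X" is not a valid nucleotide state, so it becomes "N"; "R" is kept as-is.
print(sanitize("ACXRT", "nuc"))  # ACNRT
# "N" is a real residue (Asn) in proteins and must be kept; "?" becomes "X".
print(sanitize("MKZ?N", "aa"))   # MKZXN
```

Running this over the alignment and reference up front means nothing needs to be patched after inference.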
These arguments apply to both nuc & aa reconstructions.

jameshadfield added and then removed the `priority: high` ("To be resolved before other issues") label on Mar 24, 2026.

There's been a tension between the two ways we handle translations in
augur: the "old" way of inferring ancestral seqs then translating them
vs the "new" way of using nextclade to get translated tip-sequences and
reconstructing nucleotide & aa seqs independently across the tree.

The new way is arguably better, as nextclade has some alignment
heuristics around CDSs to produce better translations, but it opens the
door to mismatches between the inferred nuc seq and the corresponding
AA residue. These differences would be surfaced in Auspice when looking
at branch mutations. This commit adds an optional warning to check for
such mismatches.
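
For illustration only, such a check could look roughly like the sketch below. It assumes the standard genetic code, an in-frame ungapped CDS, and that ambiguous residues are skipped; `translate` and `mismatches` are hypothetical helpers, not the functions added in this PR.

```python
# Hypothetical sketch of a nuc-vs-aa consistency check (standard genetic code).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO)
)


def translate(nuc):
    """Translate an in-frame nucleotide CDS; unknown or ambiguous codons become 'X'."""
    nuc = nuc.upper()
    usable = len(nuc) - len(nuc) % 3
    return "".join(CODON_TABLE.get(nuc[i:i + 3], "X") for i in range(0, usable, 3))


def mismatches(nuc, aa):
    """Return (1-based position, translated residue, reconstructed residue) tuples."""
    return [
        (i, t, r)
        for i, (t, r) in enumerate(zip(translate(nuc), aa), start=1)
        if t != r and "X" not in (t, r)
    ]


# The reconstructed nucleotides encode M-K-T, but the AA reconstruction says M-R-T.
print(mismatches("ATGAAAACT", "MRT"))  # [(2, 'K', 'R')]
```

Skipping positions where either side is "X" is a design choice in this sketch: an ambiguous residue is not evidence of a disagreement.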

A trial n=5.5k H3N2 HA dataset had the following differences:
- SigPep: 2 terminal nodes. Median residue mismatch count: 13 (range: 1 - 13)
- HA1: 2 terminal nodes and 4 internal nodes. Median residue mismatch
  count: 3 (range: 1 - 5)
- HA2: 1 terminal node and 1 internal node. Median residue mismatch
  count: 2 (range: 1 - 2)