Ancestral reconstruction bug fixes by jameshadfield · Pull Request #1975 · nextstrain/augur

jameshadfield · 2026-03-24T02:36:43Z

Fixes reconstruction bugs in `augur ancestral`, most notably where we used the hardcoded 'N' character as the ambiguous state for AA sequences and thus reported erroneous mutations to N (= Asn = asparagine). See the added tests for full details.
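To illustrate why a hardcoded 'N' is a problem for amino acids, here is a minimal sketch. It is not augur's code: `AMBIGUOUS_STATE`, `mask` and `mutations` are hypothetical names, and the toy sequences are made up.

```python
# Hypothetical sketch, not augur's implementation.
# Assumption: 'N' is the unknown state for nucleotides, 'X' for amino acids.
AMBIGUOUS_STATE = {"nuc": "N", "aa": "X"}


def mask(seq: str, positions: set, alphabet: str) -> str:
    """Replace the given 0-based positions with the alphabet's ambiguous state."""
    fill = AMBIGUOUS_STATE[alphabet]
    return "".join(fill if i in positions else c for i, c in enumerate(seq))


def mutations(parent: str, child: str, alphabet: str) -> list:
    """List 1-based mutations, skipping sites where either state is ambiguous."""
    skip = AMBIGUOUS_STATE[alphabet]
    return [
        f"{p}{i + 1}{c}"
        for i, (p, c) in enumerate(zip(parent, child))
        if p != c and skip not in (p, c)
    ]


parent = "MKTII"
# Bug: an unknown residue at site 3 masked with the nucleotide character 'N',
# which in the AA alphabet means asparagine.
buggy_child = "MKNII"
# Fix: mask with the AA-specific ambiguous state instead.
fixed_child = mask("MKTII", {2}, "aa")

print(mutations(parent, buggy_child, "aa"))  # spurious mutation to Asn
print(mutations(parent, fixed_child, "aa"))  # no mutation reported
```

With the nucleotide character the unknown site is reported as a real substitution to asparagine; with an alphabet-specific state it can be skipped.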

These bugs were noticed during refactoring as part of #1958, but they are not a result of the refactor. I've cherry-picked them into a new PR as they are unrelated to that work.

I tested on an n=5.5k H3N2 HA dataset and there were no occurrences of these bugs, implying that they may be rare. (The worst would need ambiguous ("X") states in the translated tip sequences, which nextclade may never produce?)

Relatedly, I've always wanted to check reconstructed translations against what the reconstructed nucleotide sequence would translate to. The final commit reports such mismatches on an opt-in basis.

Previously we would reconstruct ~any sequence in TreeTime and then correct the resulting sequences via the `character_map`. This was error-prone, both because these corrections were applied inconsistently and because the alphabets were not distinguished correctly (see failing tests in parent commit). Here we invert the logic so that we fully correct alignments and reference sequences before inference. This means all states in the alignment / ref are valid by the time inference runs; ambiguity codes such as "R" (nuc) and "Z" (aa) count as valid.
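The "correct before inference" idea can be sketched as below. This is a rough illustration under stated assumptions, not the code in this PR: `VALID_STATES`, `sanitise` and `sanitise_alignment` are hypothetical names, the valid character sets are the usual IUPAC codes plus gap (and stop for AA), and mapping a nucleotide 'X' to 'N' is an assumption about how such a character would be treated.

```python
# Hypothetical sketch, not the implementation in this PR.
# Assumption: valid states are the IUPAC codes plus '-' (and '*' for AA).
VALID_STATES = {
    "nuc": set("ACGT" "RYSWKMBDHVN" "-"),
    "aa": set("ACDEFGHIKLMNPQRSTVWY" "BZJX" "*-"),
}
# Assumption: the fully ambiguous state differs per alphabet.
AMBIGUOUS_STATE = {"nuc": "N", "aa": "X"}


def sanitise(seq: str, alphabet: str) -> str:
    """Upper-case a sequence and replace any invalid character with the
    alphabet's ambiguous state, so inference only ever sees valid states."""
    valid = VALID_STATES[alphabet]
    fill = AMBIGUOUS_STATE[alphabet]
    return "".join(c if c in valid else fill for c in seq.upper())


def sanitise_alignment(alignment: dict, alphabet: str) -> dict:
    """Apply sanitise() to every sequence in a {name: sequence} mapping."""
    return {name: sanitise(seq, alphabet) for name, seq in alignment.items()}


nuc = sanitise_alignment({"tipA": "acgtX?R", "tipB": "ACGT--N"}, "nuc")
aa = sanitise_alignment({"tipA": "MKN?ZX"}, "aa")
print(nuc)
print(aa)
```

Note that "N" in the AA alignment is left alone (it is asparagine), while the same character in the nucleotide alignment is the ambiguous state. Keeping the two alphabets in separate tables is what avoids the mix-up.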

There's been a tension between the two ways we handle translations in augur: the "old" way of inferring ancestral seqs and then translating them vs the "new" way of using nextclade to get translated tip sequences and reconstructing nucleotide & aa seqs independently across the tree. The new way is arguably better, as nextclade has some alignment heuristics around CDSs to produce better translations, but it opens the door to mismatches between the inferred nuc seq and the corresponding AA residue. These differences would be surfaced in Auspice when looking at branch mutations.

This commit adds an optional warning to check for such mismatches. A trial n=5.5k H3N2 HA dataset had the following differences:

- SigPep: 2 terminal nodes. Median residue mismatch count: 13 (range: 1 - 13)
- HA1: 2 terminal nodes and 4 internal nodes. Median residue mismatch count: 3 (range: 1 - 5)
- HA2: 1 terminal node and 1 internal node. Median residue mismatch count: 2 (range: 1 - 2)
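A mismatch check of this kind could look roughly like the following. It is only a sketch under assumptions, not the check added in this commit: `translate` and `find_mismatches` are hypothetical names, the standard genetic code is used, the CDS is assumed to be a single forward-strand segment already extracted from the genome, and residues that are 'X' on either side are ignored.

```python
from itertools import product

# Standard genetic code, codons ordered by base TCAG at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a CDS codon by codon. Codons containing gaps or ambiguous
    bases become 'X'; a trailing partial codon is dropped."""
    cds = cds.upper()
    end = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, end, 3))


def find_mismatches(nuc_cds: str, aa_seq: str) -> list:
    """Compare the translation of a reconstructed nucleotide CDS with the
    independently reconstructed AA sequence for the same node.

    Returns (1-based residue position, translated residue, reconstructed
    residue) tuples, skipping positions where either residue is 'X'."""
    translated = translate(nuc_cds)
    return [
        (pos + 1, t, a)
        for pos, (t, a) in enumerate(zip(translated, aa_seq))
        if t != a and "X" not in (t, a)
    ]


# Toy example: the nucleotide reconstruction implies K at residue 2,
# but the AA reconstruction says N.
print(find_mismatches("ATGAAATAA", "MN*"))
```

Counting the returned tuples per node and per CDS would give summary figures of the kind listed above.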

jameshadfield added 4 commits March 24, 2026 15:26

[ancestral] refactor, add types

c6fe267

In preparation for allowing protein-only reconstructions

[ancestral] add tests to describe reconstruction bugs

4542a74

Tests describe bugs relating to our handling of ambiguous states. The nucleotide bugs are minor, relating to the handling of 'X'. The AA bugs are bigger, as we hardcode "N" as the ambiguous state even though N = Asn = asparagine.

[ancestral] update argument messaging

3882158

These arguments apply to both nuc & aa reconstructions

jameshadfield added and removed the "priority: high" label (To be resolved before other issues) Mar 24, 2026

jameshadfield wants to merge 5 commits into master from james/ancestral-bug-fixes
