Open
Conversation
In preparation for allowing protein-only reconstructions
Tests describe bugs relating to our handling of ambiguous states. The nucleotide bugs are minor, relating to handling of 'X' The AA bugs are bigger, as we hardcode "N" as the ambiguous state however N = Asn = Asparagine.
Previously we would reconstruct ~any sequence in TreeTime and then correct the resulting sequences via the `character_map`. This was error prone, both due to inconsistent application of these corrections as well as not distinguishing between alphabets correctly (see failing tests in parent commit). Here we invert the logic so that we fully correct alignments and reference sequences before inference. This means all states in the alignment / ref are valid; ambiguous states such as "R" (nuc) and "Z" (aa) are valid.
These arguments apply to both nuc & aa reconstructions
There's been a tension between the two ways we handle translations in augur: the "old" way of inferring ancestral seqs then translating them vs the "new" way of using nextclade to get translated tip-sequences and reconstructing nucleotide & aa seqs independently across the tree. The new way is arguable better, as nextclade has some alignment heuristics around CDSs to produce better translations, but opens the door to having mismatches between the inferred nuc seq and the corresponding AA residue. These differences would be surfaced in Auspice when looking at branch mutations.This commit adds an optional warning to check for such mismatches. A trial n=5.5k H3N2 HA dataset had the following differences: - SigPep: 2 terminal nodes. Median residue mismatch count: 13 (range: 1 - 13) - HA1: 2 terminal nodes and 4 internal nodes. Median residue mismatch count: 3 (range: 1 - 5) - HA2: 1 terminal node and 1 internal node were different. Median residue mismatch count: 2 (range: 1 - 2)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes reconstruction bugs in
augur ancestral, most notably when we would use the hardcoded 'N' character as the ambiguous state for AA sequences, and thus report erroneous mutations to N = Asn = Asparagine. See added tests for full details.These bugs were noticed during refactoring as part of #1958, but they are not a result of the refactor. I've cherry-picked them into a new PR as they are unrelated to that work.
I tested on a 5.5k H3N2 HA dataset and there were no occurrences of these bugs implying that they may be rare. (The worst would need ambiguous ("X") states in the translated tip sequences, which nextclade may never produce??)
Relatedly, I've always wanted to check reconstructed translations against what the reconstructed nucleotide sequence would translate to. The final commit indicates mismatches here in an opt-in fashion.