feat: add complete multi-language TN support for FR, ES, DE, ZH, HI, JA #7

Alex-Wengg merged 2 commits into main
Conversation
Achieves full feature and test parity across all 6 non-English languages for text normalization (TN). Each language now has all 10 TN taggers (cardinal, ordinal, decimal, money, date, time, measure, electronic, telephone, whitelist) with complete test coverage.

Changes:
- Add 66 new tagger modules across 6 languages (11 files × 6 languages)
- Implement missing features for all languages:
  * Decimal quantity suffixes (billion/million/thousand equivalents)
  * Money scale suffixes for large amounts
  * Date decade parsing (e.g., "1980s")
- Add 66 new tests across all languages for complete coverage
- Update FFI and Swift bindings to support all 6 languages
- Add multilang_tn_tests.rs for integration testing

Test results: 741 tests passing (up from 675)

Languages:
- FR (French): milliard, million, mille
- ES (Spanish): mil millones, millones, mil
- DE (German): Milliarden, Millionen, Tausend
- ZH (Mandarin): yi (亿), wan (万)
- HI (Hindi): crore, lakh, hazaar
- JA (Japanese): oku (億), man (万)
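The scale words listed for each language map to fixed magnitudes. As a minimal sketch of how a quantity-suffix tagger might pick a scale word (ours, not code from this PR; the function name `scale_word_hi` is hypothetical, and the Hindi scale values follow the description above):

```rust
// Hypothetical helper: pick the largest Hindi scale word that fits `n`,
// returning the word and the leading multiplier. Not the PR's actual code.
fn scale_word_hi(n: u64) -> Option<(&'static str, u64)> {
    const SCALES: [(u64, &'static str); 3] = [
        (10_000_000, "crore"), // 10^7
        (100_000, "lakh"),     // 10^5
        (1_000, "hazaar"),     // 10^3
    ];
    SCALES
        .iter()
        .find(|(value, _)| n >= *value)
        .map(|(value, word)| (*word, n / *value))
}

fn main() {
    assert_eq!(scale_word_hi(30_000_000), Some(("crore", 3)));
    assert_eq!(scale_word_hi(500_000), Some(("lakh", 5)));
    assert_eq!(scale_word_hi(42), None);
    println!("scale-word sketch checks passed");
}
```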
```rust
let sep = if number_part.contains(',') && !number_part.contains('.') {
    ','
} else if number_part.contains('.') {
    '.'
} else {
```
🟡 German cardinal parser treats dot as thousands separator, conflicting with decimal parser
The German cardinal parser (src/tts/de/cardinal.rs:25-27) accepts dots as thousands separators (e.g., "1.000" → "eintausend"), while the German decimal parser (src/tts/de/decimal.rs:36-37) also accepts dots as decimal separators (e.g., "3.14" → "drei komma eins vier"). Because decimal runs before cardinal in the tagger priority order in tn_normalize_lang_de (src/lib.rs:229-234), the decimal parser claims ambiguous inputs like "2.025" (in German, "zweitausendfuenfundzwanzig" with dot as thousands separator) and renders them as "zwei komma null zwei fuenf" instead. The cardinal test at src/tts/de/cardinal.rs:70-73 expects parse("2.025") to return Some("zweitausendfuenfundzwanzig"), but in sentence normalization the decimal tagger (score 73) matches before the cardinal tagger (score 70) for the same input.
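One way to narrow the ambiguity, sketched here as a standalone helper (ours, not code from the PR; `dot_is_thousands_separator` is a hypothetical name): in German, a dot where every following digit group has exactly three digits is far more likely a thousands separator ("2.025" = 2025) than a decimal point ("3.14").

```rust
// Heuristic sketch: return true if every dot-delimited group after the
// first has exactly three digits, i.e. the dots look like thousands
// separators rather than a decimal point.
fn dot_is_thousands_separator(s: &str) -> bool {
    let mut parts = s.split('.');
    match parts.next() {
        // Leading group: 1 to 3 digits.
        Some(p) if !p.is_empty()
            && p.len() <= 3
            && p.chars().all(|c| c.is_ascii_digit()) => {}
        _ => return false,
    }
    let mut saw_group = false;
    for group in parts {
        if group.len() != 3 || !group.chars().all(|c| c.is_ascii_digit()) {
            return false;
        }
        saw_group = true;
    }
    saw_group // at least one dot must be present
}

fn main() {
    assert!(dot_is_thousands_separator("2.025"));      // read as 2025
    assert!(dot_is_thousands_separator("1.000.000"));  // read as 1000000
    assert!(!dot_is_thousands_separator("3.14"));      // genuine decimal
    println!("heuristic checks passed");
}
```

A decimal tagger could consult such a check and decline inputs that match it, letting the cardinal tagger win despite its lower score; truly ambiguous forms like "3.140" would still need a policy decision.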
```rust
if amount_str.contains(sep) && sep != '.' || amount_str.contains('.') {
    let actual_sep = if amount_str.contains(',') { ',' } else { '.' };
```
🟡 Operator precedence bug in money parse_amount condition produces confusing but accidentally correct logic
In src/tts/de/money.rs:175, src/tts/fr/money.rs:179, and src/tts/es/money.rs:178, the condition amount_str.contains(sep) && sep != '.' || amount_str.contains('.') is parsed by Rust as (amount_str.contains(sep) && sep != '.') || amount_str.contains('.') due to && binding tighter than ||. When sep is '.', this evaluates to (true && false) || true = true — which happens to be correct but only by coincidence. The real intent appears to be to always enter the decimal-handling branch when any separator is present. While the current behavior accidentally works, the redundant/misleading logic means sep is computed but never actually used — actual_sep on the next line re-computes the same thing. This is fragile and a maintenance hazard; a future change could easily break it.
```diff
- if amount_str.contains(sep) && sep != '.' || amount_str.contains('.') {
-     let actual_sep = if amount_str.contains(',') { ',' } else { '.' };
+ if amount_str.contains(',') || amount_str.contains('.') {
+     let actual_sep = if amount_str.contains(',') { ',' } else { '.' };
```
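The precedence claim can be checked in isolation. In this sketch (function names are ours, not from the PR), `sep` is derived the same way the surrounding code derives it, i.e. `','` when the string contains a comma and `'.'` otherwise; under that invariant the original and suggested conditions agree:

```rust
// The original condition: `&&` binds tighter than `||`, so this parses as
// (contains(sep) && sep != '.') || contains('.').
fn original(amount_str: &str, sep: char) -> bool {
    amount_str.contains(sep) && sep != '.' || amount_str.contains('.')
}

// The suggested rewrite, which states the intent directly.
fn suggested(amount_str: &str) -> bool {
    amount_str.contains(',') || amount_str.contains('.')
}

fn main() {
    for s in ["12.50", "12,50", "1.234,56", "1250"] {
        // Derive sep the way the surrounding money parser does.
        let sep = if s.contains(',') { ',' } else { '.' };
        assert_eq!(original(s, sep), suggested(s));
    }
    println!("original and suggested conditions agree");
}
```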
```rust
match lang {
    "fr" => tn_normalize_lang_fr(input),
    "es" => tn_normalize_lang_es(input),
    "de" => tn_normalize_lang_de(input),
    "zh" => tn_normalize_lang_zh(input),
    "hi" => tn_normalize_lang_hi(input),
    "ja" => tn_normalize_lang_ja(input),
    _ => tn_normalize(input),
}
```
🟡 tn_normalize_for_lang does not handle "en" explicitly, causing tn_normalize_lang("123", "en") to double-trim
tn_normalize_for_lang at src/lib.rs:123-134 does not match "en" explicitly; it falls through to the _ arm, which calls tn_normalize(input). tn_normalize (src/lib.rs:570) then calls input.trim() on input that tn_normalize_for_lang already trimmed at line 124. The double trim is harmless, but it is asymmetric with tn_normalize_sentence_with_max_span_lang, which matches "en" | "" explicitly at line 692. Both functions ultimately treat "" as English, so behavior agrees, but only one of them documents the empty-string case explicitly.
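A sketch of the symmetric shape (the tagger bodies are stubs for illustration; only the dispatch follows the PR's naming, and the explicit "en" | "" arm is the change being suggested):

```rust
// Stubs standing in for the real normalization pipelines.
fn tn_normalize(input: &str) -> String {
    input.trim().to_string() // stub for the English pipeline
}

fn tn_normalize_lang_fr(input: &str) -> String {
    input.trim().to_string() // stub; other languages elided
}

fn tn_normalize_for_lang(input: &str, lang: &str) -> String {
    let input = input.trim();
    match lang {
        "fr" => tn_normalize_lang_fr(input),
        // ... "es", "de", "zh", "hi", "ja" elided ...
        "en" | "" => tn_normalize(input), // explicit: "" means English
        _ => tn_normalize(input),         // unknown codes fall back to English
    }
}

fn main() {
    assert_eq!(tn_normalize_for_lang("  123  ", "en"), "123");
    assert_eq!(tn_normalize_for_lang("123", ""), "123");
    assert_eq!(tn_normalize_for_lang("123", "xx"), "123");
}
```

Matching "en" | "" in both dispatchers keeps the two entry points symmetric and makes the empty-string convention self-documenting.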
Summary
Achieves full feature and test parity across all 6 non-English languages (French, Spanish, German, Chinese, Hindi, Japanese) for text normalization (TN). Each language now has all 10 TN taggers with complete test coverage.
Changes
New Features Implemented (all 6 languages)
New Files (66 total)
tests/multilang_tn_tests.rs

Language-Specific Scale Words
Test Results
✅ 741 tests passing (up from 675)
Test Count by Language
Verification
All tests pass with no regressions.