
feat: add complete multi-language TN support for FR, ES, DE, ZH, HI, JA#7

Merged
Alex-Wengg merged 2 commits into main from feat/multi-language-tn-parity
Mar 12, 2026

Conversation


Alex-Wengg (Member) commented Mar 12, 2026

Summary

Achieves full feature and test parity across all 6 non-English languages (French, Spanish, German, Chinese, Hindi, Japanese) for text normalization (TN). Each language now has all 10 TN taggers with complete test coverage.

Changes

New Features Implemented (all 6 languages)

  • Decimal quantity suffixes: Support for scale words after decimals (e.g., "1.5 billion", "2.3万")
  • Money scale suffixes: Large amount notation (e.g., "¥2.5万", "$50 million")
  • Date decade parsing: Convert "1980s" to spoken form in each language
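The first of these features can be pictured with a toy tokenizer rule. A minimal sketch, assuming a hypothetical `split_decimal_quantity` helper (not the PR's actual API), of how a token like "1.5 billion" might be split so the decimal part and the scale word can be verbalized separately:

```rust
/// Hypothetical helper (not the PR's API): split a token like "1.5 billion"
/// into its decimal part and scale word for separate verbalization.
fn split_decimal_quantity(token: &str) -> Option<(&str, &str)> {
    const SCALES: [&str; 3] = ["thousand", "million", "billion"];
    let (num, scale) = token.split_once(' ')?;
    // Require a decimal number (digits and a dot) followed by a known scale word.
    let is_decimal = num.contains('.')
        && num.chars().all(|c| c.is_ascii_digit() || c == '.');
    if is_decimal && SCALES.contains(&scale) {
        Some((num, scale))
    } else {
        None
    }
}
```

Each language's tagger would use its own scale-word list (see below) in place of the English constants.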

New Files (66 total)

  • 11 tagger modules per language × 6 languages
  • Integration test suite: tests/multilang_tn_tests.rs

Language-Specific Scale Words

  • FR: billiard, milliard, million, mille
  • ES: mil millones, millones, mil
  • DE: Milliarden, Millionen, Tausend
  • ZH: 亿 (yi), 万 (wan)
  • HI: crore, lakh, hazaar
  • JA: 億 (oku), 万 (man)

Test Results

741 tests passing (up from 675)

  • Added 66 new tests across all languages
  • All existing tests continue to pass

Test Count by Language

  • EN: 111 tests (baseline)
  • FR: 55 tests (+13 new)
  • ES: 55 tests (+11 new)
  • DE: 54 tests (+10 new)
  • ZH: 55 tests (+11 new)
  • HI: 55 tests (+11 new)
  • JA: 54 tests (+10 new)

Verification

cd text-processing-rs
cargo build
cargo test

All tests pass with no regressions.



Achieves full feature and test parity across all 6 non-English languages
for text normalization (TN). Each language now has all 10 TN taggers
(cardinal, ordinal, decimal, money, date, time, measure, electronic,
telephone, whitelist) with complete test coverage.

Changes:
- Add 66 new tagger modules across 6 languages (11 files × 6 languages)
- Implement missing features for all languages:
  * Decimal quantity suffixes (billion/million/thousand equivalents)
  * Money scale suffixes for large amounts
  * Date decade parsing (e.g., "1980s")
- Add 66 new tests across all languages for complete coverage
- Update FFI and Swift bindings to support all 6 languages
- Add multilang_tn_tests.rs for integration testing

Test results: 741 tests passing (up from 675)

Languages:
- FR (French): milliard, million, mille
- ES (Spanish): mil millones, millones, mil
- DE (German): Milliarden, Millionen, Tausend
- ZH (Mandarin): yi (亿), wan (万)
- HI (Hindi): crore, lakh, hazaar
- JA (Japanese): oku (億), man (万)
@Alex-Wengg Alex-Wengg merged commit 959fe56 into main Mar 12, 2026
2 checks passed

devin-ai-integration (bot) left a comment


Devin Review found 3 potential issues.

View 5 additional findings in Devin Review.


Comment on lines +34 to +38
let sep = if number_part.contains(',') && !number_part.contains('.') {
    ','
} else if number_part.contains('.') {
    '.'
} else {


🟡 German cardinal parser treats dot as thousands separator, conflicting with decimal parser

The German cardinal parser (src/tts/de/cardinal.rs:25-27) accepts dots as thousands separators (e.g., "1.000" → "eintausend"). However, the German decimal parser (src/tts/de/decimal.rs:36-37) also accepts periods as decimal separators (e.g., "3.14" → "drei komma eins vier"). Because cardinal comes after decimal in the tagger priority order in tn_normalize_lang_de (src/lib.rs:229-234), the decimal parser will incorrectly claim ambiguous inputs like "2.025" (which in German is "zweitausendfuenfundzwanzig" with dot-as-thousands) and parse it as "zwei komma null zwei fuenf" instead. The cardinal test at src/tts/de/cardinal.rs:70-73 expects parse("2.025") to return Some("zweitausendfuenfundzwanzig"), but in sentence normalization, the decimal tagger (score 73) will match before cardinal (score 70) for the same input.
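The collision can be reproduced with two stand-in predicates. This is a minimal sketch of the separator rules described above, not the real parsers in src/tts/de/:

```rust
/// Stand-in for the decimal parser's shape check: digits '.' digits.
fn matches_de_decimal(s: &str) -> bool {
    s.split_once('.').map_or(false, |(a, b)| {
        !a.is_empty()
            && !b.is_empty()
            && a.chars().all(|c| c.is_ascii_digit())
            && b.chars().all(|c| c.is_ascii_digit())
    })
}

/// Stand-in for the cardinal parser's thousands grouping:
/// 1-3 digits, a dot, then exactly three digits.
fn matches_de_thousands(s: &str) -> bool {
    s.split_once('.').map_or(false, |(a, b)| {
        (1..=3).contains(&a.len())
            && b.len() == 3
            && a.chars().all(|c| c.is_ascii_digit())
            && b.chars().all(|c| c.is_ascii_digit())
    })
}
```

Both predicates accept "2.025", so whichever tagger runs first decides the reading; "3.14" only matches the decimal shape, which is why the bug surfaces only on dot-plus-three-digit inputs.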



Comment on lines +175 to +176
if amount_str.contains(sep) && sep != '.' || amount_str.contains('.') {
    let actual_sep = if amount_str.contains(',') { ',' } else { '.' };


🟡 Operator precedence bug in money parse_amount condition produces confusing but accidentally correct logic

In src/tts/de/money.rs:175, src/tts/fr/money.rs:179, and src/tts/es/money.rs:178, the condition amount_str.contains(sep) && sep != '.' || amount_str.contains('.') is parsed by Rust as (amount_str.contains(sep) && sep != '.') || amount_str.contains('.') due to && binding tighter than ||. When sep is '.', this evaluates to (true && false) || true = true — which happens to be correct but only by coincidence. The real intent appears to be to always enter the decimal-handling branch when any separator is present. While the current behavior accidentally works, the redundant/misleading logic means sep is computed but never actually used — actual_sep on the next line re-computes the same thing. This is fragile and a maintenance hazard; a future change could easily break it.

Suggested change
- if amount_str.contains(sep) && sep != '.' || amount_str.contains('.') {
-     let actual_sep = if amount_str.contains(',') { ',' } else { '.' };
+ if amount_str.contains(',') || amount_str.contains('.') {
+     let actual_sep = if amount_str.contains(',') { ',' } else { '.' };
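The precedence claim is easy to check: in Rust, `&&` binds tighter than `||`, so `a && b || c` parses as `(a && b) || c`:

```rust
// `&&` binds tighter than `||`, so these two forms are equivalent.
fn implicit(a: bool, b: bool, c: bool) -> bool {
    a && b || c
}

fn explicit(a: bool, b: bool, c: bool) -> bool {
    (a && b) || c
}
```

In the money.rs condition, when sep is '.', the left conjunct is (true && false) = false, and the overall result then rests entirely on contains('.'), which is why the behavior is correct only by coincidence.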


Comment on lines +126 to +134
match lang {
    "fr" => tn_normalize_lang_fr(input),
    "es" => tn_normalize_lang_es(input),
    "de" => tn_normalize_lang_de(input),
    "zh" => tn_normalize_lang_zh(input),
    "hi" => tn_normalize_lang_hi(input),
    "ja" => tn_normalize_lang_ja(input),
    _ => tn_normalize(input),
}


🟡 tn_normalize_for_lang does not handle "en" explicitly, causing tn_normalize_lang("123", "en") to double-trim

tn_normalize_for_lang at src/lib.rs:123-134 does not match "en" explicitly; English falls through to the _ arm, which calls tn_normalize(input). Since tn_normalize_for_lang already trims at line 124 and tn_normalize (src/lib.rs:570) calls input.trim() again, the English path double-trims. The double-trim is harmless, but it is asymmetric with tn_normalize_sentence_with_max_span_lang, which does match "en" | "" explicitly at line 692. Both functions ultimately treat "" as English (one via an explicit arm, one via the _ fallback), so observable behavior is consistent, but the empty-string case is handled and documented inconsistently between the two.
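A sketch of the suggested symmetry fix, with a stub standing in for the real tn_normalize function and the other language arms elided:

```rust
/// Stub standing in for the English normalizer (`tn_normalize` in src/lib.rs).
fn normalize_en(input: &str) -> String {
    input.to_string()
}

/// Sketch of tn_normalize_for_lang with an explicit "en" | "" arm, mirroring
/// tn_normalize_sentence_with_max_span_lang. Illustrative names only.
fn tn_normalize_for_lang_sketch(input: &str, lang: &str) -> String {
    let input = input.trim(); // trim once here; the EN path no longer re-trims
    match lang {
        // "fr" => tn_normalize_lang_fr(input), ... other languages elided ...
        "en" | "" => normalize_en(input),
        _ => normalize_en(input), // unknown codes still fall back to English
    }
}
```

Making the "en" | "" arm explicit documents the empty-string fallback in both dispatch functions the same way.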


