
feat: add complete multi-language TN support for FR, ES, DE, ZH, HI, JA#7

Merged
Alex-Wengg merged 2 commits into main from feat/multi-language-tn-parity
Mar 12, 2026

Conversation


Alex-Wengg (Member) commented Mar 12, 2026

Summary

Achieves full feature and test parity across all 6 non-English languages (French, Spanish, German, Chinese, Hindi, Japanese) for text normalization (TN). Each language now has all 10 TN taggers with complete test coverage.

Changes

New Features Implemented (all 6 languages)

  • Decimal quantity suffixes: Support for scale words after decimals (e.g., "1.5 billion", "2.3万")
  • Money scale suffixes: Large amount notation (e.g., "¥2.5万", "$50 million")
  • Date decade parsing: Convert "1980s" to spoken form in each language
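The first of these features can be pictured with a toy tokenizer rule. A minimal sketch, assuming a hypothetical `split_decimal_quantity` helper (not the PR's actual API), of how a token like "1.5 billion" might be split so the decimal part and the scale word can be verbalized separately:

```rust
/// Hypothetical helper (not the PR's API): split a token like "1.5 billion"
/// into its decimal part and scale word for separate verbalization.
fn split_decimal_quantity(token: &str) -> Option<(&str, &str)> {
    const SCALES: [&str; 3] = ["thousand", "million", "billion"];
    let (num, scale) = token.split_once(' ')?;
    // Require a decimal number (digits and a dot) followed by a known scale word.
    let is_decimal = num.contains('.')
        && num.chars().all(|c| c.is_ascii_digit() || c == '.');
    if is_decimal && SCALES.contains(&scale) {
        Some((num, scale))
    } else {
        None
    }
}
```

Each language's tagger would use its own scale-word list (see below) in place of the English constants.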

New Files (66 total)

  • 11 tagger modules per language × 6 languages
  • Integration test suite: tests/multilang_tn_tests.rs

Language-Specific Scale Words

  • FR: billiard, milliard, million, mille
  • ES: mil millones, millones, mil
  • DE: Milliarden, Millionen, Tausend
  • ZH: 亿 (yi), 万 (wan)
  • HI: crore, lakh, hazaar
  • JA: 億 (oku), 万 (man)

Test Results

741 tests passing (up from 675)

  • Added 66 new tests across all languages
  • All existing tests continue to pass

Test Count by Language

  • EN: 111 tests (baseline)
  • FR: 55 tests (+13 new)
  • ES: 55 tests (+11 new)
  • DE: 54 tests (+10 new)
  • ZH: 55 tests (+11 new)
  • HI: 55 tests (+11 new)
  • JA: 54 tests (+10 new)

Verification

cd text-processing-rs
cargo build
cargo test

All tests pass with no regressions.



Achieves full feature and test parity across all 6 non-English languages
for text normalization (TN). Each language now has all 10 TN taggers
(cardinal, ordinal, decimal, money, date, time, measure, electronic,
telephone, whitelist) with complete test coverage.

Changes:
- Add 66 new tagger modules across 6 languages (11 files × 6 languages)
- Implement missing features for all languages:
  * Decimal quantity suffixes (billion/million/thousand equivalents)
  * Money scale suffixes for large amounts
  * Date decade parsing (e.g., "1980s")
- Add 66 new tests across all languages for complete coverage
- Update FFI and Swift bindings to support all 6 languages
- Add multilang_tn_tests.rs for integration testing

Test results: 741 tests passing (up from 675)

Languages:
- FR (French): milliard, million, mille
- ES (Spanish): mil millones, millones, mil
- DE (German): Milliarden, Millionen, Tausend
- ZH (Mandarin): yi (亿), wan (万)
- HI (Hindi): crore, lakh, hazaar
- JA (Japanese): oku (億), man (万)
@Alex-Wengg Alex-Wengg merged commit 959fe56 into main Mar 12, 2026
2 checks passed

devin-ai-integration (bot) left a comment


Devin Review found 3 potential issues.

View 5 additional findings in Devin Review.


Comment on lines +34 to +38
let sep = if number_part.contains(',') && !number_part.contains('.') {
    ','
} else if number_part.contains('.') {
    '.'
} else {


🟡 German cardinal parser treats dot as thousands separator, conflicting with decimal parser

The German cardinal parser (src/tts/de/cardinal.rs:25-27) accepts dots as thousands separators (e.g., "1.000" → "eintausend"). However, the German decimal parser (src/tts/de/decimal.rs:36-37) also accepts periods as decimal separators (e.g., "3.14" → "drei komma eins vier"). Because cardinal comes after decimal in the tagger priority order in tn_normalize_lang_de (src/lib.rs:229-234), the decimal parser will incorrectly claim ambiguous inputs like "2.025" (which in German is "zweitausendfuenfundzwanzig" with dot-as-thousands) and parse it as "zwei komma null zwei fuenf" instead. The cardinal test at src/tts/de/cardinal.rs:70-73 expects parse("2.025") to return Some("zweitausendfuenfundzwanzig"), but in sentence normalization, the decimal tagger (score 73) will match before cardinal (score 70) for the same input.
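The collision can be reproduced with two stand-in predicates. This is a minimal sketch of the separator rules described above, not the real parsers in src/tts/de/:

```rust
/// Stand-in for the decimal parser's shape check: digits '.' digits.
fn matches_de_decimal(s: &str) -> bool {
    s.split_once('.').map_or(false, |(a, b)| {
        !a.is_empty()
            && !b.is_empty()
            && a.chars().all(|c| c.is_ascii_digit())
            && b.chars().all(|c| c.is_ascii_digit())
    })
}

/// Stand-in for the cardinal parser's thousands grouping:
/// 1-3 digits, a dot, then exactly three digits.
fn matches_de_thousands(s: &str) -> bool {
    s.split_once('.').map_or(false, |(a, b)| {
        (1..=3).contains(&a.len())
            && b.len() == 3
            && a.chars().all(|c| c.is_ascii_digit())
            && b.chars().all(|c| c.is_ascii_digit())
    })
}
```

Both predicates accept "2.025", so whichever tagger runs first decides the reading; "3.14" only matches the decimal shape, which is why the bug surfaces only on dot-plus-three-digit inputs.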



Comment on lines +175 to +176
if amount_str.contains(sep) && sep != '.' || amount_str.contains('.') {
    let actual_sep = if amount_str.contains(',') { ',' } else { '.' };


🟡 Operator precedence bug in money parse_amount condition produces confusing but accidentally correct logic

In src/tts/de/money.rs:175, src/tts/fr/money.rs:179, and src/tts/es/money.rs:178, the condition amount_str.contains(sep) && sep != '.' || amount_str.contains('.') is parsed by Rust as (amount_str.contains(sep) && sep != '.') || amount_str.contains('.') due to && binding tighter than ||. When sep is '.', this evaluates to (true && false) || true = true — which happens to be correct but only by coincidence. The real intent appears to be to always enter the decimal-handling branch when any separator is present. While the current behavior accidentally works, the redundant/misleading logic means sep is computed but never actually used — actual_sep on the next line re-computes the same thing. This is fragile and a maintenance hazard; a future change could easily break it.

Suggested change
- if amount_str.contains(sep) && sep != '.' || amount_str.contains('.') {
-     let actual_sep = if amount_str.contains(',') { ',' } else { '.' };
+ if amount_str.contains(',') || amount_str.contains('.') {
+     let actual_sep = if amount_str.contains(',') { ',' } else { '.' };
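The precedence claim is easy to check: in Rust, `&&` binds tighter than `||`, so `a && b || c` parses as `(a && b) || c`:

```rust
// `&&` binds tighter than `||`, so these two forms are equivalent.
fn implicit(a: bool, b: bool, c: bool) -> bool {
    a && b || c
}

fn explicit(a: bool, b: bool, c: bool) -> bool {
    (a && b) || c
}
```

In the money.rs condition, when sep is '.', the left conjunct is (true && false) = false, and the overall result then rests entirely on contains('.'), which is why the behavior is correct only by coincidence.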


Comment on lines +126 to +134
match lang {
    "fr" => tn_normalize_lang_fr(input),
    "es" => tn_normalize_lang_es(input),
    "de" => tn_normalize_lang_de(input),
    "zh" => tn_normalize_lang_zh(input),
    "hi" => tn_normalize_lang_hi(input),
    "ja" => tn_normalize_lang_ja(input),
    _ => tn_normalize(input),
}


🟡 tn_normalize_for_lang does not handle "en" explicitly, causing tn_normalize_lang("123", "en") to double-trim

tn_normalize_for_lang at src/lib.rs:123-134 does not match "en" explicitly; English falls through to the _ arm, which calls tn_normalize(input). Since tn_normalize_for_lang already trims at line 124 and tn_normalize (src/lib.rs:570) calls input.trim() again, the English path double-trims. The double-trim is harmless, but it is asymmetric with tn_normalize_sentence_with_max_span_lang, which does match "en" | "" explicitly at line 692. Both functions ultimately treat "" as English (one via an explicit arm, one via the _ fallback), so observable behavior is consistent, but the empty-string case is handled and documented inconsistently between the two.
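A sketch of the suggested symmetry fix, with a stub standing in for the real tn_normalize function and the other language arms elided:

```rust
/// Stub standing in for the English normalizer (`tn_normalize` in src/lib.rs).
fn normalize_en(input: &str) -> String {
    input.to_string()
}

/// Sketch of tn_normalize_for_lang with an explicit "en" | "" arm, mirroring
/// tn_normalize_sentence_with_max_span_lang. Illustrative names only.
fn tn_normalize_for_lang_sketch(input: &str, lang: &str) -> String {
    let input = input.trim(); // trim once here; the EN path no longer re-trims
    match lang {
        // "fr" => tn_normalize_lang_fr(input), ... other languages elided ...
        "en" | "" => normalize_en(input),
        _ => normalize_en(input), // unknown codes still fall back to English
    }
}
```

Making the "en" | "" arm explicit documents the empty-string fallback in both dispatch functions the same way.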


