Untranslatables in Taiwanese: A Corpus-Based Study of Semantic Leakage in AI Translations

NIAID Data Ecosystem, indexed 2026-05-10

Download link:

https://doi.org/10.7910/DVN/0WM6HO

Resource description:

This corpus-based study introduces a framework for examining semantic leakage in AI translations between Taiwanese Hokkien and Mandarin Chinese, addressing a gap in our understanding of how AI performs on low-resource, non-standardized languages. We analyze 26,046 sentences from the TAT_MOE Corpus (Liao, 2022) processed through ChatGPT-4o, and we quantify and categorize untranslatable elements using a hybrid methodology that combines character-count analysis, back-translation divergence, and qualitative pattern identification. Results show that while 87.49% of translations exhibit minimal leakage, 12.51% demonstrate moderate to severe semantic loss, challenging optimistic narratives about AI translation capabilities. Measured by average leakage, structural differences between the languages (19.21%) and phonological pattern loss (7.67%) are the most potent sources of meaning distortion, surpassing cultural term loss (3.53%). We identify and analyze five distinct untranslatable patterns; idiom/proverb changes (40.67%) and cultural term loss (10.52%) are particularly prevalent, though they affect leakage severity less than structural issues do. This research empirically demonstrates AI's limitations in processing the linguistic nuances of Taiwanese, provides a methodological framework for evaluating semantic fidelity in computational translation of linguistically marginalized languages, and offers insights for translation theory, AI development, and language preservation efforts, highlighting the persistent need for human expertise in navigating complex linguistic landscapes.
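The description names character-count analysis and back-translation divergence as two of the quantitative components but does not give their formulas. The following is a minimal sketch of how such measures might be computed, not the study's actual method. The function names, the relative-length formula, the use of `difflib` similarity as a divergence proxy, and the equal default weighting are all assumptions made here for illustration.

```python
from difflib import SequenceMatcher


def char_count_leakage(source: str, translation: str) -> float:
    """Relative character-count difference in [0, 1].

    Assumed formula (not from the study): |len(a) - len(b)| / max(len).
    """
    longest = max(len(source), len(translation))
    if longest == 0:
        return 0.0  # two empty strings: nothing to lose
    return abs(len(source) - len(translation)) / longest


def back_translation_divergence(source: str, back_translation: str) -> float:
    """One minus the difflib similarity ratio, in [0, 1].

    Assumed proxy (not from the study): 0.0 means the back-translation
    reproduces the source exactly, 1.0 means no matching characters.
    """
    similarity = SequenceMatcher(None, source, back_translation).ratio()
    return 1.0 - similarity


def leakage_score(source: str, translation: str, back_translation: str,
                  weight: float = 0.5) -> float:
    """Weighted mix of the two measures; the weight is a hypothetical choice."""
    length_part = char_count_leakage(source, translation)
    divergence_part = back_translation_divergence(source, back_translation)
    return weight * length_part + (1.0 - weight) * divergence_part


if __name__ == "__main__":
    # Placeholder strings, not corpus data.
    print(leakage_score("abcd", "ab", "abcd"))
```

Character counts compare cleanly only when both sides use the same script; if the Taiwanese side were romanized rather than written in Han characters, raw lengths would need normalizing before a comparison like this is meaningful.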

Created:

2025-10-25