Echoes of Vagueness: A Corpus-Based Study of Semantic Ambiguity in Hakka AI Translation

DataONE · Updated 2025-05-30 · Indexed 2025-11-01

Download link:

https://search.dataone.org/view/sha256:1c7f9bb9252196ab3fcc2a61a4f1e24a66d8f2473a682612947b6189f30177c7

Description:

This study explores how large language models (LLMs), specifically GPT-4o, handle semantic ambiguity in low-resource languages, focusing on Hakka (Si-Hsien dialect). Unlike previous studies on Taiwanese, which emphasize semantic leakage, this paper investigates how LLMs interpret and resolve lexical polysemy, context-dependent meanings, and pragmatically underspecified expressions during Hakka-to-Mandarin AI translation. We introduce the notion of Ambiguity Resolution Trajectories (ART) to trace whether ambiguity is preserved, disambiguated, distorted, or newly generated through back-translation. Our corpus, drawn from the Hakka Language Certification Vocabulary Database, was translated and back-translated using GPT-4o. Through a combined framework of entropy-based stylometrics, embedding divergence, and qualitative content analysis, we categorize ambiguity phenomena and assess the AI's pragmatic decision-making. Findings reveal systematic biases in how GPT-4o resolves or simplifies ambiguity, with implications for translation studies, computational pragmatics, and low-resource language equity.
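As a rough illustration of the two quantitative measures named in the description, the sketch below computes a character-level Shannon entropy and a cosine-based divergence between two vectors. It is not the authors' pipeline: the function names `char_entropy` and `cosine_divergence` are hypothetical, the choice of character-level entropy and of cosine distance is an assumption, and the vectors stand in for embeddings that would come from some embedding model not specified here.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy, in bits per character, of the character distribution.

    Assumption: entropy is taken over single characters; the study may use
    a different unit (tokens, words) that the description does not specify.
    """
    counts = Counter(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cosine_divergence(u, v) -> float:
    """Return 1 minus the cosine similarity of two equal-length vectors.

    Assumption: u and v are embeddings of a source sentence and of its
    back-translation, produced by an embedding model chosen by the user.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine divergence is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)
```

Under these assumptions, one could compare `char_entropy(source)` with `char_entropy(back_translation)` for each corpus entry and pair that difference with `cosine_divergence` of the two embeddings; how such numbers would map onto the preserved, disambiguated, distorted, or newly generated categories is left to the qualitative analysis the description mentions.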

Created:

2025-10-29