
Echoes of Vagueness: A Corpus-Based Study of Semantic Ambiguity in Hakka AI Translation

Source: DataONE. Updated 2025-05-30; indexed 2025-11-01.
Download link:
https://search.dataone.org/view/sha256:1c7f9bb9252196ab3fcc2a61a4f1e24a66d8f2473a682612947b6189f30177c7
Description:
This study explores how large language models (LLMs), specifically GPT-4o, handle semantic ambiguity in low-resource languages, focusing on Hakka (Sixian, or Si-Hsien, dialect). Unlike previous studies on Taiwanese, which emphasize semantic leakage, this paper investigates how LLMs interpret and resolve lexical polysemy, context-dependent meanings, and pragmatically underspecified expressions during Hakka-to-Mandarin AI translation. We introduce the notion of Ambiguity Resolution Trajectories (ART) to trace whether ambiguity is preserved, disambiguated, distorted, or newly generated through back-translation. Our corpus, drawn from the Hakka Language Certification Vocabulary Database, was translated and back-translated using GPT-4o. Using a combined framework of entropy-based stylometrics, embedding divergence, and qualitative content analysis, we categorize ambiguity phenomena and assess the model's pragmatic decision-making. Findings reveal systematic biases in how GPT-4o resolves or simplifies ambiguity, with implications for translation studies, computational pragmatics, and low-resource language equity.
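The description names entropy-based stylometrics and embedding divergence but does not specify how either is computed. The following is a minimal sketch of one plausible reading, not the authors' method: it assumes character-level Shannon entropy as the stylometric measure and cosine distance as the divergence between two embedding vectors. The function names (`char_entropy`, `entropy_shift`, `cosine_divergence`) are hypothetical, and the embedding vectors are assumed to come from some external model that is not shown here.

```python
import math
from collections import Counter
from typing import Sequence


def char_entropy(text: str) -> float:
    """Shannon entropy, in bits per character, of the character distribution.

    Assumption: character-level entropy stands in for the study's
    unspecified "entropy-based stylometrics".
    """
    counts = Counter(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_shift(source: str, back_translated: str) -> float:
    """Entropy of the back-translation minus entropy of the source text."""
    return char_entropy(back_translated) - char_entropy(source)


def cosine_divergence(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine distance (1 - cosine similarity) between two embedding vectors.

    Assumption: cosine distance stands in for the study's unspecified
    "embedding divergence".
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine divergence is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)
```

Under these assumptions, a negative `entropy_shift` between a Hakka source item and its back-translation would be consistent with the output having been simplified, and a larger `cosine_divergence` would suggest greater semantic drift. Neither number identifies the ART category on its own; that would still depend on the qualitative analysis the description mentions.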

Created: 2025-10-29