UNGGAH-UNGGUH
Source: arXiv | Updated: 2025-02-28 | Indexed: 2025-03-04
Download link:
http://arxiv.org/abs/2502.20864v1
Overview:
UNGGAH-UNGGUH is a multicultural honorific corpus specifically designed for the Javanese language, aiming to capture the nuances of Unggah-Ungguh Basa, the speech etiquette framework in Javanese that guides the selection of words and phrases based on social hierarchy and context. The dataset is curated from multiple reputable sources, including dictionaries and books, to ensure comprehensive coverage of honorific usage. It contains 4,024 sentences covering four honorific levels: Ngoko, Ngoko Alus, Krama, and Krama Alus.
Provided by:
Bandung Institute of Technology, Monash University Indonesia, Capital One, MBZUAI
Created:
2025-02-28
Collected summary
Dataset introduction

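Construction
The construction of the UNGGAH-UNGGUH dataset began with an in-depth study of the Javanese honorific system, which varies with the social status of the speaker, the listener, and the referent. The dataset draws on multiple sources, including Javanese honorific dictionaries and books, which provide rich context. Using optical character recognition (OCR) followed by manual proofreading, the researchers extracted example sentences from non-digitized dictionaries and excluded definitions and other lexical content. Further example sentences were collected from other reference books that provide illustrative sentences with explanations of their honorific levels. These sentences were systematically extracted and annotated with their honorific level.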
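As a rough illustration of what one annotated entry might look like after extraction, here is a minimal Python sketch. The four level names come from the dataset description; the record layout, the `source` field, and the helper `level_counts` are assumptions made for illustration and do not reflect the dataset's actual file format. The sentence texts are placeholders, not real corpus entries.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class HonorificLevel(Enum):
    """The four honorific levels named in the dataset description."""
    NGOKO = "Ngoko"
    NGOKO_ALUS = "Ngoko Alus"
    KRAMA = "Krama"
    KRAMA_ALUS = "Krama Alus"


@dataclass(frozen=True)
class LabeledSentence:
    """Hypothetical record layout: one sentence plus its annotated level."""
    text: str
    level: HonorificLevel
    source: str  # assumed field, e.g. "dictionary" or "book"


def level_counts(records):
    """Count sentences per level, reporting 0 for levels that never occur."""
    counts = Counter(record.level for record in records)
    return {level: counts.get(level, 0) for level in HonorificLevel}


# Placeholder records; the texts are stand-ins, not sentences from the corpus.
records = [
    LabeledSentence("<sentence 1>", HonorificLevel.NGOKO, "dictionary"),
    LabeledSentence("<sentence 2>", HonorificLevel.KRAMA, "dictionary"),
    LabeledSentence("<sentence 3>", HonorificLevel.NGOKO, "book"),
]

for level, count in level_counts(records).items():
    print(f"{level.value}: {count}")
```

Counting per level in this way is one simple check on how balanced the label distribution of a corpus is.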
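Usage
The UNGGAH-UNGGUH dataset can be used to evaluate how well language models (LMs) handle Javanese honorifics. Researchers can use it to train and evaluate LMs on four downstream natural language processing (NLP) tasks: honorific level classification, honorific style transfer, cross-lingual honorific translation, and dialogue generation with honorific roles. The evaluation covers two types of models: fine-tuned models and off-the-shelf models. The fine-tuned models include encoder-based and decoder-based models, while the off-the-shelf models include English-centric models, multilingual models, Southeast Asian regional models, and models tailored to Indonesian and its local languages. Evaluation metrics include accuracy, precision, recall, and F1 score, along with BLEU and chrF++ scores to measure translation quality.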
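For the classification task, the listed metrics can be computed directly from gold and predicted labels. The sketch below is a minimal pure-Python version; macro averaging over the four levels is an assumption here, since this page does not say which averaging the paper uses, and `classification_report` is a hypothetical helper name.

```python
LEVELS = ["Ngoko", "Ngoko Alus", "Krama", "Krama Alus"]


def classification_report(gold, pred, labels=LEVELS):
    """Return accuracy, macro-F1, and per-class precision/recall/F1."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one example is required")

    pairs = list(zip(gold, pred))
    per_class = {}
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        # Undefined ratios (zero denominator) are reported as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}

    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    # Assumption: unweighted (macro) average of the per-class F1 scores.
    macro_f1 = sum(m["f1"] for m in per_class.values()) / len(labels)
    return {"accuracy": accuracy, "macro_f1": macro_f1, "per_class": per_class}


# Toy labels for illustration only; these are not results from the paper.
gold = ["Ngoko", "Ngoko", "Krama", "Krama Alus"]
pred = ["Ngoko", "Krama", "Krama", "Krama Alus"]
report = classification_report(gold, pred)
print(report["accuracy"], report["macro_f1"])
```

Macro averaging weights every honorific level equally, which keeps a rarely predicted level from being hidden by a frequent one. BLEU and chrF++ for the translation task are usually computed with an established scoring library rather than by hand.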
Background and challenges
Background overview
Javanese, spoken by over 98 million people, has an intricate honorific system known as Unggah-Ungguh Basa. This speech etiquette framework, rooted in cultural norms and historical traditions, dictates the choice of words and phrases based on social hierarchy and context, and it is central to conveying respect and formality in conversation. Despite its significance, few linguistic resources capture the nuances of this system for natural language processing (NLP) applications. The paper presents UNGGAH-UNGGUH, a curated dataset designed to cover the variations of Unggah-Ungguh Basa. The dataset, the first of its kind for Javanese, is annotated with four honorific levels (Ngoko, Ngoko Alus, Krama, and Krama Alus), each representing a different degree of formality and respect. It supports the development of more accurate and culturally sensitive NLP models for Javanese and encourages future research on other low-resource languages with similarly complex sociolinguistic structures.
Current challenges
The primary challenge in the Javanese honorific system lies in the complexity and variability of its honorific levels. Current language models (LMs) struggle to accurately interpret and generate Javanese honorifics due to the absence of a well-annotated corpus. Moreover, most existing Javanese corpora exhibit an imbalanced distribution of honorific levels, further limiting model performance. This imbalance is a significant hurdle as language models increasingly serve as personal assistants across various domains, requiring the ability to adapt to user expectations and maintain perceived status and formality. The UNGGAH-UNGGUH dataset aims to bridge this gap by providing a balanced and diverse corpus for training and evaluating LMs. However, the dataset's size, currently at 4,024 sentences, may not capture the full complexity of Javanese honorifics, necessitating further expansion with more diverse sources, including spoken language. Additionally, the dataset does not account for regional dialects of Javanese, limiting model accuracy in different linguistic regions. Future work should include dialectal variations for broader coverage. Ethically, misclassification of honorifics can result in disrespectful interactions within Javanese social contexts, highlighting the need for users of these models to be aware of their limitations in handling culturally sensitive language features.
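Common scenarios
Typical use cases
The UNGGAH-UNGGUH dataset is a valuable resource for natural language processing (NLP) tasks, particularly for understanding and using the Javanese honorific system. It provides sentences annotated with the four main Javanese honorific levels (Ngoko, Ngoko Alus, Krama, and Krama Alus), which makes it an important benchmark for evaluating whether language models (LMs) understand and generate appropriate honorifics. Researchers can use UNGGAH-UNGGUH for honorific level classification, honorific style transfer, cross-lingual honorific translation, and honorific dialogue generation, and thereby examine how well LMs handle complex linguistic and cultural features.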
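Academic problems addressed
The UNGGAH-UNGGUH dataset addresses the limitations of current NLP models in understanding and generating Javanese honorifics. Because no suitably annotated corpus existed, existing models handle Javanese honorifics poorly, which has limited the development of effective NLP tools that can cope with the complexity of honorifics. UNGGAH-UNGGUH provides the first multicultural honorific corpus, paving the way for more accurate and culturally sensitive NLP models. In addition, the dataset's balance and diversity make it a valuable resource for research on low-resource languages, and it encourages future research on other languages with similarly complex sociolinguistic structures.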
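Practical applications
The UNGGAH-UNGGUH dataset matters in practice, particularly in social interaction scenarios that require Javanese honorifics. For example, when developing a personal assistant or chatbot, UNGGAH-UNGGUH can help a model generate appropriate honorifics based on the user's social status and the context. It can also be used for education and training, helping learners better understand and use Javanese honorifics. By improving how models handle honorifics, UNGGAH-UNGGUH helps enable more natural and culturally sensitive language interaction, which is essential for maintaining appropriate social interaction in multicultural settings.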
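Recent research on the dataset
Latest research directions
The most recent research on the UNGGAH-UNGGUH dataset evaluates how well language models handle Javanese honorifics. Through classification and machine translation tasks, the study examines how current language models perform across the different Javanese honorific levels. The results show that existing language models struggle with most honorific levels and exhibit a preference for certain levels. The study also explores whether language models can generate contextually appropriate Javanese honorifics in dialogue tasks. These findings reveal the limitations of current language models in handling complex honorific systems and provide an important reference for future NLP research and development on low-resource languages.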
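One simple way to make a preference for certain levels visible is to compare how often a model predicts each level with how often that level appears in the gold labels. The sketch below is an illustrative diagnostic under that assumption, not the analysis used in the paper; `prediction_skew` is a hypothetical helper name and the labels are toy data.

```python
from collections import Counter

LEVELS = ["Ngoko", "Ngoko Alus", "Krama", "Krama Alus"]


def prediction_skew(gold, pred, labels=LEVELS):
    """Return, per level, (predicted share - gold share).

    A positive value means the model outputs that level more often
    than it actually occurs; a negative value means less often.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one example is required")
    total = len(gold)
    gold_counts = Counter(gold)
    pred_counts = Counter(pred)
    return {
        label: (pred_counts[label] - gold_counts[label]) / total
        for label in labels
    }


# Toy labels for illustration only; these are not results from the paper.
gold = ["Ngoko", "Ngoko Alus", "Krama", "Krama Alus"]
pred = ["Ngoko", "Ngoko", "Ngoko", "Krama"]
for level, skew in prediction_skew(gold, pred).items():
    print(f"{level}: {skew:+.2f}")
```

In this toy case the model over-predicts Ngoko and never outputs the two Alus levels, which is the kind of pattern a per-class breakdown would expose.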
Related papers
- Do Language Models Understand Honorific Systems in Javanese? Bandung Institute of Technology, Monash University Indonesia, Capital One, MBZUAI · 2025
The content above was collected and summarized by 遇见数据集.



