
LoraxBench

ModelScope community. Updated 2025-12-05. Indexed 2025-08-23.
Download link:
https://modelscope.cn/datasets/google/LoraxBench
Resource description:
# LoraxBench: A Benchmark for Indonesian Local Languages and Registers

## Dataset Summary

LoraxBench is a comprehensive multilingual benchmark focusing on Indonesian and 19 Indonesian local languages, covering 6 diverse NLP tasks. It includes multiple registers for select languages, emphasizing the impact of formal and casual speech on model performance. LoraxBench is professionally translated and validated by native speakers. Its data is sourced from Indonesian-originated datasets, so it captures local nuances better than English-centric data.

LoraxBench fills a critical gap in NLP for Indonesia's linguistic diversity, where over 700 languages are spoken but few resources exist. Beyond Indonesia, it serves as a valuable resource for modeling challenges common in linguistically rich, resource-scarce regions worldwide.

## Languages

The dataset covers the following 20 languages:

| Language       | ISO Code | Approx. Speakers (millions) | Region             |
|----------------|----------|-----------------------------|--------------------|
| Acehnese       | ace      | 3.7                         | Aceh               |
| Ambonese Malay | abs      | 0.2                         | Ambon              |
| Balinese       | ban      | 4.8                         | Bali               |
| Banjar         | bjn      | 4.0                         | South Kalimantan   |
| Batak Toba     | bbc      | 2.5                         | North Sumatra      |
| Betawi         | bew      | 5.6                         | Jakarta            |
| Buginese       | bug      | 4.3                         | South Sulawesi     |
| Gorontalo      | gor      | 1.1                         | Gorontalo          |
| Iban           | iba      | 0.8                         | West Kalimantan    |
| Jambi Malay    | jax      | 1.0                         | Jambi              |
| Javanese       | jv       | 91.0                        | East/Central Java  |
| Lampung Nyo    | abl      | 1.5                         | Lampung            |
| Madurese       | mad      | 17.0                        | East Java          |
| Makasar        | mak      | 1.9                         | Makasar            |
| Minangkabau    | min      | 8.0                         | West Sumatra       |
| Musi           | mui      | 3.1                         | South Sumatra      |
| Ngaju          | nij      | 0.9                         | Central Kalimantan |
| Sasak          | sas      | 2.6                         | West Nusa Tenggara |
| Sundanese      | su       | 32.0                        | West Java          |
| Indonesian     | id       | > 170.0                     | Indonesia          |

## Registers Included

For three languages, LoraxBench includes two distinct registers capturing different levels of formality:

| Language  | Formal Register | Casual Register |
|-----------|-----------------|-----------------|
| Javanese  | Krama           | Ngoko           |
| Sundanese | Lemes           | Loma            |
| Madurese  | Engghi Ethen    | Enja’Iya        |

Formal registers are used in respectful or formal contexts; casual registers are used among peers and friends. The two show significant lexical and stylistic differences.

## Tasks and Data Sources

LoraxBench covers the following tasks.

### Reading Comprehension

Answering questions based on Indonesian text passages. This data is translated from the Indonesian subset of the [TyDi QA](https://huggingface.co/datasets/tydiqa) secondary task.

### Open-Domain Question Answering

Answering questions without access to context passages. This data is derived from the Indonesian subset of the [TyDi QA](https://huggingface.co/datasets/tydiqa) secondary task.

### Natural Language Inference (NLI)

Determining entailment, contradiction, or neutrality between sentence pairs. This data is translated from the test-expert subset of [IndoNLI](https://huggingface.co/datasets/afaji/indonli), specifically the single-sentence sets.

### Causal Reasoning

Reasoning about cause-effect relations in text. This data is translated from [COPAL-ID](https://huggingface.co/datasets/haryoaw/COPAL), a locally nuanced causal reasoning dataset. We filtered out some entries that are too Jakarta-specific.

### Machine Translation

Translating text to Indonesian. This data is taken from IndoNLI premises, which themselves originate from various webpages, news, and articles.

### Cultural Question Answering

Answering culturally relevant questions about Indonesia. We source this from [IndoCulture](https://huggingface.co/datasets/indolem/IndoCulture), with further filtering and clean-up. Specifically, we changed some distractors that were deemed obviously wrong, fixed some typos and writing inconsistencies, and removed some trivially easy questions. More on this in the paper.
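As a rough illustration of how the register table and the classification-style tasks above might be combined in an evaluation script, the following sketch computes accuracy per language and register and the casual-minus-formal gap. The record layout (`lang`, `register`, `gold`, `pred`), the helper names, and the toy predictions are all assumptions made for this example; they are not taken from the dataset's actual schema or from the paper's evaluation code.

```python
from collections import defaultdict

# Register names copied from the "Registers Included" table.
# The dict layout and the "formal"/"casual" keys are assumptions.
REGISTERS = {
    "jv": {"formal": "Krama", "casual": "Ngoko"},
    "su": {"formal": "Lemes", "casual": "Loma"},
    "mad": {"formal": "Engghi Ethen", "casual": "Enja’Iya"},
}


def accuracy_by_variant(records):
    """Return accuracy keyed by (language code, register).

    Each record is a dict with hypothetical keys 'lang', 'register'
    (None for languages with a single variant), 'gold', and 'pred'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        key = (record["lang"], record.get("register"))
        total[key] += 1
        correct[key] += int(record["pred"] == record["gold"])
    return {key: correct[key] / total[key] for key in total}


def register_gap(scores, lang):
    """Casual-minus-formal accuracy for one language, or None if a score is missing."""
    formal = scores.get((lang, "formal"))
    casual = scores.get((lang, "casual"))
    if formal is None or casual is None:
        return None
    return casual - formal


# Made-up NLI predictions, only to exercise the helpers.
records = [
    {"lang": "jv", "register": "formal", "gold": "entailment", "pred": "neutral"},
    {"lang": "jv", "register": "formal", "gold": "neutral", "pred": "neutral"},
    {"lang": "jv", "register": "casual", "gold": "entailment", "pred": "entailment"},
    {"lang": "jv", "register": "casual", "gold": "neutral", "pred": "neutral"},
    {"lang": "id", "register": None, "gold": "contradiction", "pred": "contradiction"},
]

scores = accuracy_by_variant(records)
for (lang, register), acc in sorted(scores.items(), key=lambda kv: str(kv[0])):
    print(lang, register, f"{acc:.2f}")
print("jv register gap:", register_gap(scores, "jv"))
```

Keying scores by a `(language, register)` pair keeps single-variant languages such as Indonesian and the three register-split languages in one table, which makes the comparison between formal and casual variants a simple lookup.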
## Personal and Sensitive Information

The corpora contain no personal or sensitive information. Data was sourced and translated with respect to privacy and ethical guidelines.

## Additional Information

- LoraxBench exposes challenges for multilingual models in low-resource and register-variant settings.
- Benchmark results highlight performance gaps between Indonesian, local languages, and registers.

## Dataset Curators

Google Research

## Licensing Information

This project is licensed under the [Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).

## Citation Information

Please cite the following papers when using this dataset:

- The main LoraxBench paper
- Clark et al., 2020. TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages.
- Mahendra et al., 2021. IndoNLI: A Natural Language Inference Dataset for Indonesian.
- Wibowo et al., 2024. COPAL-ID: Causal Reasoning in Indonesian.
- Koto et al., 2024. IndoCulture: Cultural Question Answering in Indonesia.
- Cahyawijaya et al., 2023. NusaCrowd: Indonesian NLP Dataset Collection.

Provider:
maas
Created:
2025-08-19