AraDICE-HellaSwag

Name: AraDICE-HellaSwag
Creator: maas
Published: 2025-12-05 16:38:48
License: No description provided

ModelScope community. Updated: 2025-12-05. Indexed: 2025-06-21.

Download link:

https://modelscope.cn/datasets/QCRI/AraDICE-HellaSwag

Resource description:

# AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs

## Overview

The **AraDiCE** dataset is designed to evaluate dialectal and cultural capabilities in large language models (LLMs). The dataset consists of post-edited versions of various benchmark datasets, curated for validation in cultural and dialectal contexts relevant to Arabic. This repository contains the HellaSwag split of the data.

## Evaluation

We used the [lm-harness](https://github.com/EleutherAI/lm-evaluation-harness) evaluation framework for benchmarking. The related materials will be released soon. Stay tuned!

## License

The dataset is distributed under the **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)**. The full license text can be found in the accompanying `licenses_by-nc-sa_4.0_legalcode.txt` file.

## Citation

Please find the paper <a href="https://aclanthology.org/2025.coling-main.283/" target="_blank" style="margin-right: 15px; margin-left: 10px">here.</a>

```
@inproceedings{mousi-etal-2025-aradice,
    title = "{A}ra{D}i{CE}: Benchmarks for Dialectal and Cultural Capabilities in {LLM}s",
    author = "Mousi, Basel and Durrani, Nadir and Ahmad, Fatema and Hasan, Md. Arid and Hasanain, Maram and Kabbani, Tameem and Dalvi, Fahim and Chowdhury, Shammur Absar and Alam, Firoj",
    editor = "Rambow, Owen and Wanner, Leo and Apidianaki, Marianna and Al-Khalifa, Hend and Eugenio, Barbara Di and Schockaert, Steven",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.coling-main.283/",
    pages = "4186--4218",
    abstract = "Arabic, with its rich diversity of dialects, remains significantly underrepresented in Large Language Models, particularly in dialectal variations. We address this gap by introducing seven synthetic datasets in dialects alongside Modern Standard Arabic (MSA), created using Machine Translation (MT) combined with human post-editing. We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation. We evaluate LLMs on dialect comprehension and generation, focusing specifically on low-resource Arabic dialects. Additionally, we introduce the first-ever fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions, providing a novel dimension to LLM evaluation. Our findings demonstrate that while Arabic-specific models like Jais and AceGPT outperform multilingual models on dialectal tasks, significant challenges persist in dialect identification, generation, and translation. This work contributes {\ensuremath{\approx}}45K post-edited samples, a cultural benchmark, and highlights the importance of tailored training to improve LLM performance in capturing the nuances of diverse Arabic dialects and cultural contexts. We have released the dialectal translation models and benchmarks developed in this study (https://huggingface.co/datasets/QCRI/AraDiCE)"
}
```
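As a rough illustration of how a HellaSwag-style item is typically scored in a multiple-choice evaluation like the one mentioned under Evaluation, the sketch below picks the candidate ending with the highest log-likelihood, optionally divided by the ending's character length. This is a minimal sketch under stated assumptions, not the authors' evaluation code: the function names `pick_ending` and `accuracy` are hypothetical, the log-likelihood values are made up for the example, and the exact normalization used by any particular framework may differ.

```python
from typing import Sequence


def pick_ending(
    loglikelihoods: Sequence[float],
    endings: Sequence[str],
    normalize: bool = True,
) -> int:
    """Return the index of the best-scoring candidate ending.

    loglikelihoods[i] is assumed to be the model's log-likelihood of
    endings[i] given the shared context (hypothetical input; obtaining
    it from a real model is outside this sketch).
    """
    if not endings or len(loglikelihoods) != len(endings):
        raise ValueError("need one log-likelihood per candidate ending")
    # Length normalization keeps long endings from being penalized
    # merely for containing more characters.
    scores = [
        ll / max(len(ending), 1) if normalize else ll
        for ll, ending in zip(loglikelihoods, endings)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(predictions: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of items where the predicted index matches the gold index."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and aligned")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Illustrative values only: two candidate endings of different lengths.
example_endings = ["ab", "abcdefgh"]
example_lls = [-10.0, -12.0]
raw_choice = pick_ending(example_lls, example_endings, normalize=False)
norm_choice = pick_ending(example_lls, example_endings, normalize=True)
```

With these made-up numbers the raw score prefers the shorter ending, while the length-normalized score prefers the longer one, which is why evaluations of this kind often report both variants.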

Provider:

maas

Created:

2025-06-17
