MBZUAI-Paris/DarijaHellaSwag

Name: MBZUAI-Paris/DarijaHellaSwag
Creator: MBZUAI-Paris
Published: 2024-09-27 06:34:03
License: No description available

Hugging Face · Updated 2024-09-27 · Indexed 2025-04-12

Download link:

https://hf-mirror.com/datasets/MBZUAI-Paris/DarijaHellaSwag

Resource overview:

---
annotations_creators:
- machine-translated
language_creators:
- machine-translated
language:
- ma
license:
- mit
multilinguality:
- translation
size_categories:
- 1K<n<10K
source_datasets:
- hellaswag
task_categories:
- question-answering
task_ids:
- multiple-choice-qa
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
  - split: validation
    path: data/validation-*
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: ind
    dtype: int64
  - name: activity_label
    dtype: string
  - name: ctx
    dtype: string
  - name: endings
    sequence: string
  - name: source_id
    dtype: string
  - name: split
    dtype: string
  - name: split_type
    dtype: string
  - name: label
    dtype: string
  splits:
  - name: test
    num_bytes: 12435114
    num_examples: 10003
  - name: validation
    num_bytes: 12851374
    num_examples: 10042
  - name: train
    num_bytes: 14507
    num_examples: 10
  download_size: 11530282
  dataset_size: 25300995
---

# Dataset Card for DarijaHellaSwag

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** [https://hf.co/datasets/MBZUAI-Paris/DarijaHellaSwag](https://hf.co/datasets/MBZUAI-Paris/DarijaHellaSwag)
- **Repository:** [https://github.com/MBZUAI-Paris/lm-evaluation-harness-atlas-chat](https://github.com/MBZUAI-Paris/lm-evaluation-harness-atlas-chat)
- **Paper:** [More Information Needed]

### Dataset Summary

DarijaHellaSwag is a challenging multiple-choice benchmark designed to evaluate machine reading comprehension and commonsense reasoning in Moroccan Darija. It is a translated version of the HellaSwag validation set, which presents scenarios where models must choose the most plausible continuation of a passage from four options.

### Supported Tasks

- **Task Category:** Multiple-choice question answering
- **Task:** Answering multiple-choice questions in Darija, focusing on understanding nuanced language and contextual inference.

### Languages

The dataset is available in Moroccan Arabic (Darija).

## Dataset Structure

DarijaHellaSwag consists of multiple-choice questions, with four options provided for each scenario.

### Data Instances

Each data instance includes:

- **ind**: Unique index for the instance.
- **activity_label**: A label representing the type of activity described.
- **ctx**: A passage describing a scenario (context).
- **endings**: A list of four possible continuations for the scenario.
- **source_id**: Identifier for the original source.
- **split**: The dataset split (train, validation, or test).
- **split_type**: Specifies whether the instance is from the original or a derived set.
- **label**: The index (0 to 3) of the correct continuation, stored as a string.

Example:

```
{
  "ind": 855,
  "activity_label": "الجري فماراطون",
  "ctx": "كاميرا كاتبعد على طريق و كاتبين رجلين ديال شي واحد كايتحركو. كاين بزاف ديال اللقطات كايبانو فيهم ناس كايربطو صبابطهم، كايضغطو على بوطون، و كايشوفو الساعة. الناس",
  "endings": [
    "كايدورو فالبيت لابسين شواشي مضحكين، طاقيات، و صبابط.",
    "من بعد كايبانو كايغسلو طوموبيل.",
    "من بعد كايبانو كايجريو فالطريق واحد بواحد و من وراهم بزاف ديال الناس كايتبعوهم.",
    "كايمشيو فالطريق، كايركبو بيسكليتات و كايعزفو على الزمارات."
  ],
  "source_id": "activitynet~v_9PvtW0Uvnl0",
  "split": "val",
  "split_type": "zeroshot",
  "label": "2"
}
```

## Dataset Creation

### Curation Rationale

This dataset was created to evaluate language models' ability to understand complex and commonsense scenarios in Moroccan Darija, a variety of Arabic that is underrepresented in NLP research.

### Source Data

#### Initial Data Collection and Normalization

The dataset is a translation of the **HellaSwag** dataset, which was originally created to test models' abilities in reading comprehension and commonsense reasoning. The translation was performed using Claude 3.5 Sonnet.

#### Who are the source language producers?

The original HellaSwag dataset was produced by the authors of the paper "HellaSwag: Can a Machine Really Finish Your Sentence?" by Rowan Zellers et al. The Darija translation was generated using machine translation with manual oversight.

### Annotations

#### Annotation process

The dataset was translated from English into Moroccan Darija.

#### Who are the annotators?

The translations were generated using Claude 3.5 Sonnet, and quality was assured through manual review by native Moroccan Darija speakers.

### Personal and Sensitive Information

The dataset does not contain personal or sensitive information.

## Considerations for Using the Data

### Social Impact of Dataset

The DarijaHellaSwag dataset promotes the development of models capable of understanding and reasoning in Moroccan Darija, contributing to NLP progress for this underrepresented, low-resource language.

### Discussion of Biases

Since the dataset was translated using Claude 3.5 Sonnet, it may inherit that model's biases. Furthermore, cultural differences between the source and target languages might influence the difficulty or appropriateness of certain questions.

### Other Known Limitations

- The dataset is limited to the domains covered by the original HellaSwag benchmark.
- Some nuances may be lost in translation, affecting the difficulty of certain questions.

## Additional Information

### Dataset Curators

- MBZUAI-Paris team

### Licensing Information

- [MIT License](https://github.com/MBZUAI-Paris/DarijaHellaSwag/blob/main/LICENSE)

### Citation Information

```
@article{shang2024atlaschatadaptinglargelanguage,
  title={Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect},
  author={Guokan Shang and Hadi Abdine and Yousef Khoubrane and Amr Mohamed and Yassine Abbahaddou and Sofiane Ennadir and Imane Momayiz and Xuguang Ren and Eric Moulines and Preslav Nakov and Michalis Vazirgiannis and Eric Xing},
  year={2024},
  eprint={2409.17912},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.17912},
}
```

```
@article{zellers2019hellaswag,
  title={HellaSwag: Can a Machine Really Finish Your Sentence?},
  author={Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin},
  journal={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2019}
}
```
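
### Usage Sketch

The fields listed under Data Instances can be turned into a multiple-choice prompt and scored with a few lines of Python. The following is a minimal sketch, not the curators' evaluation code: the helper names (`label_to_index`, `format_prompt`, `accuracy`, `load_split`) and the lettered A-D prompt layout are assumptions made for illustration, and the linked evaluation harness may score continuations differently. The record is the example instance from this card.

```python
from typing import Any, Dict, List

LETTERS = "ABCD"  # illustrative option labels; not part of the dataset


def label_to_index(record: Dict[str, Any]) -> int:
    """Convert the string-typed `label` field into an integer index (0-3)."""
    return int(record["label"])


def format_prompt(record: Dict[str, Any]) -> str:
    """Render the context followed by one lettered line per candidate ending."""
    lines = [record["ctx"]]
    for letter, ending in zip(LETTERS, record["endings"]):
        lines.append(f"{letter}. {ending}")
    return "\n".join(lines)


def accuracy(records: List[Dict[str, Any]], predictions: List[int]) -> float:
    """Fraction of records whose predicted ending index matches the label."""
    if len(records) != len(predictions):
        raise ValueError("records and predictions must have the same length")
    if not records:
        return 0.0
    correct = sum(label_to_index(r) == p for r, p in zip(records, predictions))
    return correct / len(records)


def load_split(split: str = "validation"):
    """Load one split from the Hub (needs the `datasets` package and network)."""
    from datasets import load_dataset  # imported lazily so the helpers above stay stdlib-only
    return load_dataset("MBZUAI-Paris/DarijaHellaSwag", split=split)


# Example instance copied from the dataset card.
example = {
    "ind": 855,
    "activity_label": "الجري فماراطون",
    "ctx": "كاميرا كاتبعد على طريق و كاتبين رجلين ديال شي واحد كايتحركو. كاين بزاف ديال اللقطات كايبانو فيهم ناس كايربطو صبابطهم، كايضغطو على بوطون، و كايشوفو الساعة. الناس",
    "endings": [
        "كايدورو فالبيت لابسين شواشي مضحكين، طاقيات، و صبابط.",
        "من بعد كايبانو كايغسلو طوموبيل.",
        "من بعد كايبانو كايجريو فالطريق واحد بواحد و من وراهم بزاف ديال الناس كايتبعوهم.",
        "كايمشيو فالطريق، كايركبو بيسكليتات و كايعزفو على الزمارات.",
    ],
    "source_id": "activitynet~v_9PvtW0Uvnl0",
    "split": "val",
    "split_type": "zeroshot",
    "label": "2",
}

if __name__ == "__main__":
    print(format_prompt(example))
    print("gold index:", label_to_index(example))
```

Because `label` is stored as a string, it has to be cast before it is compared with an integer prediction; `accuracy` does this through `label_to_index`. Calling `load_split("validation")` or `load_split("test")` would return records with the same fields, so the helpers apply unchanged.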

Provider:

MBZUAI-Paris
