MBZUAI-Paris/MoroccanWikipedia-QA

Name: MBZUAI-Paris/MoroccanWikipedia-QA
Creator: MBZUAI-Paris
Published: 2024-09-27 06:31:43
License: no description provided

Hugging Face | Updated: 2024-09-27 | Indexed: 2025-04-12

Download link:

https://hf-mirror.com/datasets/MBZUAI-Paris/MoroccanWikipedia-QA

Resource description:

---
task_categories:
- text2text-generation
- text-generation
- question-answering
---

# Dataset Card for MoroccanWikipedia-QA (MW-QA)

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** [https://hf.co/datasets/MBZUAI-Paris/MoroccanWikipedia-QA](https://hf.co/datasets/MBZUAI-Paris/MoroccanWikipedia-QA)
- **Repository:** [https://github.com/MBZUAI-Paris/lm-evaluation-harness-Atlas-Chat](https://github.com/MBZUAI-Paris/lm-evaluation-harness-Atlas-Chat)
- **Paper:** [More Information Needed]

### Dataset Summary

MoroccanWikipedia-QA (MW-QA) is a dataset derived from the Moroccan Wikipedia dump to enhance the question-answering (QA) capabilities of language models in Moroccan Darija. The dataset is divided into four tasks: Open QA (8%), Multiple-Choice QA (40%) (MMLU-alike), Extractive QA (10%), and Multiple-Choice Extractive QA (42%) (Belebele-alike). The dataset is tailored to improve LLMs in both simple and complex QA tasks, providing diverse forms of questioning with and without context.

### Supported Tasks

- **Task Category:**
  - Open question answering
  - Multiple-choice question answering
  - Extractive question answering
  - Multiple-choice extractive question answering
- **Task:**
  - Answering multiple-choice questions in Darija.
  - Answering open questions based on the content.
  - Extracting answers from context.

### Languages

The dataset is available in Moroccan Arabic (Darija).

## Dataset Structure

The dataset consists of QA pairs generated from 8,730 Moroccan Wikipedia pages, categorized into four QA types. Each type uses a different percentage of the total Wikipedia pages.

### Data Instances

Example:

```
{
  "id": 4750,
  "task": "multichoice_extractive_qa",
  "question": "شكون لي كتر ف إطورا ف 2014؟",
  "context": "عدد السكان ديال إطورا تزاد ب 12.0% و عدد لفاميلات تزاد ب 14.8% مابين 2004 و 2014. ف 2014، عدد لبالغين كان 290 واحد، منهوم 141 دكور و 149 نتوات.",
  "choices": ["النسا", "الرجال", "متساويين", "ما كاينش معلومات كافية"],
  "answer": 0,
  "meta_data": {"page_id": 65754, "page_title": "إطورا (تكضيشت)", "url": "https://ary.wikipedia.org/?curid=65754"}
}
```

## Dataset Creation

### Curation Rationale

This dataset was created to improve question-answering performance in Moroccan Darija and to develop a well-rounded benchmark for various QA tasks.

### Source Data

#### Initial Data Collection and Normalization

The data was collected from the Moroccan Wikipedia dump and processed with Claude 3.5 Sonnet, which generated QA pairs across the four task categories using a mix of one-shot and two-shot prompting. The pages were divided among the tasks based on their content type.

#### Who are the source language producers?

The source language producers are the original authors of Moroccan Wikipedia pages. The QA pairs were machine-generated using Claude 3.5 Sonnet and manually curated for quality control by the MBZUAI-Paris team.

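The record layout shown under Data Instances can be turned into a plain-text multiple-choice prompt along the following lines. This is only a minimal sketch: the function names `build_prompt` and `gold_choice` and the lettered prompt layout are hypothetical, not part of the dataset or of any evaluation harness, and the code assumes that `answer` is a 0-based index into `choices` (which is consistent with the example, where the context reports 149 women against 141 men and `answer` is 0).

```python
import string

# Record copied from the Data Instances example in this card.
RECORD = {
    "id": 4750,
    "task": "multichoice_extractive_qa",
    "question": "شكون لي كتر ف إطورا ف 2014؟",
    "context": (
        "عدد السكان ديال إطورا تزاد ب 12.0% و عدد لفاميلات تزاد ب 14.8% "
        "مابين 2004 و 2014. ف 2014، عدد لبالغين كان 290 واحد، "
        "منهوم 141 دكور و 149 نتوات."
    ),
    "choices": ["النسا", "الرجال", "متساويين", "ما كاينش معلومات كافية"],
    "answer": 0,
}


def build_prompt(record):
    """Render a record as text: optional context, question, lettered choices.

    The layout is a hypothetical choice made for illustration only.
    """
    lines = []
    if record.get("context"):  # records without context simply skip this line
        lines.append(record["context"])
    lines.append(record["question"])
    for letter, choice in zip(string.ascii_uppercase, record.get("choices") or []):
        lines.append(f"{letter}. {choice}")
    return "\n".join(lines)


def gold_choice(record):
    """Return the correct choice text, ASSUMING `answer` is a 0-based index."""
    return record["choices"][record["answer"]]


if __name__ == "__main__":
    print(build_prompt(RECORD))
    print("gold:", gold_choice(RECORD))
```

Keeping the context optional lets the same helper serve records that carry no passage, although the exact fields present in the other task types are not shown in this card and should be checked against the data.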
### Annotations

#### Annotation process

The dataset was machine-generated and manually reviewed to ensure linguistic accuracy and appropriateness of the QA pairs.

#### Who are the annotators?

The annotations were machine-generated, with manual oversight by experts in Moroccan Darija.

### Personal and Sensitive Information

The dataset does not contain personal or sensitive information.

## Considerations for Using the Data

### Social Impact of Dataset

This dataset supports the development of language models and question-answering systems in Moroccan Darija, which is underrepresented in NLP research, advancing the performance of LLMs for local and low-resource contexts.

### Discussion of Biases

Since the dataset was generated using Claude 3.5 Sonnet, it may inherit biases from that model. Furthermore, cultural differences between the source and target languages might influence the difficulty or appropriateness of certain questions.

### Other Known Limitations

- The dataset is limited to Moroccan Wikipedia content and does not cover domains beyond the dataset's source material.

## Additional Information

### Dataset Curators

- MBZUAI-Paris team

### Licensing Information

- [GNU Free Documentation License](https://github.com/IQAndreas/markdown-licenses/blob/master/gnu-fdl-v1.3.md)

### Citation Information

```
@article{shang2024atlaschatadaptinglargelanguage,
  title={Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect},
  author={Guokan Shang and Hadi Abdine and Yousef Khoubrane and Amr Mohamed and Yassine Abbahaddou and Sofiane Ennadir and Imane Momayiz and Xuguang Ren and Eric Moulines and Preslav Nakov and Michalis Vazirgiannis and Eric Xing},
  year={2024},
  eprint={2409.17912},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.17912},
}
```

