MultiNativQA

Name: MultiNativQA
Creator: maas
Published: 2025-12-05 16:38:47
License: No description provided

ModelScope community | Updated 2025-12-05 | Indexed 2025-06-21

Download link:

https://modelscope.cn/datasets/QCRI/MultiNativQA

Description:

# MultiNativQA: Multilingual Culturally-Aligned Natural Queries For LLMs

### Overview

The **MultiNativQA** dataset is a multilingual, native, and culturally aligned question-answering resource. It spans 7 languages, ranging from high- to extremely low-resource, and covers 9 different locations/cities.

To capture linguistic diversity, the dataset includes several dialects for dialect-rich languages like Arabic. In addition to Modern Standard Arabic (MSA), **MultiNativQA** features six Arabic dialects: *Egyptian, Jordanian, Khaliji, Sudanese, Tunisian*, and *Yemeni*. The dataset also provides two linguistic variations of Bangla, reflecting differences between speakers in *Bangladesh* and *West Bengal, India*. Additionally, **MultiNativQA** includes English queries from *Dhaka* and *Doha*, where English is commonly used as a second language, as well as from *New York, USA*.

The QA pairs in this dataset cover 18 diverse topics: *Animals, Business, Clothing, Education, Events, Food & Drinks, General, Geography, Immigration, Language, Literature, Names & Persons, Plants, Religion, Sports & Games, Tradition, Travel*, and *Weather*.

**MultiNativQA** is designed to evaluate and fine-tune large language models (LLMs) for long-form question answering while assessing their cultural adaptability and understanding.

### Directory Structure (JSON files only)

The dataset is organized into directories based on language and region. Each directory contains JSON files for the train, development, and test sets, with the exception of Nepali, which consists of only a test set.

- `arabic_qa/`
  - `NativQA_ar_msa_qa_dev.json`
  - `NativQA_ar_msa_qa_test.json`
  - `NativQA_ar_msa_qa_train.json`
- `assamese_in/`
  - `NativQA_asm_NA_in_dev.json`
  - `NativQA_asm_NA_in_test.json`
  - `NativQA_asm_NA_in_train.json`
- `bangla_bd/`
  - `NativQA_bn_scb_bd_dev.json`
  - `NativQA_bn_scb_bd_test.json`
  - `NativQA_bn_scb_bd_train.json`
- `bangla_in/`
  - `NativQA_bn_scb_in_dev.json`
  - `NativQA_bn_scb_in_test.json`
  - `NativQA_bn_scb_in_train.json`
- `english_bd/`
  - `NativQA_en_NA_bd_dev.json`
  - `NativQA_en_NA_bd_test.json`
  - `NativQA_en_NA_bd_train.json`
- `english_qa/`
  - `NativQA_en_NA_qa_dev.json`
  - `NativQA_en_NA_qa_test.json`
  - `NativQA_en_NA_qa_train.json`
- `hindi_in/`
  - `NativQA_hi_NA_in_dev.json`
  - `NativQA_hi_NA_in_test.json`
  - `NativQA_hi_NA_in_train.json`
- `nepali_np/`
  - `NativQA_ne_NA_np_test.json`
- `turkish_tr/`
  - `NativQA_tr_NA_tr_dev.json`
  - `NativQA_tr_NA_tr_test.json`
  - `NativQA_tr_NA_tr_train.json`

#### Example of a data entry

```
{
    "data_id": "cf92ec1e52b4b3071d263a1063b43928",
    "category": "immigration",
    "input_query": "How long can you stay in Qatar on a visitors visa?",
    "question": "Can I extend my tourist visa in Qatar?",
    "is_reliable": "very_reliable",
    "answer": "If you would like to extend your visa, you will need to proceed to immigration headquarters in Doha prior to the expiry of your visa and apply there for an extension.",
    "source_answer_url": "https://hayya.qa/en/web/hayya/faq"
}
```

##### Field Descriptions:

- **`data_id`**: Unique identifier for each data entry.
- **`category`**: General topic or category of the query (e.g., "health", "religion").
- **`input_query`**: The original user-submitted query.
- **`question`**: The formalized question derived from the input query.
- **`is_reliable`**: Reliability label for the provided answer (`"very_reliable"`, `"somewhat_reliable"`, or `"unreliable"`).
- **`answer`**: The system-provided answer to the query.
- **`source_answer_url`**: URL of the source from which the answer was derived.
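
The fields described above can be read and filtered with plain Python. The following is a minimal sketch, not part of the official tooling: it assumes each split file holds a JSON array of records with exactly these fields, and the helper names `load_records` and `filter_by_reliability` are hypothetical. The sample record is the example entry shown above.

```python
import json
import os
import tempfile

# Assumption: a split file is a JSON array of records with the documented fields.
# The single record below is the example entry from this README.
SAMPLE_RECORDS = [
    {
        "data_id": "cf92ec1e52b4b3071d263a1063b43928",
        "category": "immigration",
        "input_query": "How long can you stay in Qatar on a visitors visa?",
        "question": "Can I extend my tourist visa in Qatar?",
        "is_reliable": "very_reliable",
        "answer": (
            "If you would like to extend your visa, you will need to proceed to "
            "immigration headquarters in Doha prior to the expiry of your visa "
            "and apply there for an extension."
        ),
        "source_answer_url": "https://hayya.qa/en/web/hayya/faq",
    }
]


def load_records(path):
    """Read one split file and return its records as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def filter_by_reliability(records, allowed=("very_reliable",)):
    """Keep only records whose `is_reliable` label is in `allowed`."""
    return [r for r in records if r.get("is_reliable") in allowed]


if __name__ == "__main__":
    # Write the sample to a temporary file so the sketch runs without the dataset.
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "sample_test.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(SAMPLE_RECORDS, f, ensure_ascii=False)

        records = load_records(path)
        reliable = filter_by_reliability(records)
        for record in reliable:
            print(record["category"], "|", record["question"])
```

To use it on real data, point `load_records` at one of the files listed in the directory structure and widen `allowed` (for example, to include `"somewhat_reliable"`) as needed.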

### Statistics

Distribution of the **MultiNativQA** dataset across different languages.

<p align="left"> <img src="./language_donut_chart.png" style="width: 60%;" id="title-icon"> </p>

The dataset consists of two types of data: annotated and un-annotated. We treat the un-annotated data as additional data.

Number of final annotated QA pairs in **MultiNativQA**, by language and location:

| Language    | Location   | Train      | Dev       | Test       | Total      |
|-------------|------------|------------|-----------|------------|------------|
| Arabic      | Doha       | 3,649      | 492       | 988        | 5,129      |
| Assamese    | Assam      | 1,131      | 157       | 545        | 1,833      |
| Bangla      | Dhaka      | 7,018      | 953       | 1,521      | 9,492      |
| Bangla      | Kolkata    | 6,891      | 930       | 2,146      | 9,967      |
| English     | Dhaka      | 4,761      | 656       | 1,113      | 6,530      |
| English     | Doha       | 8,212      | 1,164     | 2,322      | 11,698     |
| Hindi       | Delhi      | 9,288      | 1,286     | 2,745      | 13,319     |
| Nepali      | Kathmandu  | --         | --        | 561        | 561        |
| Turkish     | Istanbul   | 3,527      | 483       | 1,218      | 5,228      |
| **Total**   |            | **44,477** | **6,121** | **13,159** | **63,757** |

Number of un-annotated additional QA pairs:

| Language-Location       | # of QA       |
|-------------------------|---------------|
| Arabic-Egypt            | 7,956         |
| Arabic-Palestine        | 5,679         |
| Arabic-Sudan            | 4,718         |
| Arabic-Syria            | 11,288        |
| Arabic-Tunisia          | 14,789        |
| Arabic-Yemen            | 4,818         |
| English-New York        | 6,454         |
| **Total**               | **55,702**    |

### How to download data

```
import os
import json
from datasets import load_dataset

dataset_names = ['arabic_qa', 'assamese_in', 'bangla_bd', 'bangla_in', 'english_bd',
                 'english_qa', 'hindi_in', 'nepali_np', 'turkish_tr']
base_dir = "./MNQA/"

for dname in dataset_names:
    output_dir = os.path.join(base_dir, dname)

    # Load one language/location configuration.
    dataset = load_dataset("QCRI/MultiNativQA", name=dname)

    # Save the dataset to the output directory. This saves all available splits.
    dataset.save_to_disk(output_dir)

    # Also export each available split as a JSON file.
    for split in ['train', 'dev', 'test']:
        # Some configurations (e.g., nepali_np) only have a test split.
        if split not in dataset:
            continue
        data = list(dataset[split])
        output_file = os.path.join(output_dir, f"{split}.json")
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=4)
```

### License

The dataset is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). The full license text can be found in the accompanying `licenses_by-nc-sa_4.0_legalcode.txt` file.

### Contact & Additional Information

For more details, please visit our [official website](http://nativqa.gitlab.io/).

### Citation

You can access the full paper [here](https://arxiv.org/pdf/2407.09823).

```
@article{hasan2024nativqa,
  title={NativQA: Multilingual Culturally-Aligned Natural Query for LLMs},
  author={Hasan, Md Arid and Hasanain, Maram and Ahmad, Fatema and Laskar, Sahinur Rahman and Upadhyay, Sunaya and Sukhadia, Vrunda N and Kutlu, Mucahid and Chowdhury, Shammur Absar and Alam, Firoj},
  journal={arXiv preprint arXiv:2407.09823},
  year={2024},
  publisher={arXiv:2407.09823},
  url={https://arxiv.org/abs/2407.09823},
}
```


Provider:

maas

Created:

2025-06-17
