PangeaBench-tydiqa

Name: PangeaBench-tydiqa
Creator: maas
Published: 2025-11-27 16:52:22
License: 暂无描述

魔搭社区2025-11-27 更新2025-11-15 收录

下载链接：

https://modelscope.cn/datasets/neulab/PangeaBench-tydiqa

下载链接

链接失效反馈

官方服务：

资源简介：

# Dataset Card for "tydiqa" ## Table of Contents - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splits](#data-splits) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Annotations](#annotations) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) - [Contributions](#contributions) ## Dataset Description - **Homepage:** [https://github.com/google-research-datasets/tydiqa](https://github.com/google-research-datasets/tydiqa) - **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) - **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) - **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) - **Size of downloaded dataset files:** 3.91 GB - **Size of the generated dataset:** 6.10 GB - **Total amount of disk used:** 10.00 GB ### Dataset Summary TyDi QA is a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology -- the set of linguistic features that each language expresses -- such that we expect models performing well on this set to generalize across a large number of the languages in the world. It contains language phenomena that would not be found in English-only corpora. To provide a realistic information-seeking task and avoid priming effects, questions are written by people who want to know the answer, but don’t know the answer yet, (unlike SQuAD and its descendents) and the data is collected directly in each language without the use of translation (unlike MLQA and XQuAD). ### Supported Tasks and Leaderboards [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Languages [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ## Dataset Structure ### Data Instances #### primary_task - **Size of downloaded dataset files:** 1.95 GB - **Size of the generated dataset:** 6.04 GB - **Total amount of disk used:** 7.99 GB An example of 'validation' looks as follows. ``` This example was too long and was cropped: { "annotations": { "minimal_answers_end_byte": [-1, -1, -1], "minimal_answers_start_byte": [-1, -1, -1], "passage_answer_candidate_index": [-1, -1, -1], "yes_no_answer": ["NONE", "NONE", "NONE"] }, "document_plaintext": "\"\\nรองศาสตราจารย์[1] หม่อมราชวงศ์สุขุมพันธุ์ บริพัตร (22 กันยายน 2495 -) ผู้ว่าราชการกรุงเทพมหานครคนที่ 15 อดีตรองหัวหน้าพรรคปร...", "document_title": "หม่อมราชวงศ์สุขุมพันธุ์ บริพัตร", "document_url": "\"https://th.wikipedia.org/wiki/%E0%B8%AB%E0%B8%A1%E0%B9%88%E0%B8%AD%E0%B8%A1%E0%B8%A3%E0%B8%B2%E0%B8%8A%E0%B8%A7%E0%B8%87%E0%B8%...", "language": "thai", "passage_answer_candidates": "{\"plaintext_end_byte\": [494, 1779, 2931, 3904, 4506, 5588, 6383, 7122, 8224, 9375, 10473, 12563, 15134, 17765, 19863, 21902, 229...", "question_text": "\"หม่อมราชวงศ์สุขุมพันธุ์ บริพัตร เรียนจบจากที่ไหน ?\"..." } ``` #### secondary_task - **Size of downloaded dataset files:** 1.95 GB - **Size of the generated dataset:** 58.03 MB - **Total amount of disk used:** 2.01 GB An example of 'validation' looks as follows. ``` This example was too long and was cropped: { "answers": { "answer_start": [394], "text": ["بطولتين"] }, "context": "\"أقيمت البطولة 21 مرة، شارك في النهائيات 78 دولة، وعدد الفرق التي فازت بالبطولة حتى الآن 8 فرق، ويعد المنتخب البرازيلي الأكثر تت...", "id": "arabic-2387335860751143628-1", "question": "\"كم عدد مرات فوز الأوروغواي ببطولة كاس العالم لكرو القدم؟\"...", "title": "قائمة نهائيات كأس العالم" } ``` ### Data Fields The data fields are the same among all splits. #### primary_task - `passage_answer_candidates`: a dictionary feature containing: - `plaintext_start_byte`: a `int32` feature. - `plaintext_end_byte`: a `int32` feature. - `question_text`: a `string` feature. - `document_title`: a `string` feature. - `language`: a `string` feature. - `annotations`: a dictionary feature containing: - `passage_answer_candidate_index`: a `int32` feature. - `minimal_answers_start_byte`: a `int32` feature. - `minimal_answers_end_byte`: a `int32` feature. - `yes_no_answer`: a `string` feature. - `document_plaintext`: a `string` feature. - `document_url`: a `string` feature. #### secondary_task - `id`: a `string` feature. - `title`: a `string` feature. - `context`: a `string` feature. - `question`: a `string` feature. - `answers`: a dictionary feature containing: - `text`: a `string` feature. - `answer_start`: a `int32` feature. ### Data Splits | name | train | validation | | -------------- | -----: | ---------: | | primary_task | 166916 | 18670 | | secondary_task | 49881 | 5077 | ## Dataset Creation ### Curation Rationale [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Source Data #### Initial Data Collection and Normalization [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) #### Who are the source language producers? [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Annotations #### Annotation process [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) #### Who are the annotators? [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Personal and Sensitive Information [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ## Considerations for Using the Data ### Social Impact of Dataset [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Discussion of Biases [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Other Known Limitations [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ## Additional Information ### Dataset Curators [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Licensing Information [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Citation Information ``` @article{tydiqa, title = {TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages}, author = {Jonathan H. Clark and Eunsol Choi and Michael Collins and Dan Garrette and Tom Kwiatkowski and Vitaly Nikolaev and Jennimaria Palomaki} year = {2020}, journal = {Transactions of the Association for Computational Linguistics} } ``` ### Contributions Thanks to [@thomwolf](https://github.com/thomwolf), [@albertvillanova](https://github.com/albertvillanova), [@lewtun](https://github.com/lewtun), [@patrickvonplaten](https://github.com/patrickvonplaten) for adding this dataset.

# 数据集卡片：“tydiqa” ## 目录 - [数据集描述](#dataset-description) - [数据集概述](#dataset-summary) - [支持任务与基准榜单](#supported-tasks-and-leaderboards) - [涉及语言](#languages) - [数据集结构](#dataset-structure) - [数据实例](#data-instances) - [数据字段](#data-fields) - [数据划分](#data-splits) - [数据集构建](#dataset-creation) - [构建初衷](#curation-rationale) - [源数据](#source-data) - [标注信息](#annotations) - [个人与敏感信息](#personal-and-sensitive-information) - [数据集使用注意事项](#considerations-for-using-the-data) - [数据集的社会影响](#social-impact-of-dataset) - [偏差讨论](#discussion-of-biases) - [其他已知局限](#other-known-limitations) - [附加信息](#additional-information) - [数据集维护者](#dataset-curators) - [许可信息](#licensing-information) - [引用信息](#citation-information) - [贡献者](#contributions) ## 数据集描述 - **"主页"**：[https://github.com/google-research-datasets/tydiqa](https://github.com/google-research-datasets/tydiqa) - **"代码仓库"**：[更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) - **"相关论文"**：[更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) - **"联络人"**：[更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) - **"下载数据集文件大小"**：3.91 GB - **"生成数据集大小"**：6.10 GB - **"总磁盘占用"**：10.00 GB ### 数据集概述 TyDi QA是一个问答数据集，涵盖11种类型学特征多样的语言，包含204,000个问答对。TyDi QA所覆盖的语言在类型学特征——即每种语言所具备的语言学特征集合——上差异显著，因此我们期望在该数据集上表现优异的模型，能够泛化至全球绝大多数语言。该数据集包含仅英语语料库中无法见到的语言现象。为还原真实的信息检索问答任务并避免启动效应（priming effects），问题由渴望获取答案但尚未知晓答案的人群撰写（与SQuAD及其衍生数据集不同），且数据直接以各语言原生方式收集，未经过翻译环节（与MLQA和XQuAD不同）。 ### 支持任务与基准榜单 [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### 涉及语言 [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ## 数据集结构 ### 数据实例 #### 主任务 - **"下载数据集文件大小"**：1.95 GB - **"生成数据集大小"**：6.04 GB - **"总磁盘占用"**：7.99 GB 验证集的一个示例如下：该示例过长已被截断： { "annotations": { "minimal_answers_end_byte": [-1, -1, -1], "minimal_answers_start_byte": [-1, -1, -1], "passage_answer_candidate_index": [-1, -1, -1], "yes_no_answer": ["NONE", "NONE", "NONE"] }, "document_plaintext": ""\nรองศาสตราจารย์[1] หม่อมราชวงศ์สุขุมพันธุ์ บริพัตร (22 กันยายน 2495 -) ผู้ว่าราชการกรุงเทพมหานครคนที่ 15 อดีตรองหัวหน้าพรรคปร...", "document_title": "หม่อมราชวงศ์สุขุมพันธุ์ บริพัตร", "document_url": ""https://th.wikipedia.org/wiki/%E0%B8%AB%E0%B8%A1%E0%B9%88%E0%B8%AD%E0%B8%A1%E0%B8%A3%E0%B8%B2%E0%B8%8A%E0%B8%A7%E0%B8%87%E0%B8%...", "language": "thai", "passage_answer_candidates": "{"plaintext_end_byte": [494, 1779, 2931, 3904, 4506, 5588, 6383, 7122, 8224, 9375, 10473, 12563, 15134, 17765, 19863, 21902, 229...", "question_text": ""หม่อมราชวงศ์สุขุมพันธุ์ บริพัตร เรียนจบจากที่ไหน ?"..." } #### 辅助任务 - **"下载数据集文件大小"**：1.95 GB - **"生成数据集大小"**：58.03 MB - **"总磁盘占用"**：2.01 GB 验证集的一个示例如下：该示例过长已被截断： { "answers": { "answer_start": [394], "text": ["بطولتين"] }, "context": ""أقيمت البطولة 21 مرة، شارك في النهائيات 78 دولة، وعدد الفرق التي فازت بالبطولة حتى الآن 8 فرق، ويعد المنتخب البرازيلي الأكثر تت...", "id": "arabic-2387335860751143628-1", "question": ""كم عدد مرات فوز الأوروغواي ببطولة كاس العالم لكرو القدم؟"...", "title": "قائمة نهائيات كأس العالم" } ### 数据字段所有划分下的数据字段均保持一致。 #### 主任务 - `passage_answer_candidates`：字典类型特征，包含以下字段： - `plaintext_start_byte`：`int32`类型特征。 - `plaintext_end_byte`：`int32`类型特征。 - `question_text`：字符串类型特征。 - `document_title`：字符串类型特征。 - `language`：字符串类型特征。 - `annotations`：字典类型特征，包含以下字段： - `passage_answer_candidate_index`：`int32`类型特征。 - `minimal_answers_start_byte`：`int32`类型特征。 - `minimal_answers_end_byte`：`int32`类型特征。 - `yes_no_answer`：字符串类型特征。 - `document_plaintext`：字符串类型特征。 - `document_url`：字符串类型特征。 #### 辅助任务 - `id`：字符串类型特征。 - `title`：字符串类型特征。 - `context`：字符串类型特征。 - `question`：字符串类型特征。 - `answers`：字典类型特征，包含以下字段： - `text`：字符串类型特征。 - `answer_start`：`int32`类型特征。 ### 数据划分 | 任务名称 | 训练集样本数 | 验证集样本数 | | -------------- | -----------: | -----------: | | 主任务 | 166916 | 18670 | | 辅助任务 | 49881 | 5077 | ## 数据集构建 ### 构建初衷 [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### 源数据 #### 初始数据收集与标准化 [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) #### 源语言生产者是谁？ [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### 标注信息 #### 标注流程 [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) #### 标注者是谁？ [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### 个人与敏感信息 [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ## 数据集使用注意事项 ### 数据集的社会影响 [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### 偏差讨论 [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### 其他已知局限 [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ## 附加信息 ### 数据集维护者 [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### 许可信息 [更多信息待补充](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### 引用信息 @article{tydiqa, title = {TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages}, author = {Jonathan H. Clark and Eunsol Choi and Michael Collins and Dan Garrette and Tom Kwiatkowski and Vitaly Nikolaev and Jennimaria Palomaki}, year = {2020}, journal = {Transactions of the Association for Computational Linguistics} } ### 贡献者感谢[@thomwolf](https://github.com/thomwolf)、[@albertvillanova](https://github.com/albertvillanova)、[@lewtun](https://github.com/lewtun)、[@patrickvonplaten](https://github.com/patrickvonplaten)为本数据集的添加工作。

提供机构：

maas

创建时间：

2025-10-10

5,000+

优质数据集

54 个

任务类型

进入经典数据集