google-research-datasets/tydiqa | Multilingual question answering dataset | Natural language processing dataset

hugging_face · Updated 2024-08-08 · Indexed 2024-06-15

Multilingual question answering

Natural language processing

Download link:

https://hf-mirror.com/datasets/google-research-datasets/tydiqa

Resource description:

---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
language:
- ar
- bn
- en
- fi
- id
- ja
- ko
- ru
- sw
- te
- th
license:
- apache-2.0
multilinguality:
- multilingual
size_categories:
- unknown
source_datasets:
- extended|wikipedia
task_categories:
- question-answering
task_ids:
- extractive-qa
paperswithcode_id: tydi-qa
pretty_name: TyDi QA
dataset_info:
- config_name: primary_task
  features:
  - name: passage_answer_candidates
    sequence:
    - name: plaintext_start_byte
      dtype: int32
    - name: plaintext_end_byte
      dtype: int32
  - name: question_text
    dtype: string
  - name: document_title
    dtype: string
  - name: language
    dtype: string
  - name: annotations
    sequence:
    - name: passage_answer_candidate_index
      dtype: int32
    - name: minimal_answers_start_byte
      dtype: int32
    - name: minimal_answers_end_byte
      dtype: int32
    - name: yes_no_answer
      dtype: string
  - name: document_plaintext
    dtype: string
  - name: document_url
    dtype: string
  splits:
  - name: train
    num_bytes: 5550573801
    num_examples: 166916
  - name: validation
    num_bytes: 484380347
    num_examples: 18670
  download_size: 2912112378
  dataset_size: 6034954148
- config_name: secondary_task
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: train
    num_bytes: 52948467
    num_examples: 49881
  - name: validation
    num_bytes: 5006433
    num_examples: 5077
  download_size: 29402238
  dataset_size: 57954900
configs:
- config_name: primary_task
  data_files:
  - split: train
    path: primary_task/train-*
  - split: validation
    path: primary_task/validation-*
- config_name: secondary_task
  data_files:
  - split: train
    path: secondary_task/train-*
  - split: validation
    path: secondary_task/validation-*
---

# Dataset Card for "tydiqa"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://github.com/google-research-datasets/tydiqa](https://github.com/google-research-datasets/tydiqa)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 3.91 GB
- **Size of the generated dataset:** 6.10 GB
- **Total amount of disk used:** 10.00 GB

### Dataset Summary

TyDi QA is a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs.
The languages of TyDi QA are diverse with regard to their typology -- the set of linguistic features that each language expresses -- such that we expect models performing well on this set to generalize across a large number of the languages in the world. It contains language phenomena that would not be found in English-only corpora. To provide a realistic information-seeking task and avoid priming effects, questions are written by people who want to know the answer but don’t know it yet (unlike SQuAD and its descendants), and the data is collected directly in each language without the use of translation (unlike MLQA and XQuAD).

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### primary_task

- **Size of downloaded dataset files:** 1.95 GB
- **Size of the generated dataset:** 6.04 GB
- **Total amount of disk used:** 7.99 GB

An example of 'validation' looks as follows.

```
This example was too long and was cropped:

{
    "annotations": {
        "minimal_answers_end_byte": [-1, -1, -1],
        "minimal_answers_start_byte": [-1, -1, -1],
        "passage_answer_candidate_index": [-1, -1, -1],
        "yes_no_answer": ["NONE", "NONE", "NONE"]
    },
    "document_plaintext": "\"\\nรองศาสตราจารย์[1] หม่อมราชวงศ์สุขุมพันธุ์ บริพัตร (22 กันยายน 2495 -) ผู้ว่าราชการกรุงเทพมหานครคนที่ 15 อดีตรองหัวหน้าพรรคปร...",
    "document_title": "หม่อมราชวงศ์สุขุมพันธุ์ บริพัตร",
    "document_url": "\"https://th.wikipedia.org/wiki/%E0%B8%AB%E0%B8%A1%E0%B9%88%E0%B8%AD%E0%B8%A1%E0%B8%A3%E0%B8%B2%E0%B8%8A%E0%B8%A7%E0%B8%87%E0%B8%...",
    "language": "thai",
    "passage_answer_candidates": "{\"plaintext_end_byte\": [494, 1779, 2931, 3904, 4506, 5588, 6383, 7122, 8224, 9375, 10473, 12563, 15134, 17765, 19863, 21902, 229...",
    "question_text": "\"หม่อมราชวงศ์สุขุมพันธุ์ บริพัตร เรียนจบจากที่ไหน ?\"..."
}
```

#### secondary_task

- **Size of downloaded dataset files:** 1.95 GB
- **Size of the generated dataset:** 58.03 MB
- **Total amount of disk used:** 2.01 GB

An example of 'validation' looks as follows.

```
This example was too long and was cropped:

{
    "answers": {
        "answer_start": [394],
        "text": ["بطولتين"]
    },
    "context": "\"أقيمت البطولة 21 مرة، شارك في النهائيات 78 دولة، وعدد الفرق التي فازت بالبطولة حتى الآن 8 فرق، ويعد المنتخب البرازيلي الأكثر تت...",
    "id": "arabic-2387335860751143628-1",
    "question": "\"كم عدد مرات فوز الأوروغواي ببطولة كاس العالم لكرو القدم؟\"...",
    "title": "قائمة نهائيات كأس العالم"
}
```

### Data Fields

The data fields are the same among all splits.

#### primary_task

- `passage_answer_candidates`: a dictionary feature containing:
  - `plaintext_start_byte`: an `int32` feature.
  - `plaintext_end_byte`: an `int32` feature.
- `question_text`: a `string` feature.
- `document_title`: a `string` feature.
- `language`: a `string` feature.
- `annotations`: a dictionary feature containing:
  - `passage_answer_candidate_index`: an `int32` feature.
  - `minimal_answers_start_byte`: an `int32` feature.
  - `minimal_answers_end_byte`: an `int32` feature.
  - `yes_no_answer`: a `string` feature.
- `document_plaintext`: a `string` feature.
- `document_url`: a `string` feature.

#### secondary_task

- `id`: a `string` feature.
- `title`: a `string` feature.
- `context`: a `string` feature.
- `question`: a `string` feature.
- `answers`: a dictionary feature containing:
  - `text`: a `string` feature.
  - `answer_start`: an `int32` feature.

### Data Splits

| name           |  train | validation |
| -------------- | -----: | ---------: |
| primary_task   | 166916 |      18670 |
| secondary_task |  49881 |       5077 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Citation Information

```
@article{tydiqa,
  title   = {TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages},
  author  = {Jonathan H. Clark and Eunsol Choi and Michael Collins and Dan Garrette and Tom Kwiatkowski and Vitaly Nikolaev and Jennimaria Palomaki},
  year    = {2020},
  journal = {Transactions of the Association for Computational Linguistics}
}
```

### Contributions

Thanks to [@thomwolf](https://github.com/thomwolf), [@albertvillanova](https://github.com/albertvillanova), [@lewtun](https://github.com/lewtun), [@patrickvonplaten](https://github.com/patrickvonplaten) for adding this dataset.
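
The `primary_task` fields record positions as byte offsets (`plaintext_start_byte`, `minimal_answers_start_byte`, and so on) rather than character offsets. The sketch below shows one way such an offset pair might be turned back into text. It assumes the offsets index the UTF-8 encoding of `document_plaintext` with an exclusive end, and that `-1` marks a missing answer, as the cropped validation example suggests. The helper name `slice_by_bytes` and the toy document are hypothetical and are not part of the dataset or of any library.

```python
from typing import Optional


def slice_by_bytes(document_plaintext: str, start_byte: int, end_byte: int) -> Optional[str]:
    """Return the text between two byte offsets, or None when no span is annotated."""
    # Assumption: a negative offset (-1) means "no answer", as in the cropped example.
    if start_byte < 0 or end_byte < 0:
        return None
    # Assumption: offsets refer to the UTF-8 bytes of the document, end exclusive.
    raw = document_plaintext.encode("utf-8")
    return raw[start_byte:end_byte].decode("utf-8", errors="replace")


# Toy document: three Thai consonants (3 bytes each in UTF-8), a space, then ASCII.
toy_document = "กขค abc"
thai_span = slice_by_bytes(toy_document, 0, 9)     # the three Thai characters
ascii_span = slice_by_bytes(toy_document, 10, 13)  # the ASCII tail
no_span = slice_by_bytes(toy_document, -1, -1)     # unannotated span
```

Slicing the Python string directly with the same numbers (`toy_document[10:13]`) returns an empty string, because the string is only seven characters long. That is why the encode step matters for non-ASCII text.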

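Provider:

google-research-datasets

Summary of original information

Dataset overview

Basic information

Dataset name: TyDi QA
Annotation creators: crowdsourced
Language creators: crowdsourced
Languages: Arabic, Bengali, English, Finnish, Indonesian, Japanese, Korean, Russian, Swahili, Telugu, Thai
License: Apache 2.0
Multilinguality: multilingual
Source datasets: extended from Wikipedia
Task category: question answering
Task ID: extractive QA
PapersWithCode ID: tydi-qa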

Dataset structure

Configurations

primary_task
- Features:
  - passage_answer_candidates: contains plaintext_start_byte and plaintext_end_byte
  - question_text: string
  - document_title: string
  - language: string
  - annotations: contains passage_answer_candidate_index, minimal_answers_start_byte, minimal_answers_end_byte, yes_no_answer
  - document_plaintext: string
  - document_url: string
- Splits:
  - train: 166916 examples, 5550574617 bytes
  - validation: 18670 examples, 484380443 bytes
- Download size: 1953887429 bytes
- Dataset size: 6034955060 bytes
secondary_task
- Features:
  - id: string
  - title: string
  - context: string
  - question: string
  - answers: contains text and answer_start
- Splits:
  - train: 49881 examples, 52948607 bytes
  - validation: 5077 examples, 5006461 bytes
- Download size: 1953887429 bytes
- Dataset size: 57955068 bytes
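
The per-split byte counts in the configuration listing above can be cross-checked against the reported dataset sizes. This is a small sketch that uses only the numbers from that listing. The dictionary layout and the helper `to_gb` are illustrative, and decimal gigabytes (10^9 bytes) are an assumption.

```python
# Byte counts copied from the configuration listing above.
reported = {
    "primary_task": {"train": 5550574617, "validation": 484380443, "dataset_size": 6034955060},
    "secondary_task": {"train": 52948607, "validation": 5006461, "dataset_size": 57955068},
}
download_size = 1953887429  # the same figure is listed for both configurations


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (assumption: 1 GB = 10**9 bytes)."""
    return num_bytes / 10**9


# Sum the train and validation splits for each configuration.
split_totals = {name: sizes["train"] + sizes["validation"] for name, sizes in reported.items()}

for name, sizes in reported.items():
    matches = split_totals[name] == sizes["dataset_size"]
    print(f"{name}: splits sum to {split_totals[name]} bytes, matches dataset size: {matches}")
print(f"download size: {to_gb(download_size):.2f} GB")
```

With these figures, each configuration's train and validation splits add up exactly to its listed dataset size.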

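AI-compiled summary

Dataset introduction

Construction

TyDi QA was built through crowdsourcing and covers 11 languages: Arabic, Bengali, English, Finnish, Indonesian, Japanese, Korean, Russian, Swahili, Telugu, and Thai. It is intended as a multilingual question answering benchmark that avoids the bias introduced by translation and keeps questions and answers authentic. The source material is extended from Wikipedia, and the data is collected directly in each language rather than translated, so that it captures phenomena specific to each language.

Features

The main features of TyDi QA are its multilinguality and its typological diversity. It contains 204K question-answer pairs across 11 languages that differ markedly in their linguistic features. The dataset is designed to test how well models generalize across languages, particularly on phenomena that do not appear in English-only corpora. Because nothing is translated, the questions reflect what speakers of each language actually ask.

Usage

TyDi QA is suited to natural language processing tasks, question answering in particular. Users can load it with the Hugging Face datasets library and choose a configuration (primary_task or secondary_task). The fields include the question text, document title, language identifier, and the passage answer candidates with their positions. These fields can be used to train and validate models for multilingual question answering.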
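
As a rough illustration of the loading step described above, the sketch below wraps `datasets.load_dataset` for the `secondary_task` validation split and counts examples per language. The helper names (`language_of`, `count_languages`, `load_secondary_validation`) are hypothetical. Reading the language from the `id` prefix is an assumption based on the single example id shown in the dataset card (`arabic-2387335860751143628-1`) and may not hold for every row.

```python
from collections import Counter
from typing import Dict, Iterable


def language_of(example_id: str) -> str:
    """Guess the language from a secondary_task id.

    Assumption: the text before the first hyphen names the language,
    as in the example id 'arabic-2387335860751143628-1'.
    """
    return example_id.split("-", 1)[0]


def count_languages(examples: Iterable[Dict]) -> Counter:
    """Count examples per language prefix."""
    return Counter(language_of(example["id"]) for example in examples)


def load_secondary_validation():
    """Load the secondary_task validation split.

    Needs network access and the `datasets` package (pip install datasets).
    """
    # Imported lazily so the helpers above can be used without the package installed.
    from datasets import load_dataset

    return load_dataset("google-research-datasets/tydiqa", "secondary_task", split="validation")


# Usage (not run here because it downloads data):
#   validation = load_secondary_validation()
#   print(count_languages(validation).most_common())
```

Passing `"primary_task"` as the configuration name would load the larger configuration instead; its rows carry an explicit `language` field, so the id-prefix heuristic is not needed there.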

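Background and challenges

Background

TyDi QA was created by the Google Research team to advance research on multilingual question answering. It covers 11 typologically diverse languages (Arabic, Bengali, English, Finnish, Indonesian, Japanese, Korean, Russian, Swahili, Telugu, and Thai) and contains 204K question-answer pairs. The core research question is how to build effective information-seeking question answering systems in a multilingual setting, and in particular how to handle complex linguistic phenomena in languages other than English. The dataset aims to provide a realistic information-seeking task, avoid translation bias, and encourage models that generalize across languages. Its release gave researchers a benchmark for evaluating and improving multilingual question answering systems.

Current challenges

The main challenges associated with TyDi QA are: 1) the complexity of the multilingual setting, since grammar, vocabulary, and expression differ widely between languages and make cross-lingual transfer difficult; 2) the difficulty of collecting and annotating data, since quality and consistency are hard to guarantee across many languages; 3) the diversity of linguistic phenomena, since some phenomena exist in only one language, which makes them harder for models to understand and process. Building the dataset also required dealing with uneven language resources so that the quantity and quality of data in each language meet research needs. These challenges leave considerable room for future work on multilingual question answering.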

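Common scenarios

Typical use

TyDi QA is typically used for multilingual question answering. It covers 11 languages and is intended to evaluate how well models answer questions in different languages. By providing question-answer pairs in many languages, it lets researchers develop and test question answering systems that generalize across languages, with a focus on performance in languages other than English.

Academic problems addressed

TyDi QA addresses a key academic problem in multilingual question answering: question answering ability in languages other than English. Its multilingual question-answer pairs help researchers evaluate and improve how well models generalize across languages.

Derived work

The release of TyDi QA has prompted related research, particularly on multilingual question answering and cross-lingual generalization. Researchers have used the dataset to develop new models and methods aimed at improving the accuracy and efficiency of multilingual question answering.

The content above was collected and summarized by AI.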
