
google-research-datasets/xquad_r

Hugging Face · Updated 2024-01-04 · Indexed 2024-06-15
Download link:
https://hf-mirror.com/datasets/google-research-datasets/xquad_r
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- ar
- de
- el
- en
- es
- hi
- ru
- th
- tr
- vi
- zh
license:
- cc-by-sa-4.0
multilinguality:
- multilingual
size_categories:
- 1K<n<10K
source_datasets:
- extended|squad
- extended|xquad
task_categories:
- question-answering
task_ids:
- extractive-qa
paperswithcode_id: xquad-r
pretty_name: LAReQA
config_names:
- ar
- de
- el
- en
- es
- hi
- ru
- th
- tr
- vi
- zh
dataset_info:
- config_name: ar
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 1722775
    num_examples: 1190
  download_size: 263002
  dataset_size: 1722775
- config_name: de
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 1283277
    num_examples: 1190
  download_size: 241957
  dataset_size: 1283277
- config_name: el
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 2206666
    num_examples: 1190
  download_size: 324379
  dataset_size: 2206666
- config_name: en
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 1116099
    num_examples: 1190
  download_size: 212372
  dataset_size: 1116099
- config_name: es
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 1273475
    num_examples: 1190
  download_size: 236874
  dataset_size: 1273475
- config_name: hi
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 2682951
    num_examples: 1190
  download_size: 322083
  dataset_size: 2682951
- config_name: ru
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 2136966
    num_examples: 1190
  download_size: 321728
  dataset_size: 2136966
- config_name: th
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 2854935
    num_examples: 1190
  download_size: 337307
  dataset_size: 2854935
- config_name: tr
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 1210739
    num_examples: 1190
  download_size: 228364
  dataset_size: 1210739
- config_name: vi
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 1477215
    num_examples: 1190
  download_size: 237644
  dataset_size: 1477215
- config_name: zh
  features:
  - name: id
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: validation
    num_bytes: 984217
    num_examples: 1190
  download_size: 205768
  dataset_size: 984217
configs:
- config_name: ar
  data_files:
  - split: validation
    path: ar/validation-*
- config_name: de
  data_files:
  - split: validation
    path: de/validation-*
- config_name: el
  data_files:
  - split: validation
    path: el/validation-*
- config_name: en
  data_files:
  - split: validation
    path: en/validation-*
- config_name: es
  data_files:
  - split: validation
    path: es/validation-*
- config_name: hi
  data_files:
  - split: validation
    path: hi/validation-*
- config_name: ru
  data_files:
  - split: validation
    path: ru/validation-*
- config_name: th
  data_files:
  - split: validation
    path: th/validation-*
- config_name: tr
  data_files:
  - split: validation
    path: tr/validation-*
- config_name: vi
  data_files:
  - split: validation
    path: vi/validation-*
- config_name: zh
  data_files:
  - split: validation
    path: zh/validation-*
---

# Dataset Card for XQuAD-R

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [LAReQA](https://github.com/google-research-datasets/lareqa)
- **Repository:** [XQuAD-R](https://github.com/google-research-datasets/lareqa)
- **Paper:** [LAReQA: Language-agnostic answer retrieval from a multilingual pool](https://arxiv.org/pdf/2004.05484.pdf)
- **Point of Contact:** [Noah Constant](mailto:nconstant@google.com)

### Dataset Summary

XQuAD-R is a retrieval version of the XQuAD dataset (a cross-lingual extractive QA dataset). Like XQuAD, XQuAD-R is an 11-way parallel dataset, where each question appears in 11 different languages and has 11 parallel correct answers across the languages.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

The dataset is available in the following languages:

* Arabic: `xquad-r/ar.json`
* German: `xquad-r/de.json`
* Greek: `xquad-r/el.json`
* English: `xquad-r/en.json`
* Spanish: `xquad-r/es.json`
* Hindi: `xquad-r/hi.json`
* Russian: `xquad-r/ru.json`
* Thai: `xquad-r/th.json`
* Turkish: `xquad-r/tr.json`
* Vietnamese: `xquad-r/vi.json`
* Chinese: `xquad-r/zh.json`

## Dataset Structure

[More Information Needed]

### Data Instances

An example from the `en` config:

```
{'id': '56beb4343aeaaa14008c925b',
 'context': "The Panthers defense gave up just 308 points, ranking sixth in the league, while also leading the NFL in interceptions with 24 and boasting four Pro Bowl selections. Pro Bowl defensive tackle Kawann Short led the team in sacks with 11, while also forcing three fumbles and recovering two. Fellow lineman Mario Addison added 6½ sacks. The Panthers line also featured veteran defensive end Jared Allen, a 5-time pro bowler who was the NFL's active career sack leader with 136, along with defensive end Kony Ealy, who had 5 sacks in just 9 starts. Behind them, two of the Panthers three starting linebackers were also selected to play in the Pro Bowl: Thomas Davis and Luke Kuechly. Davis compiled 5½ sacks, four forced fumbles, and four interceptions, while Kuechly led the team in tackles (118) forced two fumbles, and intercepted four passes of his own. Carolina's secondary featured Pro Bowl safety Kurt Coleman, who led the team with a career high seven interceptions, while also racking up 88 tackles and Pro Bowl cornerback Josh Norman, who developed into a shutdown corner during the season and had four interceptions, two of which were returned for touchdowns.",
 'question': 'How many points did the Panthers defense surrender?',
 'answers': {'text': ['308'], 'answer_start': [34]}}
```

### Data Fields

- `id` (`str`): Unique ID for the context-question pair.
- `context` (`str`): Context for the question.
- `question` (`str`): Question.
- `answers` (`dict`): Answers with the following keys:
  - `text` (`list` of `str`): Texts of the answers.
  - `answer_start` (`list` of `int`): Start positions for every answer text.

### Data Splits

The number of questions and candidate sentences for each language in XQuAD-R is shown in the table below:

| Language | Questions | Candidates |
|----------|-----------|------------|
| ar       | 1190      | 1222       |
| de       | 1190      | 1276       |
| el       | 1190      | 1234       |
| en       | 1190      | 1180       |
| es       | 1190      | 1215       |
| hi       | 1190      | 1244       |
| ru       | 1190      | 1219       |
| th       | 1190      | 852        |
| tr       | 1190      | 1167       |
| vi       | 1190      | 1209       |
| zh       | 1190      | 1196       |

## Dataset Creation

[More Information Needed]

### Curation Rationale

[More Information Needed]

### Source Data

[More Information Needed]

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

[More Information Needed]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

[More Information Needed]

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

[More Information Needed]

### Dataset Curators

The dataset was initially created by Uma Roy, Noah Constant, Rami Al-Rfou, Aditya Barua, Aaron Phillips and Yinfei Yang, during work done at Google Research.

### Licensing Information

XQuAD-R is distributed under the [CC BY-SA 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/legalcode).

### Citation Information

```
@article{roy2020lareqa,
  title={LAReQA: Language-agnostic answer retrieval from a multilingual pool},
  author={Roy, Uma and Constant, Noah and Al-Rfou, Rami and Barua, Aditya and Phillips, Aaron and Yang, Yinfei},
  journal={arXiv preprint arXiv:2004.05484},
  year={2020}
}
```

### Contributions

Thanks to [@manandey](https://github.com/manandey) for adding this dataset.
Provider:
google-research-datasets
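Summary of original information

Dataset card

Dataset description

Dataset summary

XQuAD-R is a retrieval version of the XQuAD dataset (a cross-lingual extractive question answering dataset). Like XQuAD, XQuAD-R is an 11-way parallel dataset: each question appears in 11 different languages and has 11 parallel correct answers across those languages.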
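
To make the "11-way parallel" structure concrete, the following is a minimal sketch that groups the translations of one question across language configs. It assumes that a question keeps the same `id` in every language config, which this page does not state explicitly; the toy rows and the helper name `group_parallel_questions` are hypothetical and only stand in for real dataset rows.

```python
from collections import defaultdict

# Hypothetical toy rows standing in for three of the eleven language configs.
# ASSUMPTION: a question keeps the same `id` in every language config.
toy_rows = {
    "en": [{"id": "q1", "question": "How many points did the Panthers defense surrender?"}],
    "de": [{"id": "q1", "question": "<German version of q1>"}],
    "es": [{"id": "q1", "question": "<Spanish version of q1>"}],
}


def group_parallel_questions(rows_by_lang):
    """Map each question id to a {language: question text} dictionary."""
    grouped = defaultdict(dict)
    for lang, rows in rows_by_lang.items():
        for row in rows:
            grouped[row["id"]][lang] = row["question"]
    return dict(grouped)


parallel = group_parallel_questions(toy_rows)
```

With all eleven configs loaded, each entry of `parallel` would be expected to hold eleven language keys if the shared-id assumption holds.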

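Supported tasks and leaderboards

[More information needed]

Languages

The dataset covers the following languages:

  • Arabic
  • German
  • Greek
  • English
  • Spanish
  • Hindi
  • Russian
  • Thai
  • Turkish
  • Vietnamese
  • Chinese

Dataset structure

Data instances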

The following is an example from the en config:

```json
{
  "id": "56beb4343aeaaa14008c925b",
  "context": "The Panthers defense gave up just 308 points, ranking sixth in the league, while also leading the NFL in interceptions with 24 and boasting four Pro Bowl selections...",
  "question": "How many points did the Panthers defense surrender?",
  "answers": {
    "text": ["308"],
    "answer_start": [34]
  }
}
```
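
A record like the one above can be fetched with the Hugging Face `datasets` library. The sketch below is one possible way to do it, not an official loader: it assumes the `datasets` package is installed and that the repository id `google-research-datasets/xquad_r` from this page is reachable; the wrapper name `load_xquad_r` is hypothetical.

```python
# Language configs listed in the dataset card metadata.
XQUAD_R_CONFIGS = ("ar", "de", "el", "en", "es", "hi", "ru", "th", "tr", "vi", "zh")


def load_xquad_r(lang):
    """Load the validation split of one XQuAD-R language config.

    Hypothetical convenience wrapper; needs network access when called.
    """
    if lang not in XQUAD_R_CONFIGS:
        raise ValueError(f"unknown config {lang!r}; expected one of {XQUAD_R_CONFIGS}")
    # Imported lazily so the config check works even without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("google-research-datasets/xquad_r", lang, split="validation")
```

Calling `load_xquad_r("en")` should return a dataset whose rows have the `id`, `context`, `question`, and `answers` fields shown in the example; the card lists only a validation split for each config.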

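Data fields

  • id (str): Unique ID of the context-question pair.
  • context (str): Context for the question.
  • question (str): The question.
  • answers (dict): Answers, with the following keys:
    • text (list of str): Texts of the answers.
    • answer_start (list of int): Start position of each answer text.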
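
The relationship between `text` and `answer_start` can be checked directly. The sketch below assumes that `answer_start` is a zero-based character offset into `context` (the SQuAD convention, and consistent with the English example on this page); the helper name `answer_spans` is hypothetical, and the context is truncated for brevity.

```python
def answer_spans(record):
    """Return (start, end, text) character spans for every answer in a record.

    ASSUMPTION: `answer_start` is a zero-based character offset into `context`.
    """
    context = record["context"]
    answers = record["answers"]
    spans = []
    for text, start in zip(answers["text"], answers["answer_start"]):
        end = start + len(text)
        if context[start:end] != text:
            raise ValueError(f"answer {text!r} not found at offset {start}")
        spans.append((start, end, text))
    return spans


# The English example from this page, with the context truncated.
record = {
    "id": "56beb4343aeaaa14008c925b",
    "context": "The Panthers defense gave up just 308 points, ranking sixth in the league",
    "question": "How many points did the Panthers defense surrender?",
    "answers": {"text": ["308"], "answer_start": [34]},
}

spans = answer_spans(record)  # [(34, 37, '308')]
```

Raising on a mismatch, rather than silently skipping, makes offset errors visible early, for example after text normalization that shifts character positions.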

Data splits

The number of question-answer pairs for each language in XQuAD-R is as follows:

Language Questions
ar 1190
de 1190
el 1190
en 1190
es 1190
hi 1190
ru 1190
th 1190
tr 1190
vi 1190
zh 1190
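
As a quick arithmetic check on the size of the dataset, the per-language counts can be summed. The question counts are the ones listed above; the candidate-sentence counts are copied from the Data Splits table of the dataset card on this page. The variable names are illustrative only.

```python
# Question counts per language (identical for every config).
questions = {
    lang: 1190
    for lang in ("ar", "de", "el", "en", "es", "hi", "ru", "th", "tr", "vi", "zh")
}

# Candidate-sentence counts per language, from the card's Data Splits table.
candidates = {
    "ar": 1222, "de": 1276, "el": 1234, "en": 1180, "es": 1215, "hi": 1244,
    "ru": 1219, "th": 852, "tr": 1167, "vi": 1209, "zh": 1196,
}

total_questions = sum(questions.values())    # 11 * 1190 = 13090
total_candidates = sum(candidates.values())  # 13014
```

Pooling all languages therefore gives 13,090 questions and 13,014 candidate sentences.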

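Dataset creation

Curation rationale

[More information needed]

Source data

[More information needed]

Annotations

[More information needed]

Considerations for using the data

Social impact of the dataset

[More information needed]

Discussion of biases

[More information needed]

Other known limitations

[More information needed]

Additional information

Dataset curators

The dataset was initially created by Uma Roy, Noah Constant, Rami Al-Rfou, Aditya Barua, Aaron Phillips, and Yinfei Yang during work done at Google Research.

Licensing information

The XQuAD-R dataset is released under the CC BY-SA 4.0 license.

Citation information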

```bibtex
@article{roy2020lareqa,
  title={LAReQA: Language-agnostic answer retrieval from a multilingual pool},
  author={Roy, Uma and Constant, Noah and Al-Rfou, Rami and Barua, Aditya and Phillips, Aaron and Yang, Yinfei},
  journal={arXiv preprint arXiv:2004.05484},
  year={2020}
}
```

Contributions

Thanks to @manandey for adding this dataset.
