xquad_r

Name: xquad_r
Creator: maas
Published: 2025-12-05 16:41:07
License: No description available

Source: ModelScope (魔搭社区). Updated: 2025-12-05. Indexed: 2025-07-12.

Download link:

https://modelscope.cn/datasets/google-research-datasets/xquad_r

Description:

# Dataset Card for XQuAD-R

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [LAReQA](https://github.com/google-research-datasets/lareqa)
- **Repository:** [XQuAD-R](https://github.com/google-research-datasets/lareqa)
- **Paper:** [LAReQA: Language-agnostic answer retrieval from a multilingual pool](https://arxiv.org/pdf/2004.05484.pdf)
- **Point of Contact:** [Noah Constant](mailto:nconstant@google.com)

### Dataset Summary

XQuAD-R is a retrieval version of the XQuAD dataset (a cross-lingual extractive QA dataset). Like XQuAD, XQuAD-R is an 11-way parallel dataset: each question appears in 11 different languages and has 11 parallel correct answers, one per language.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

The dataset is available in the following languages:

* Arabic: `xquad-r/ar.json`
* German: `xquad-r/de.json`
* Greek: `xquad-r/el.json`
* English: `xquad-r/en.json`
* Spanish: `xquad-r/es.json`
* Hindi: `xquad-r/hi.json`
* Russian: `xquad-r/ru.json`
* Thai: `xquad-r/th.json`
* Turkish: `xquad-r/tr.json`
* Vietnamese: `xquad-r/vi.json`
* Chinese: `xquad-r/zh.json`

## Dataset Structure

[More Information Needed]

### Data Instances

An example from the `en` config:

```
{'id': '56beb4343aeaaa14008c925b',
 'context': "The Panthers defense gave up just 308 points, ranking sixth in the league, while also leading the NFL in interceptions with 24 and boasting four Pro Bowl selections. Pro Bowl defensive tackle Kawann Short led the team in sacks with 11, while also forcing three fumbles and recovering two. Fellow lineman Mario Addison added 6½ sacks. The Panthers line also featured veteran defensive end Jared Allen, a 5-time pro bowler who was the NFL's active career sack leader with 136, along with defensive end Kony Ealy, who had 5 sacks in just 9 starts. Behind them, two of the Panthers three starting linebackers were also selected to play in the Pro Bowl: Thomas Davis and Luke Kuechly. Davis compiled 5½ sacks, four forced fumbles, and four interceptions, while Kuechly led the team in tackles (118) forced two fumbles, and intercepted four passes of his own. Carolina's secondary featured Pro Bowl safety Kurt Coleman, who led the team with a career high seven interceptions, while also racking up 88 tackles and Pro Bowl cornerback Josh Norman, who developed into a shutdown corner during the season and had four interceptions, two of which were returned for touchdowns.",
 'question': 'How many points did the Panthers defense surrender?',
 'answers': {'text': ['308'], 'answer_start': [34]}}
```

### Data Fields

- `id` (`str`): Unique ID for the context-question pair.
- `context` (`str`): Context for the question.
- `question` (`str`): Question.
- `answers` (`dict`): Answers with the following keys:
  - `text` (`list` of `str`): Texts of the answers.
  - `answer_start` (`list` of `int`): Start positions for every answer text.

### Data Splits

The number of questions and candidate sentences for each language in XQuAD-R is shown in the table below:

| Language | Questions | Candidates |
|----------|-----------|------------|
| ar       | 1190      | 1222       |
| de       | 1190      | 1276       |
| el       | 1190      | 1234       |
| en       | 1190      | 1180       |
| es       | 1190      | 1215       |
| hi       | 1190      | 1244       |
| ru       | 1190      | 1219       |
| th       | 1190      | 852        |
| tr       | 1190      | 1167       |
| vi       | 1190      | 1209       |
| zh       | 1190      | 1196       |

## Dataset Creation

[More Information Needed]

### Curation Rationale

[More Information Needed]

### Source Data

[More Information Needed]

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

[More Information Needed]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

[More Information Needed]

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

[More Information Needed]

### Dataset Curators

The dataset was initially created by Uma Roy, Noah Constant, Rami Al-Rfou, Aditya Barua, Aaron Phillips and Yinfei Yang, during work done at Google Research.

### Licensing Information

XQuAD-R is distributed under the [CC BY-SA 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/legalcode).

### Citation Information

```
@article{roy2020lareqa,
  title={LAReQA: Language-agnostic answer retrieval from a multilingual pool},
  author={Roy, Uma and Constant, Noah and Al-Rfou, Rami and Barua, Aditya and Phillips, Aaron and Yang, Yinfei},
  journal={arXiv preprint arXiv:2004.05484},
  year={2020}
}
```

### Contributions

Thanks to [@manandey](https://github.com/manandey) for adding this dataset.
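As an illustration of the fields listed under Data Fields, the sketch below checks that each `answer_start` offset points at its answer text inside `context`. It assumes the offsets are character offsets, which is consistent with the `en` example but is not stated in the card. The helper name `answer_spans_match` is hypothetical and not part of any dataset tooling, and the example record is a shortened copy of the `en` instance with the context truncated for brevity.

```python
from typing import Any, Dict


def answer_spans_match(record: Dict[str, Any]) -> bool:
    """Return True if every answer text occurs in the context at its stated offset.

    Assumption: `answer_start` values are character offsets into `context`.
    """
    context = record["context"]
    answers = record["answers"]
    # `text` and `answer_start` are parallel lists: one offset per answer text.
    if len(answers["text"]) != len(answers["answer_start"]):
        return False
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


# Shortened copy of the `en` example; the context is truncated for brevity.
example = {
    "id": "56beb4343aeaaa14008c925b",
    "context": "The Panthers defense gave up just 308 points, ranking sixth in the league.",
    "question": "How many points did the Panthers defense surrender?",
    "answers": {"text": ["308"], "answer_start": [34]},
}

print(answer_spans_match(example))  # True
```

A check like this can be useful after any re-encoding or normalization of the JSON files, since such steps may shift offsets.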


Provider: maas

Created: 2025-07-07
