paws-x

Name: paws-x
Creator: maas
Published: 2026-01-06 16:38:05
License: No description provided

ModelScope community: updated 2026-01-06, indexed 2025-07-12

Download link:

https://modelscope.cn/datasets/google-research-datasets/paws-x


Resource description:

# Dataset Card for PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [PAWS-X](https://github.com/google-research-datasets/paws/tree/master/pawsx)
- **Repository:** [PAWS-X](https://github.com/google-research-datasets/paws/tree/master/pawsx)
- **Paper:** [PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification](https://arxiv.org/abs/1908.11828)
- **Point of Contact:** [Yinfei Yang](mailto:yinfeiy@google.com)

### Dataset Summary

This dataset contains 23,659 **human** translated PAWS evaluation pairs and 296,406 **machine** translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in [PAWS-Wiki](https://github.com/google-research-datasets/paws#paws-wiki).

For further details, see the accompanying paper: [PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification](https://arxiv.org/abs/1908.11828)

### Supported Tasks and Leaderboards

The dataset has mainly been used for paraphrase identification in English and six other languages: French, Spanish, German, Chinese, Japanese, and Korean.

### Languages

The dataset is in English, French, Spanish, German, Chinese, Japanese, and Korean.

## Dataset Structure

### Data Instances

For en:

```
id : 1
sentence1 : In Paris , in October 1560 , he secretly met the English ambassador , Nicolas Throckmorton , asking him for a passport to return to England through Scotland .
sentence2 : In October 1560 , he secretly met with the English ambassador , Nicolas Throckmorton , in Paris , and asked him for a passport to return to Scotland through England .
label : 0
```

For fr:

```
id : 1
sentence1 : À Paris, en octobre 1560, il rencontra secrètement l'ambassadeur d'Angleterre, Nicolas Throckmorton, lui demandant un passeport pour retourner en Angleterre en passant par l'Écosse.
sentence2 : En octobre 1560, il rencontra secrètement l'ambassadeur d'Angleterre, Nicolas Throckmorton, à Paris, et lui demanda un passeport pour retourner en Écosse par l'Angleterre.
label : 0
```

### Data Fields

All files are in TSV format with four columns:

Column Name | Data
:---------- | :--------------------------------------------------------
id | An ID that matches the ID of the source pair in PAWS-Wiki
sentence1 | The first sentence
sentence2 | The second sentence
label | Label for each pair

The source text of each translation can be retrieved by looking up the ID in the corresponding file in PAWS-Wiki.
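As an illustration of the four-column layout described under Data Fields, the following is a minimal sketch of reading such a file with Python's standard `csv` module. The in-memory sample rows are hypothetical, the presence of a header row is an assumption, and so is the reading of `label` as 1 for a paraphrase and 0 for a non-paraphrase (the convention used by PAWS); the card itself only says "Label for each pair".

```python
import csv
import io

# Hypothetical two-row sample in the four-column layout described above.
# Assumption: the file starts with a header row naming the columns.
SAMPLE_TSV = (
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tIn Paris , he met the ambassador .\tHe met the ambassador in Paris .\t1\n"
    "2\tHe returned to England through Scotland .\t"
    "He returned to Scotland through England .\t0\n"
)


def read_pairs(handle):
    """Yield one dict per row, converting id and label to int."""
    # QUOTE_NONE keeps quote characters inside sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {
            "id": int(row["id"]),
            "sentence1": row["sentence1"],
            "sentence2": row["sentence2"],
            # Assumption: 1 = paraphrase, 0 = non-paraphrase.
            "label": int(row["label"]),
        }


pairs = list(read_pairs(io.StringIO(SAMPLE_TSV)))
for pair in pairs:
    print(pair["id"], pair["label"], pair["sentence1"])
```

To read a real split, the same function could be given an open file handle, for example `open(path, encoding="utf-8", newline="")`, where `path` points to a downloaded TSV file.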
### Data Splits

The number of examples for each of the seven languages is shown below:

Language | Train | Dev | Test
:------- | ------: | -----: | -----:
en | 49,401 | 2,000 | 2,000
fr | 49,401 | 2,000 | 2,000
es | 49,401 | 2,000 | 2,000
de | 49,401 | 2,000 | 2,000
zh | 49,401 | 2,000 | 2,000
ja | 49,401 | 2,000 | 2,000
ko | 49,401 | 2,000 | 2,000

> **Caveat**: the dev and test sets of PAWS-X are both sourced from the dev
> set of PAWS-Wiki. As a consequence, the same `sentence1` may appear in both
> the dev and test sets. Nevertheless, the data split guarantees that there is
> no overlap on sentence pairs (`sentence1` + `sentence2`) between dev and
> test.

## Dataset Creation

### Curation Rationale

Most existing work on adversarial data generation focuses on English. For example, PAWS (Paraphrase Adversaries from Word Scrambling) (Zhang et al., 2019) consists of challenging English paraphrase identification pairs from Wikipedia and Quora. The authors remedy this gap with PAWS-X, a new dataset of 23,659 human-translated PAWS evaluation pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. They provide baseline numbers for three models that differ in their capacity to capture non-local context and sentence structure, under different multilingual training and evaluation regimes. Multilingual BERT (Devlin et al., 2019) fine-tuned on PAWS English plus machine-translated data performs best, with a range of 83.1-90.8 accuracy across the non-English languages and an average accuracy gain of 23% over the next best model. PAWS-X shows the effectiveness of deep, multilingual pre-training while also leaving considerable headroom as a new challenge to drive multilingual research that better captures structure and contextual information.
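The figures and the caveat given under Data Splits can be checked with a short sketch. The first part confirms that six non-English training sets of 49,401 pairs add up to the 296,406 machine-translated pairs quoted in the summary. The second part shows how one might test the dev/test guarantee; the function name `overlap_report` and the toy pairs are hypothetical and stand in for rows read from the real files.

```python
# Six non-English languages, each with 49,401 machine-translated training pairs.
machine_translated_total = 6 * 49_401
print(machine_translated_total)


def overlap_report(dev, test):
    """Return (shared sentence1 count, shared full-pair count).

    Both arguments are lists of (sentence1, sentence2) tuples.
    """
    dev_first = {s1 for s1, _ in dev}
    test_first = {s1 for s1, _ in test}
    shared_first = dev_first & test_first
    shared_pairs = set(dev) & set(test)
    return len(shared_first), len(shared_pairs)


# Hypothetical toy data: one sentence1 is reused, but no full pair repeats.
dev = [("a b c", "c b a"), ("d e f", "f e d")]
test = [("a b c", "a c b"), ("g h i", "i h g")]

shared_s1, shared_pairs = overlap_report(dev, test)
print(shared_s1, shared_pairs)
```

On real data, a non-zero first count would be expected from the caveat, while a non-zero second count would contradict the stated guarantee.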
### Source Data

PAWS (Paraphrase Adversaries from Word Scrambling)

#### Initial Data Collection and Normalization

All translated pairs are sourced from examples in [PAWS-Wiki](https://github.com/google-research-datasets/paws#paws-wiki).

#### Who are the source language producers?

This dataset contains 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean.

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

The paper acknowledges the translation team, especially Mengmeng Niu, for help with the annotations.

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

The dataset may be freely used for any purpose, although acknowledgement of Google LLC ("Google") as the data source would be appreciated. The dataset is provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

### Citation Information

```
@InProceedings{pawsx2019emnlp,
  title = {{PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification}},
  author = {Yang, Yinfei and Zhang, Yuan and Tar, Chris and Baldridge, Jason},
  booktitle = {Proc. of EMNLP},
  year = {2019}
}
```

### Contributions

Thanks to [@bhavitvyamalik](https://github.com/bhavitvyamalik) and [@gowtham1997](https://github.com/gowtham1997) for adding this dataset.


Provider:

maas

Created:

2025-07-07
