google-research-datasets/paws-x

Name: google-research-datasets/paws-x
Creator: google-research-datasets
Published: 2024-01-04 16:17:17
License: No description provided

Hugging Face | Updated 2024-01-04 | Indexed 2024-06-15

Download link:

https://hf-mirror.com/datasets/google-research-datasets/paws-x

Resource description:

---
annotations_creators:
- expert-generated
- machine-generated
language_creators:
- expert-generated
- machine-generated
language:
- de
- en
- es
- fr
- ja
- ko
- zh
license:
- other
multilinguality:
- multilingual
size_categories:
- 10K<n<100K
source_datasets:
- extended|other-paws
task_categories:
- text-classification
task_ids:
- semantic-similarity-classification
- semantic-similarity-scoring
- text-scoring
- multi-input-text-classification
paperswithcode_id: paws-x
pretty_name: 'PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification'
tags:
- paraphrase-identification
dataset_info:
- config_name: de
  features:
  - name: id
    dtype: int32
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': '0'
          '1': '1'
  splits:
  - name: train
    num_bytes: 12801784
    num_examples: 49401
  - name: test
    num_bytes: 524206
    num_examples: 2000
  - name: validation
    num_bytes: 514001
    num_examples: 2000
  download_size: 9601920
  dataset_size: 13839991
- config_name: en
  features:
  - name: id
    dtype: int32
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': '0'
          '1': '1'
  splits:
  - name: train
    num_bytes: 12215913
    num_examples: 49401
  - name: test
    num_bytes: 494726
    num_examples: 2000
  - name: validation
    num_bytes: 492279
    num_examples: 2000
  download_size: 9045005
  dataset_size: 13202918
- config_name: es
  features:
  - name: id
    dtype: int32
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': '0'
          '1': '1'
  splits:
  - name: train
    num_bytes: 12808446
    num_examples: 49401
  - name: test
    num_bytes: 519103
    num_examples: 2000
  - name: validation
    num_bytes: 513880
    num_examples: 2000
  download_size: 9538815
  dataset_size: 13841429
- config_name: fr
  features:
  - name: id
    dtype: int32
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': '0'
          '1': '1'
  splits:
  - name: train
    num_bytes: 13295557
    num_examples: 49401
  - name: test
    num_bytes: 535093
    num_examples: 2000
  - name: validation
    num_bytes: 533023
    num_examples: 2000
  download_size: 9785410
  dataset_size: 14363673
- config_name: ja
  features:
  - name: id
    dtype: int32
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': '0'
          '1': '1'
  splits:
  - name: train
    num_bytes: 15041592
    num_examples: 49401
  - name: test
    num_bytes: 668628
    num_examples: 2000
  - name: validation
    num_bytes: 661770
    num_examples: 2000
  download_size: 10435711
  dataset_size: 16371990
- config_name: ko
  features:
  - name: id
    dtype: int32
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': '0'
          '1': '1'
  splits:
  - name: train
    num_bytes: 13934181
    num_examples: 49401
  - name: test
    num_bytes: 562292
    num_examples: 2000
  - name: validation
    num_bytes: 554867
    num_examples: 2000
  download_size: 10263972
  dataset_size: 15051340
- config_name: zh
  features:
  - name: id
    dtype: int32
  - name: sentence1
    dtype: string
  - name: sentence2
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': '0'
          '1': '1'
  splits:
  - name: train
    num_bytes: 10815459
    num_examples: 49401
  - name: test
    num_bytes: 474636
    num_examples: 2000
  - name: validation
    num_bytes: 473110
    num_examples: 2000
  download_size: 9178953
  dataset_size: 11763205
configs:
- config_name: de
  data_files:
  - split: train
    path: de/train-*
  - split: test
    path: de/test-*
  - split: validation
    path: de/validation-*
- config_name: en
  data_files:
  - split: train
    path: en/train-*
  - split: test
    path: en/test-*
  - split: validation
    path: en/validation-*
- config_name: es
  data_files:
  - split: train
    path: es/train-*
  - split: test
    path: es/test-*
  - split: validation
    path: es/validation-*
- config_name: fr
  data_files:
  - split: train
    path: fr/train-*
  - split: test
    path: fr/test-*
  - split: validation
    path: fr/validation-*
- config_name: ja
  data_files:
  - split: train
    path: ja/train-*
  - split: test
    path: ja/test-*
  - split: validation
    path: ja/validation-*
- config_name: ko
  data_files:
  - split: train
    path: ko/train-*
  - split: test
    path: ko/test-*
  - split: validation
    path: ko/validation-*
- config_name: zh
  data_files:
  - split: train
    path: zh/train-*
  - split: test
    path: zh/test-*
  - split: validation
    path: zh/validation-*
---

# Dataset Card for PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [PAWS-X](https://github.com/google-research-datasets/paws/tree/master/pawsx)
- **Repository:** [PAWS-X](https://github.com/google-research-datasets/paws/tree/master/pawsx)
- **Paper:** [PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification](https://arxiv.org/abs/1908.11828)
- **Point of Contact:** [Yinfei Yang](mailto:yinfeiy@google.com)

### Dataset Summary

This dataset contains 23,659 **human** translated PAWS evaluation pairs and 296,406 **machine** translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in [PAWS-Wiki](https://github.com/google-research-datasets/paws#paws-wiki).

For further details, see the accompanying paper: [PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification](https://arxiv.org/abs/1908.11828)

### Supported Tasks and Leaderboards

The dataset has mainly been used for paraphrase identification in English and six other languages: French, Spanish, German, Chinese, Japanese, and Korean.

### Languages

The dataset is in English, French, Spanish, German, Chinese, Japanese, and Korean.

## Dataset Structure

### Data Instances

For en:
```
id : 1
sentence1 : In Paris , in October 1560 , he secretly met the English ambassador , Nicolas Throckmorton , asking him for a passport to return to England through Scotland .
sentence2 : In October 1560 , he secretly met with the English ambassador , Nicolas Throckmorton , in Paris , and asked him for a passport to return to Scotland through England .
label : 0
```

For fr:
```
id : 1
sentence1 : À Paris, en octobre 1560, il rencontra secrètement l'ambassadeur d'Angleterre, Nicolas Throckmorton, lui demandant un passeport pour retourner en Angleterre en passant par l'Écosse.
sentence2 : En octobre 1560, il rencontra secrètement l'ambassadeur d'Angleterre, Nicolas Throckmorton, à Paris, et lui demanda un passeport pour retourner en Écosse par l'Angleterre.
label : 0
```

### Data Fields

All files are in tsv format with four columns:

Column Name | Data
:---------- | :--------------------------------------------------------
id          | An ID that matches the ID of the source pair in PAWS-Wiki
sentence1   | The first sentence
sentence2   | The second sentence
label       | Label for each pair

The source text of each translation can be retrieved by looking up the ID in the corresponding file in PAWS-Wiki.

### Data Splits

The numbers of examples for each of the seven languages are shown below:

Language | Train   | Dev    | Test
:------- | ------: | -----: | -----:
en       | 49,401  | 2,000  | 2,000
fr       | 49,401  | 2,000  | 2,000
es       | 49,401  | 2,000  | 2,000
de       | 49,401  | 2,000  | 2,000
zh       | 49,401  | 2,000  | 2,000
ja       | 49,401  | 2,000  | 2,000
ko       | 49,401  | 2,000  | 2,000

> **Caveat**: please note that the dev and test sets of PAWS-X are both sourced
> from the dev set of PAWS-Wiki. As a consequence, the same `sentence 1` may
> appear in both the dev and test sets. Nevertheless our data split guarantees
> that there is no overlap on sentence pairs (`sentence 1` + `sentence 2`)
> between dev and test.

## Dataset Creation

### Curation Rationale

Most existing work on adversarial data generation focuses on English. For example, PAWS (Paraphrase Adversaries from Word Scrambling) (Zhang et al., 2019) consists of challenging English paraphrase identification pairs from Wikipedia and Quora. The authors remedy this gap with PAWS-X, a new dataset of 23,659 human translated PAWS evaluation pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. They provide baseline numbers for three models with different capacity to capture non-local context and sentence structure, using different multilingual training and evaluation regimes. Multilingual BERT (Devlin et al., 2019) fine-tuned on PAWS English plus machine-translated data performs the best, with a range of 83.1-90.8 accuracy across the non-English languages and an average accuracy gain of 23% over the next best model. PAWS-X shows the effectiveness of deep, multilingual pre-training while also leaving considerable headroom as a new challenge to drive multilingual research that better captures structure and contextual information.

### Source Data

PAWS (Paraphrase Adversaries from Word Scrambling)

#### Initial Data Collection and Normalization

All translated pairs are sourced from examples in [PAWS-Wiki](https://github.com/google-research-datasets/paws#paws-wiki)

#### Who are the source language producers?

This dataset contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean.

### Annotations

#### Annotation process

If applicable, describe the annotation process and any tools used, or state otherwise. Describe the amount of data annotated, if not all. Describe or reference annotation guidelines provided to the annotators. If available, provide interannotator statistics. Describe any annotation validation processes.

#### Who are the annotators?

The paper mentions the translate team, especially Mengmeng Niu, for the help with the annotations.

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

List the people involved in collecting the dataset and their affiliation(s). If funding information is known, include it here.

### Licensing Information

The dataset may be freely used for any purpose, although acknowledgement of Google LLC ("Google") as the data source would be appreciated. The dataset is provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

### Citation Information

```
@InProceedings{pawsx2019emnlp,
  title = {{PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification}},
  author = {Yang, Yinfei and Zhang, Yuan and Tar, Chris and Baldridge, Jason},
  booktitle = {Proc. of EMNLP},
  year = {2019}
}
```

### Contributions

Thanks to [@bhavitvyamalik](https://github.com/bhavitvyamalik), [@gowtham1997](https://github.com/gowtham1997) for adding this dataset.

Provider:

google-research-datasets

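Summary of Original Information

Dataset Card: PAWS-X, a Cross-lingual Adversarial Dataset for Paraphrase Identification

Dataset Description

Dataset Summary

The PAWS-X dataset contains 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs in six languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki.

Supported Tasks and Leaderboards

The dataset is mainly used for paraphrase identification in English and six other languages (French, Spanish, German, Chinese, Japanese, and Korean).

Languages

The dataset covers English, French, Spanish, German, Chinese, Japanese, and Korean.

Dataset Structure

Data Instances

For English (en):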

id : 1
sentence1 : In Paris , in October 1560 , he secretly met the English ambassador , Nicolas Throckmorton , asking him for a passport to return to England through Scotland .
sentence2 : In October 1560 , he secretly met with the English ambassador , Nicolas Throckmorton , in Paris , and asked him for a passport to return to Scotland through England .
label : 0

For French (fr):

id : 1
sentence1 : À Paris, en octobre 1560, il rencontra secrètement l'ambassadeur d'Angleterre, Nicolas Throckmorton, lui demandant un passeport pour retourner en Angleterre en passant par l'Écosse.
sentence2 : En octobre 1560, il rencontra secrètement l'ambassadeur d'Angleterre, Nicolas Throckmorton, à Paris, et lui demanda un passeport pour retourner en Écosse par l'Angleterre.
label : 0

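Data Fields

All files are in TSV format with four columns:

Column name	Data
id	An ID that matches the ID of the source pair in PAWS-Wiki
sentence1	The first sentence
sentence2	The second sentence
label	Label for each pair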
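
The four-column layout can be read with the Python standard library alone. The following is a minimal sketch, assuming each file starts with a header row naming the four columns and that, as in PAWS, label 1 marks a paraphrase and 0 a non-paraphrase. `PawsXPair`, `read_pairs`, and the two sample rows are hypothetical names and data introduced for illustration; they are not part of the dataset.

```python
import csv
import io
from typing import NamedTuple


class PawsXPair(NamedTuple):
    pair_id: int
    sentence1: str
    sentence2: str
    label: int  # assumption: 1 = paraphrase, 0 = not a paraphrase


def read_pairs(tsv_text: str) -> list[PawsXPair]:
    """Parse TSV text whose first row is the header id/sentence1/sentence2/label."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [
        PawsXPair(int(row["id"]), row["sentence1"], row["sentence2"], int(row["label"]))
        for row in reader
    ]


# Hypothetical two-row sample in the four-column layout (not real dataset rows).
SAMPLE_TSV = (
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tA flew from B to C .\tA flew from C to B .\t0\n"
    "2\tThe film was shot in 1999 .\tIn 1999 , the film was shot .\t1\n"
)

pairs = read_pairs(SAMPLE_TSV)
for pair in pairs:
    print(pair.pair_id, pair.label, pair.sentence1, "|", pair.sentence2)
```

To read a real file, the same function could be given the file's contents, for example `read_pairs(open(path, encoding="utf-8").read())`, where `path` points at a local copy of one of the TSV files.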

Data Splits

The number of examples for each language is shown below:

Language	Train	Validation	Test
en	49,401	2,000	2,000
fr	49,401	2,000	2,000
es	49,401	2,000	2,000
de	49,401	2,000	2,000
zh	49,401	2,000	2,000
ja	49,401	2,000	2,000
ko	49,401	2,000	2,000
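
As a quick consistency check, the per-language counts in the table can be summed and compared with the totals quoted in the dataset summary. This is only an arithmetic sketch; the dictionary below restates the table, and `SPLITS` is a name introduced here.

```python
# Per-language split sizes copied from the table above.
LANGS = ("en", "fr", "es", "de", "zh", "ja", "ko")
SPLITS = {lang: {"train": 49_401, "validation": 2_000, "test": 2_000} for lang in LANGS}

# English is the source language; the other six are translations.
translated = [lang for lang in LANGS if lang != "en"]

machine_translated_train = sum(SPLITS[lang]["train"] for lang in translated)
translated_eval_rows = sum(
    SPLITS[lang]["validation"] + SPLITS[lang]["test"] for lang in translated
)

print(machine_translated_train)  # 6 * 49,401
print(translated_eval_rows)      # 6 * (2,000 + 2,000)
```

The training total works out to 296,406, which matches the number of machine-translated training pairs stated in the summary. The evaluation total from the table is 24,000 rows, slightly more than the 23,659 human-translated pairs quoted in the summary; the card does not say how the two figures relate.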

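Dataset Creation

Curation Rationale

Most existing work on adversarial data generation focuses on English. For example, PAWS (Paraphrase Adversaries from Word Scrambling) (Zhang et al., 2019) consists of challenging English paraphrase identification pairs from Wikipedia and Quora. PAWS-X fills this gap with 23,659 human-translated PAWS evaluation pairs in six languages: French, Spanish, German, Chinese, Japanese, and Korean.

Source Data

PAWS (Paraphrase Adversaries from Word Scrambling)

Initial Data Collection and Normalization

All translated pairs are sourced from examples in PAWS-Wiki.

Source Language Producers

The dataset contains 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs in six languages: French, Spanish, German, Chinese, Japanese, and Korean.

Annotations

Annotation Process

If applicable, describe the annotation process and any tools used, or state otherwise. Describe the amount of data annotated, if not all. Describe or reference annotation guidelines provided to the annotators. If available, provide interannotator statistics. Describe any annotation validation processes.

Annotators

The paper credits the translation team, especially Mengmeng Niu, for help with the annotations.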

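Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

List the people involved in collecting the dataset and their affiliation(s). If funding information is known, include it here.

Licensing Information

The dataset may be freely used for any purpose, although acknowledgement of Google LLC ("Google") as the data source would be appreciated. The dataset is provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.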

Citation Information

@InProceedings{pawsx2019emnlp,
  title = {{PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification}},
  author = {Yang, Yinfei and Zhang, Yuan and Tar, Chris and Baldridge, Jason},
  booktitle = {Proc. of EMNLP},
  year = {2019}
}

Contributions

Thanks to @bhavitvyamalik and @gowtham1997 for adding this dataset.

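Collected Summary

Dataset Introduction

Construction

In cross-lingual natural language processing, high-quality parallel corpora are key to advancing multilingual models. PAWS-X is built from the original English PAWS-Wiki data and combines professional human translation with machine translation to produce parallel sentence pairs in six languages: French, Spanish, German, Chinese, Japanese, and Korean. The evaluation sets comprise 23,659 human-translated sentence pairs, so that evaluation data is accurate and reads naturally; the training sets comprise 296,406 machine-translated sentence pairs, which extends the scale of the data. This hybrid approach balances data quality against broad cross-lingual coverage and gives research on multilingual semantic similarity a solid basis.

Features

PAWS-X has several distinctive features for cross-lingual semantic similarity tasks. It covers six typologically distinct languages (French, Spanish, German, Chinese, Japanese, and Korean) alongside the English source. Each language has 49,401 training examples and 2,000 examples each for validation and test, so the data is balanced across languages. Each example is a sentence pair with a binary label indicating whether the two sentences are paraphrases. Notably, the underlying pairs were generated adversarially: strategies such as word-order scrambling produce challenging negative examples that test a model's grasp of sentence structure and context, which makes the dataset a useful benchmark for the robustness of multilingual models.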
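
To make the adversarial property concrete, the sketch below measures plain token overlap on the English example pair from the dataset card, which is labeled 0 (not a paraphrase). It uses Jaccard similarity over lowercased, whitespace-split tokens; this is an illustrative measure chosen here, not a method from the PAWS-X paper, and `token_jaccard` is a hypothetical helper name.

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the sets of lowercased whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# English example pair from the dataset card (id 1, label 0).
S1 = (
    "In Paris , in October 1560 , he secretly met the English ambassador , "
    "Nicolas Throckmorton , asking him for a passport to return to England "
    "through Scotland ."
)
S2 = (
    "In October 1560 , he secretly met with the English ambassador , "
    "Nicolas Throckmorton , in Paris , and asked him for a passport to return "
    "to Scotland through England ."
)
LABEL = 0

overlap = token_jaccard(S1, S2)
print(f"token overlap = {overlap:.3f}, label = {LABEL}")
```

The two sentences share most of their vocabulary, yet swapping "England" and "Scotland" changes the meaning, so a model that relies on word overlap alone would likely misjudge this pair.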

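Usage

In natural language processing research, PAWS-X is mainly used for cross-lingual paraphrase identification. Users can load the data directly from the Hugging Face platform and select a language configuration (de, en, es, and so on) to obtain the corresponding train, validation, and test splits. The data is distributed in TSV format with four fields (id, sentence1, sentence2, and label), which makes it straightforward to use for model training and evaluation. A typical application is fine-tuning a multilingual pre-trained model such as mBERT to improve paraphrase judgments in non-English languages. Note that the validation and test sets are both drawn from the PAWS-Wiki dev set, so the same sentence may appear in both; no sentence pair appears in both, which keeps evaluation sound.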
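
Loading one language configuration might look like the sketch below. It assumes the Hugging Face `datasets` package is installed and that network access is available; the repository id and configuration names are taken from this page, while `check_config` and `load_pawsx` are hypothetical helper names introduced here.

```python
PAWSX_REPO = "google-research-datasets/paws-x"
PAWSX_CONFIGS = ("de", "en", "es", "fr", "ja", "ko", "zh")


def check_config(lang: str) -> str:
    """Return lang unchanged if it is one of the seven listed configurations."""
    if lang not in PAWSX_CONFIGS:
        raise ValueError(f"unknown PAWS-X config {lang!r}; expected one of {PAWSX_CONFIGS}")
    return lang


def load_pawsx(lang: str):
    """Download and return all splits for one language.

    Requires `pip install datasets` and network access, so the import is
    done lazily inside the function.
    """
    from datasets import load_dataset

    return load_dataset(PAWSX_REPO, check_config(lang))


# Example use (not executed here because it downloads data):
#   ds = load_pawsx("de")
#   print(ds)               # lists the train / test / validation splits
#   print(ds["train"][0])   # one row with id, sentence1, sentence2, label
```

Validating the configuration name before calling the library gives a clearer error message for a mistyped language code than a failed download would.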

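Background and Challenges

Background Overview

In natural language processing, cross-lingual semantic similarity evaluation has long been a central topic in the development of multilingual models. PAWS-X was released by Google Research in 2019 to address the lack of language diversity in earlier paraphrase identification work. Building on PAWS-Wiki, it combines professional human translation with machine translation to provide adversarial examples in seven languages: German, English, Spanish, French, Japanese, Korean, and Chinese. Its core research question is the deep semantic alignment of sentence structure across languages, and it serves as a standardized benchmark for evaluating multilingual pre-trained models.

Current Challenges

The challenges around PAWS-X fall into two groups. At the task level, cross-lingual paraphrase identification has to cope with structural differences and differing cultural context between languages; for example, the contrast between the Japanese particle system and English word order can make semantic equivalence harder for a model to judge. At the construction level, human translation has to preserve the adversarial properties of the source text (such as word-order swaps and near-synonym substitution), machine translation has to maintain semantic fidelity in lower-resource languages (such as Korean), and the validation and test sets must not overlap at the sentence-pair level, which places precise demands on the data split.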
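
The split property described for this dataset (validation and test may share an individual sentence but never a full sentence pair) can be checked with two set intersections. The sketch below is a minimal illustration; `overlap_report` is a hypothetical helper, and the toy rows are invented for the example rather than taken from the dataset.

```python
Pair = tuple[str, str]  # (sentence1, sentence2)


def overlap_report(dev: list[Pair], test: list[Pair]) -> tuple[int, int]:
    """Return (number of shared sentence1 values, number of shared full pairs)."""
    shared_first = {s1 for s1, _ in dev} & {s1 for s1, _ in test}
    shared_pairs = set(dev) & set(test)
    return len(shared_first), len(shared_pairs)


# Toy rows (hypothetical): the same sentence1 appears in both splits,
# but it is paired with a different sentence2 each time.
dev_rows = [
    ("A met B in Paris .", "B met A in Paris ."),
    ("The river flows north .", "North flows the river ."),
]
test_rows = [
    ("A met B in Paris .", "In Paris , A met B ."),
    ("C wrote the book in 1990 .", "The book was written by C in 1990 ."),
]

shared_sentence1, shared_full_pairs = overlap_report(dev_rows, test_rows)
print(shared_sentence1, shared_full_pairs)
```

On real data, the rows could be built from the `sentence1` and `sentence2` fields of the validation and test splits for one language; a nonzero second value would indicate a violation of the stated guarantee.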

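Common Scenarios

Classic Use Cases

In natural language processing, cross-lingual semantic similarity evaluation often suffers from data scarcity. By providing parallel data in seven languages, PAWS-X has become a standard benchmark for cross-lingual paraphrase identification. Researchers use it to train and evaluate multilingual models that decide whether the sentences in a pair express the same meaning in each language. It is particularly valuable for cases that involve word-order changes and structural rearrangement.

Academic Problems Addressed

The dataset addresses a core problem in cross-lingual semantic understanding: identifying paraphrases accurately across language barriers. Its multilingual adversarial examples expose the limits of earlier models in capturing non-local context and syntactic structure, and have motivated progress in multilingual pre-training. It provides a standardized measure of a model's cross-lingual transfer ability and has deepened research on semantic similarity in non-English languages.

Derived Related Work

Research building on PAWS-X includes studies of fine-tuning strategies for multilingual BERT, cross-lingual adversarial training methods, and paraphrase detection models based on semantic graph structures. This work has extended the cross-lingual evaluation of pre-trained architectures such as XLM-R and InfoXLM, and PAWS-X has supplied data for later multi-task benchmarks such as XTREME.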

The content above was collected and summarized by 遇见数据集.
