paws

ModelScope community. Updated 2025-12-05; indexed 2025-07-12.
Download link:
https://modelscope.cn/datasets/google-research-datasets/paws
Description:
# Dataset Card for PAWS: Paraphrase Adversaries from Word Scrambling

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [PAWS](https://github.com/google-research-datasets/paws)
- **Repository:** [PAWS](https://github.com/google-research-datasets/paws)
- **Paper:** [PAWS: Paraphrase Adversaries from Word Scrambling](https://arxiv.org/abs/1904.01130)
- **Point of Contact:** [Yuan Zhang](mailto:zhangyua@google.com)

### Dataset Summary

PAWS: Paraphrase Adversaries from Word Scrambling

This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that highlight the importance of modeling structure, context, and word order information for paraphrase identification. The dataset has two subsets, one based on Wikipedia and the other based on the Quora Question Pairs (QQP) dataset.

For further details, see the accompanying paper: PAWS: Paraphrase Adversaries from Word Scrambling (https://arxiv.org/abs/1904.01130)

PAWS-QQP is not distributed directly because of the QQP license. It must be reconstructed by downloading the original data and then running the authors' scripts to produce the data and attach the labels.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

The text in the dataset is in English.

## Dataset Structure

### Data Instances

Below are two examples from the dataset:

| | Sentence 1 | Sentence 2 | Label |
| :-- | :---------------------------- | :---------------------------- | :---- |
| (1) | Although interchangeable, the body pieces on the 2 cars are not similar. | Although similar, the body parts are not interchangeable on the 2 cars. | 0 |
| (2) | Katz was born in Sweden in 1947 and moved to New York City at the age of 1. | Katz was born in 1947 in Sweden and moved to New York at the age of one. | 1 |

The first pair differs in meaning, while the second pair is a paraphrase. State-of-the-art models trained on existing datasets perform poorly on PAWS (<40% accuracy); however, including PAWS training data for these models improves their accuracy to 85% while maintaining performance on existing datasets such as the [Quora Question Pairs](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs).

### Data Fields

This corpus contains pairs generated from Wikipedia pages and is released as three sets:

* **PAWS-Wiki Labeled (Final)**: pairs generated by both the word swapping and back translation methods. All pairs have human judgments on both paraphrasing and fluency, and they are split into Train/Dev/Test sections.
* **PAWS-Wiki Labeled (Swap-only)**: pairs that have no back translation counterparts and are therefore not included in the first set. They are nevertheless high-quality pairs with human judgments on both paraphrasing and fluency, and they can be used as an auxiliary training set.
* **PAWS-Wiki Unlabeled (Final)**: pairs with noisy labels and no human judgments, which can also be used as an auxiliary training set. They are generated by both the word swapping and back translation methods.

All files are in TSV format with four columns:

| Column Name | Data |
| :------------ | :-------------------------- |
| id | A unique id for each pair |
| sentence1 | The first sentence |
| sentence2 | The second sentence |
| (noisy_)label | (Noisy) label for each pair |

Each label has two possible values: `0` indicates that the pair has different meanings, while `1` indicates that the pair is a paraphrase.

### Data Splits

The number of examples and the proportion of paraphrase (Yes%) pairs are shown below:

| Data | Train | Dev | Test | Yes% |
| :------------------ | ------: | -----: | ----: | ----: |
| Labeled (Final) | 49,401 | 8,000 | 8,000 | 44.2% |
| Labeled (Swap-only) | 30,397 | -- | -- | 9.6% |
| Unlabeled (Final) | 645,652 | 10,000 | -- | 50.0% |

## Dataset Creation

### Curation Rationale

Existing paraphrase identification datasets lack sentence pairs that have high lexical overlap without being paraphrases. Models trained on such data fail to distinguish pairs like *flights from New York to Florida* and *flights from Florida to New York*.

### Source Data

#### Initial Data Collection and Normalization

The authors' automatic generation method is based on two ideas. The first swaps words to generate a sentence pair with the same bag of words (BOW), controlled by a language model. The second uses back translation to generate paraphrases with high BOW overlap but different word order. These two strategies generate high-quality, diverse PAWS pairs, balanced evenly between paraphrases and non-paraphrases.

#### Who are the source language producers?

Mentioned above.

### Annotations

#### Annotation process

Sentence pairs are presented to five annotators, each of whom gives a binary judgment as to whether they are paraphrases or not. The authors chose binary judgments so that the dataset has the same label schema as the QQP corpus. Overall, human agreement is high on both Quora (92.0%) and Wikipedia (94.7%), and each label takes only about 24 seconds. As such, answers are usually straightforward for human raters.

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

The dataset may be freely used for any purpose, although acknowledgement of Google LLC ("Google") as the data source would be appreciated. The dataset is provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

### Citation Information

```
@InProceedings{paws2019naacl,
  title = {{PAWS: Paraphrase Adversaries from Word Scrambling}},
  author = {Zhang, Yuan and Baldridge, Jason and He, Luheng},
  booktitle = {Proc. of NAACL},
  year = {2019}
}
```

### Contributions

Thanks to [@bhavitvyamalik](https://github.com/bhavitvyamalik) for adding this dataset.
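
### Example: reading the TSV layout and measuring word overlap

The sketch below is an illustration only, not part of the official PAWS tooling. It parses the four-column TSV layout described under Data Fields and computes a simple bag-of-words overlap, to show why high lexical overlap alone does not indicate a paraphrase. The helper names `read_pairs` and `bow_overlap`, the in-memory sample, the presence of a header row, and the whitespace tokenization are all assumptions made for this example.

```python
import csv
import io
from collections import Counter

# Tiny in-memory stand-in for a PAWS TSV file (assumption: files start with a
# header row). The two rows are the example pairs from the Data Instances table.
SAMPLE_TSV = (
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tAlthough interchangeable, the body pieces on the 2 cars are not similar.\t"
    "Although similar, the body parts are not interchangeable on the 2 cars.\t0\n"
    "2\tKatz was born in Sweden in 1947 and moved to New York City at the age of 1.\t"
    "Katz was born in 1947 in Sweden and moved to New York at the age of one.\t1\n"
)


def read_pairs(handle, label_column="label"):
    """Parse a PAWS-style TSV stream into a list of dicts with an int label.

    Pass label_column="noisy_label" for files that carry noisy labels.
    """
    # QUOTE_NONE treats quote characters inside sentences as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {
            "id": row["id"],
            "sentence1": row["sentence1"],
            "sentence2": row["sentence2"],
            "label": int(row[label_column]),  # 0 = different meaning, 1 = paraphrase
        }
        for row in reader
    ]


def bow_overlap(sentence_a, sentence_b):
    """Multiset Jaccard overlap of lowercased, whitespace-split tokens."""
    bag_a = Counter(sentence_a.lower().split())
    bag_b = Counter(sentence_b.lower().split())
    union = sum((bag_a | bag_b).values())
    if union == 0:
        return 0.0
    return sum((bag_a & bag_b).values()) / union


pairs = read_pairs(io.StringIO(SAMPLE_TSV))
for pair in pairs:
    overlap = bow_overlap(pair["sentence1"], pair["sentence2"])
    print(pair["id"], pair["label"], f"{overlap:.2f}")

# The motivating pair from Curation Rationale: identical bags of words,
# opposite travel directions.
print(bow_overlap("flights from New York to Florida",
                  "flights from Florida to New York"))  # prints 1.0
```

The whitespace tokenizer keeps punctuation attached to words, so the overlap scores are only rough; a real experiment would likely use a proper tokenizer.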

Provider: maas
Created: 2025-07-07