flax-sentence-embeddings/paws-jsonl
Source: Hugging Face. Updated: 2021-07-02. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/flax-sentence-embeddings/paws-jsonl
Resource description:
# Introduction
This dataset is a JSONL version of the PAWS dataset from https://github.com/google-research-datasets/paws. It contains only the `PAWS-Wiki Labeled (Final)` and
`PAWS-Wiki Labeled (Swap-only)` training sections of the original PAWS dataset. Duplicate entries have been removed.
Each line contains a dict in the following format:
`{"guid": <id>, "texts": [anchor, positive]}` or
`{"guid": <id>, "texts": [anchor, positive, negative]}`
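As an illustration of the record format above, here is a minimal sketch that parses one line into its parts. The names `PawsExample` and `parse_record` are hypothetical helpers introduced for this example, and the sample lines are made up rather than taken from the dataset. The type of `guid` is not stated in the card, so it is passed through unchanged.

```python
import json
from typing import NamedTuple, Optional


class PawsExample(NamedTuple):
    """Hypothetical container for one record; not part of the dataset."""
    guid: object             # type is not specified in the card
    anchor: str
    positive: str
    negative: Optional[str]  # None when the record has only two texts


def parse_record(line: str) -> PawsExample:
    """Parse one JSONL line with a "guid" and a 2- or 3-item "texts" list."""
    record = json.loads(line)
    texts = record["texts"]
    if len(texts) not in (2, 3):
        raise ValueError(f"expected 2 or 3 texts, got {len(texts)}")
    negative = texts[2] if len(texts) == 3 else None
    return PawsExample(record["guid"], texts[0], texts[1], negative)


# Made-up sample lines that follow the two documented shapes.
pair = parse_record('{"guid": 1, "texts": ["a b c", "a c b"]}')
triplet = parse_record('{"guid": 2, "texts": ["x y", "y x", "x z"]}')
```

Returning `None` for the missing negative lets downstream code handle both record shapes with a single type.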
- positives_negatives.jsonl.gz: 24,723
- positives_only.jsonl.gz: 13,487
- **Total**: 38,210
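The files are gzip-compressed JSONL, so they can be streamed line by line without decompressing them to disk first. The sketch below assumes a file has already been downloaded to a local path; `iter_records` and `count_records` are hypothetical helper names, and only the Python standard library is used.

```python
import gzip
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_records(path: str) -> int:
    """Count records by streaming the file once."""
    return sum(1 for _ in iter_records(path))
```

If a local copy matches the counts listed above, `count_records("positives_only.jsonl.gz")` would be expected to return 13487; the path here is an assumption about where the file was saved.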
## Dataset summary
[**PAWS: Paraphrase Adversaries from Word Scrambling**](https://github.com/google-research-datasets/paws)
This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification. The dataset has two subsets, one based on Wikipedia and the other one based on the Quora Question Pairs (QQP) dataset.
Provider:
flax-sentence-embeddings
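Summary of original information
Dataset overview
Dataset source
- The data come from PAWS, specifically the training portions of PAWS-Wiki Labeled (Final) and PAWS-Wiki Labeled (Swap-only).
Dataset format
- The dataset is in JSONL format; each line contains a dict in one of these forms:
{"guid": <id>, "texts": [anchor, positive]}
{"guid": <id>, "texts": [anchor, positive, negative]}
Dataset files and sizes
- positives_negatives.jsonl.gz: 24,723 records
- positives_only.jsonl.gz: 13,487 records
- Total: 38,210 records
Dataset content
- The original PAWS dataset contains 108,463 human-labeled pairs and 656k noisily labeled pairs that highlight the importance of modeling structure, context, and word order for paraphrase identification.
- PAWS has two subsets, one based on Wikipedia and the other based on the Quora Question Pairs (QQP) dataset.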



