flax-sentence-embeddings/paws-jsonl

Name: flax-sentence-embeddings/paws-jsonl
Creator: flax-sentence-embeddings
Published: 2021-07-02 10:19:03
License: No description available

Source: Hugging Face. Updated 2021-07-02. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/flax-sentence-embeddings/paws-jsonl


Resource description:

# Introduction

This dataset is a JSONL version of the PAWS dataset from https://github.com/google-research-datasets/paws. It contains only the `PAWS-Wiki Labeled (Final)` and `PAWS-Wiki Labeled (Swap-only)` training sections of the original PAWS dataset. Duplicate entries have been removed.

Each line contains a dict in one of the following formats:

`{"guid": <id>, "texts": [anchor, positive]}` or `{"guid": <id>, "texts": [anchor, positive, negative]}`

- positives_negatives.jsonl.gz: 24,723
- positives_only.jsonl.gz: 13,487
- **Total**: 38,210

## Dataset summary

[**PAWS: Paraphrase Adversaries from Word Scrambling**](https://github.com/google-research-datasets/paws)

The original PAWS dataset contains 108,463 human-labeled and 656k noisily labeled pairs that highlight the importance of modeling structure, context, and word order information for the problem of paraphrase identification. It has two subsets, one based on Wikipedia and the other based on the Quora Question Pairs (QQP) dataset.

Provided by:

flax-sentence-embeddings

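Summary of original information

Dataset overview

Dataset source

The dataset is derived from PAWS and contains only the training sections of PAWS-Wiki Labeled (Final) and PAWS-Wiki Labeled (Swap-only).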

Dataset format

The dataset is in JSONL format. Each line contains a dict in one of the following forms:
- {"guid": <id>, "texts": [anchor, positive]}
- {"guid": <id>, "texts": [anchor, positive, negative]}
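
The two record shapes above can be read with a few lines of standard-library Python. The following is a minimal sketch, not the dataset authors' loader: the function names `parse_record` and `read_records` are hypothetical, and the sample lines in the usage section are invented for illustration rather than taken from the dataset.

```python
import gzip
import json


def parse_record(line):
    """Parse one JSONL line into (guid, anchor, positive, negative).

    `negative` is None for records that only carry [anchor, positive].
    """
    record = json.loads(line)
    texts = record["texts"]
    if len(texts) == 2:
        anchor, positive = texts
        negative = None
    elif len(texts) == 3:
        anchor, positive, negative = texts
    else:
        raise ValueError(f"expected 2 or 3 texts, got {len(texts)}")
    return record["guid"], anchor, positive, negative


def read_records(path):
    """Yield parsed records from a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines, if any
                yield parse_record(line)


# Invented sample lines that follow the documented shapes (not real data).
sample_pair = '{"guid": 1, "texts": ["a b c", "a c b"]}'
sample_triplet = '{"guid": 2, "texts": ["a b c", "a c b", "c b a"]}'

pair = parse_record(sample_pair)
triplet = parse_record(sample_triplet)
```

Returning `None` for a missing negative keeps both record shapes behind one tuple layout, so downstream code can branch on a single field. To read a downloaded file, one would call `read_records("positives_negatives.jsonl.gz")`, assuming the file sits in the working directory.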

Dataset files and sizes

positives_negatives.jsonl.gz: 24,723 records
positives_only.jsonl.gz: 13,487 records
Total: 38,210 records
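
The listed record counts can serve as a quick integrity check after download. This is a rough sketch under the assumption that each record occupies exactly one non-blank line. The names `EXPECTED_COUNTS`, `count_records`, and `count_file` are hypothetical helpers, not part of any published tooling for this dataset.

```python
import gzip

# Record counts as stated in the listing above.
EXPECTED_COUNTS = {
    "positives_negatives.jsonl.gz": 24_723,
    "positives_only.jsonl.gz": 13_487,
}
EXPECTED_TOTAL = 38_210


def count_records(lines):
    """Count non-blank lines in any iterable of strings."""
    return sum(1 for line in lines if line.strip())


def count_file(path):
    """Count records in a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_records(handle)


# The per-file counts should add up to the stated total.
total_from_files = sum(EXPECTED_COUNTS.values())
```

After downloading, comparing `count_file(name)` against `EXPECTED_COUNTS[name]` for each file would flag a truncated or partially fetched archive.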

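Dataset contents

The original PAWS dataset contains 108,463 human-labeled pairs and 656k noisily labeled pairs that highlight the importance of modeling structure, context, and word order information for paraphrase identification.
It has two subsets, one based on Wikipedia and the other based on the Quora Question Pairs (QQP) dataset.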
