xwjzds/paraphrase_collections

Name: xwjzds/paraphrase_collections
Creator: xwjzds
Published: 2023-11-22 23:07:41
License: No description available

Hugging Face | Updated 2023-11-22 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/xwjzds/paraphrase_collections

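Resource summary:

Sentence Paraphase Collections is a dataset for sentence paraphrasing that combines paraphrase data from several sources: paraphrases generated with ChatGPT, PAWS (Paraphrase Adversaries from Word Scrambling), and the STS benchmark. Pairs that are non-English, too short, or insufficiently similar have been filtered out. The dataset contains 223,241 paraphrase pairs; each instance consists of an input sentence and its paraphrased output. Details of dataset creation, the annotation process, and the source data producers are not provided. Use of the dataset is subject to the Creative Commons NonCommercial (CC BY-NC 4.0) license.

Provider:

xwjzds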

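Summary of original information

Dataset overview

Dataset name

Sentence Paraphase Collections

Dataset description

Sentence Paraphase is a dataset that combines sentence paraphrasing data from multiple sources, including paraphrases produced with ChatGPT, Paraphrase Adversaries from Word Scrambling (PAWS), and the STS benchmark. Pairs that are non-English, too short, or insufficiently similar have been filtered out.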
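The three filters named above (non-English, too short, insufficiently similar) can be illustrated with a minimal sketch. The card does not state how the filtering was actually done, so everything here is an assumption: the thresholds are hypothetical, the ASCII-ratio check is a crude stand-in for real language detection, and difflib's SequenceMatcher is a stand-in for whatever similarity measure the authors used.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds: the dataset card does not state the real ones.
MIN_WORDS = 4
MIN_SIMILARITY = 0.3
MIN_ASCII_RATIO = 0.9


def looks_english(text: str) -> bool:
    """Crude stand-in for language detection: share of ASCII characters."""
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ch.isascii())
    return ascii_chars / len(text) >= MIN_ASCII_RATIO


def keep_pair(source: str, paraphrase: str) -> bool:
    """Return True if the pair passes all three illustrative filters."""
    # Filter 1: both sides should look like English text.
    if not (looks_english(source) and looks_english(paraphrase)):
        return False
    # Filter 2: drop pairs where either side is too short.
    if min(len(source.split()), len(paraphrase.split())) < MIN_WORDS:
        return False
    # Filter 3: drop pairs whose surface similarity is too low.
    similarity = SequenceMatcher(None, source.lower(), paraphrase.lower()).ratio()
    return similarity >= MIN_SIMILARITY
```

A character-level ratio only captures surface overlap; a sentence-embedding similarity would likely be a closer match for paraphrase data, but that would add a model dependency the card does not mention.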

Dataset structure

Features:
- input: string
- output: string
Data instances:
- Example:
  
  {"input": "U.S. prosecutors have arrested more than 130 individuals and have seized more than $17 million in a continuing crackdown on Internet fraud and abuse.", "output": "More than 130 people have been arrested and $17 million worth of property seized in an Internet fraud sweep announced Friday by three U.S. government agencies."}
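The record layout above (two string fields, input and output) can be checked with a few lines of Python. This is a sketch, not part of the dataset card: the helper names is_valid_record and load_pairs are hypothetical, and load_pairs assumes the Hugging Face datasets library is installed and the Hub is reachable.

```python
from typing import Any


def is_valid_record(record: Any) -> bool:
    """Check that a record has exactly the two string fields listed on the card."""
    return (
        isinstance(record, dict)
        and set(record) == {"input", "output"}
        and all(isinstance(record[key], str) for key in ("input", "output"))
    )


def load_pairs(split: str = "train"):
    """Load the dataset from the Hugging Face Hub (needs `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily so the validator works offline

    return load_dataset("xwjzds/paraphrase_collections", split=split)


# The example instance shown on the card.
example = {
    "input": (
        "U.S. prosecutors have arrested more than 130 individuals and have seized "
        "more than $17 million in a continuing crackdown on Internet fraud and abuse."
    ),
    "output": (
        "More than 130 people have been arrested and $17 million worth of property "
        "seized in an Internet fraud sweep announced Friday by three U.S. government agencies."
    ),
}
```

Calling is_valid_record(example) returns True for the card's example; a record with a missing field or a non-string value is rejected.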

Dataset statistics

Category counts:
- Paraphrase: 223241

Dataset size

Download size: 21377198 bytes
Dataset size: 34347236 bytes
Training split:
- Bytes: 34347236
- Examples: 223241
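As a quick sanity check on these figures, the short calculation below derives the average size of one pair and the ratio of download size to dataset size. The variable names are illustrative; the numbers are taken from the listing above. The results come to roughly 154 bytes per pair and a download that is about 62% of the dataset size.

```python
# Figures copied from the size listing above.
DOWNLOAD_BYTES = 21_377_198
DATASET_BYTES = 34_347_236
NUM_EXAMPLES = 223_241

# Average number of bytes taken by one input/output pair.
avg_bytes_per_pair = DATASET_BYTES / NUM_EXAMPLES

# Fraction of the dataset size that has to be downloaded.
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

print(f"average bytes per pair: {avg_bytes_per_pair:.1f}")
print(f"download / dataset size: {download_ratio:.2f}")
```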

License information

This dataset is made available under the Creative Commons NonCommercial (CC BY-NC 4.0) license.

Citation information

@misc{xu2023detime,
  title={DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM},
  author={Weijie Xu and Wenxiang Hu and Fanyou Wu and Srinivasan Sengamedu},
  year={2023},
  eprint={2310.15296},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
