xwjzds/pretrain_sts_long

Name: xwjzds/pretrain_sts_long
Creator: xwjzds
Published: 2023-11-24 22:08:25
License: No description available

Source: Hugging Face. Updated: 2023-11-24. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/xwjzds/pretrain_sts_long

Resource overview:

The Sentence Paraphase Collections dataset (Sentence_Paraphase) gathers sentence-paraphrasing data from several sources: paraphrases generated with ChatGPT, paraphrases from PAWS, and the STS benchmark. Sentence pairs that are non-English, too short, or have a low similarity score have been filtered out. Each record has two fields, input and output, both strings holding paraphrased sentences or paragraphs. The dataset has a single train split with 38,151 samples and is licensed under Creative Commons NonCommercial (CC BY-NC 4.0).
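The listing does not say how the three filters were implemented, so the following is only a minimal sketch of what such filtering could look like. Everything in it is an assumption made for illustration: the `Pair` type, the precomputed `similarity` score, the ASCII-ratio check standing in for real language detection, and the thresholds `min_words=5` and `min_similarity=0.7`.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    input: str
    output: str
    similarity: float  # hypothetical precomputed score in [0, 1]


def mostly_ascii(text: str, min_ratio: float = 0.9) -> bool:
    """Crude stand-in for language detection: share of ASCII characters."""
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ord(ch) < 128)
    return ascii_chars / len(text) >= min_ratio


def keep_pair(pair: Pair, min_words: int = 5, min_similarity: float = 0.7) -> bool:
    """Return True when the pair passes all three illustrative filters."""
    long_enough = all(len(s.split()) >= min_words for s in (pair.input, pair.output))
    looks_english = mostly_ascii(pair.input) and mostly_ascii(pair.output)
    return long_enough and looks_english and pair.similarity >= min_similarity


# Made-up candidates: one good pair, one too short, one non-English, one dissimilar.
candidates = [
    Pair("The cat sat quietly on the mat.", "A cat was sitting quietly on the mat.", 0.92),
    Pair("Too short.", "Very short.", 0.95),
    Pair("这是一个用中文写的句子，不是英文。", "This sentence is written in Chinese, not English.", 0.90),
    Pair("The market closed higher on Friday.", "He likes to play football after school.", 0.10),
]
kept = [p for p in candidates if keep_pair(p)]
print(len(kept))
```

Only the first candidate passes all three checks. A real pipeline would likely replace the ASCII heuristic with a proper language-identification model and compute similarity with a sentence encoder.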

Provider:

xwjzds

Summary of original information

Dataset overview

Dataset information

Features:
- input: string
- output: string
Splits:
- train: 9,557,417 bytes, 38,151 samples
Download size: 6,115,013 bytes
Dataset size: 9,557,417 bytes
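As a quick sanity check on the figures above, the sketch below derives the average record size and the download-to-dataset size ratio. The `load_train_split` helper is an assumption: it supposes the dataset can be fetched from the Hugging Face Hub under the repository id shown in this listing using the `datasets` library, which requires that package and network access, so it is defined but not called here.

```python
TRAIN_BYTES = 9_557_417      # byte size of the train split, from the listing
TRAIN_EXAMPLES = 38_151      # number of samples in the train split
DOWNLOAD_BYTES = 6_115_013   # download size, from the listing

avg_bytes_per_pair = TRAIN_BYTES / TRAIN_EXAMPLES
download_ratio = DOWNLOAD_BYTES / TRAIN_BYTES


def load_train_split():
    """Load the train split; assumes `datasets` is installed and the Hub is reachable."""
    from datasets import load_dataset  # imported lazily so the arithmetic runs offline
    return load_dataset("xwjzds/pretrain_sts_long", split="train")


print(f"{avg_bytes_per_pair:.1f} bytes per pair")
print(f"download is {download_ratio:.0%} of the dataset size")
```

The average of roughly 250 bytes per input/output pair is consistent with records that are mostly single sentences rather than long paragraphs.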

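Dataset summary

Task type: sentence paraphrasing, with sources including ChatGPT paraphrases, PAWS, and the STS benchmark.
Number of categories: the paraphrasing task comprises 223,241 records in total.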

Data structure

Data instance (JSON):

    {
      "input": "U.S. prosecutors have arrested more than 130 individuals and have seized more than $17 million in a continuing crackdown on Internet fraud and abuse.",
      "output": "More than 130 people have been arrested and $17 million worth of property seized in an Internet fraud sweep announced Friday by three U.S. government agencies."
    }

Data fields:
- input and output are paraphrases of a sentence or paragraph.
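A minimal sketch of how a record with this structure might be parsed and checked, using the data instance shown above. The `check_record` helper is hypothetical and only encodes what the listing states: exactly two fields, `input` and `output`, both strings; the non-empty check is an added assumption.

```python
import json

# The data instance from the listing, as a JSON document.
RAW_RECORD = """
{
  "input": "U.S. prosecutors have arrested more than 130 individuals and have seized more than $17 million in a continuing crackdown on Internet fraud and abuse.",
  "output": "More than 130 people have been arrested and $17 million worth of property seized in an Internet fraud sweep announced Friday by three U.S. government agencies."
}
"""


def check_record(record: dict) -> bool:
    """Return True if the record has exactly the two non-empty string fields."""
    if set(record) != {"input", "output"}:
        return False
    return all(
        isinstance(record[key], str) and bool(record[key].strip())
        for key in ("input", "output")
    )


record = json.loads(RAW_RECORD)
print(check_record(record))
```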

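Dataset creation

Curation rationale: [More Information Needed]
Initial data collection and normalization: [More Information Needed]
Source language producers: [More Information Needed]
Annotation process: [More Information Needed]
Annotators: [More Information Needed]
Personal and sensitive information: [More Information Needed]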

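Considerations for using the dataset

Social impact: [More Information Needed]
Discussion of biases: [More Information Needed]
Other known limitations: [More Information Needed]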

Additional information

Dataset curators: [More Information Needed]
Licensing information: the dataset is available under Creative Commons NonCommercial (CC BY-NC 4.0).
Citation information:

    @misc{xu2023detime,
      title={DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM},
      author={Weijie Xu and Wenxiang Hu and Fanyou Wu and Srinivasan Sengamedu},
      year={2023},
      eprint={2310.15296},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }
