xwjzds/pretrain_sts_long
Source: Hugging Face. Updated: 2023-11-24. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/xwjzds/pretrain_sts_long
Summary:
The Sentence Paraphase Collections dataset gathers sentence paraphrasing data from several sources: ChatGPT-generated paraphrases, PAWS, and the STS benchmark. Sentence pairs that are non-English, too short, or have a low similarity score have been filtered out. Each record has two string fields, input and output, which hold a sentence or paragraph and its paraphrase. The dataset consists of a single train split with 38,151 samples and is licensed under Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0).
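The summary names three filters (non-English, too short, low similarity) but gives neither the thresholds nor the tools used. The sketch below shows what such a filter might look like. Everything in it is an assumption made for illustration: the threshold values, the ASCII-ratio check standing in for language detection, and the `difflib` character ratio standing in for a semantic similarity score.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds: the source does not state the actual values.
MIN_WORDS = 5
MIN_SIMILARITY = 0.3
MIN_ASCII_RATIO = 0.9


def looks_english(text: str) -> bool:
    """Crude stand-in for language detection: share of ASCII characters."""
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ch.isascii())
    return ascii_chars / len(text) >= MIN_ASCII_RATIO


def similarity(a: str, b: str) -> float:
    """Character-level ratio from difflib, a stand-in for a semantic score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def keep_pair(source: str, paraphrase: str) -> bool:
    """Return True if the pair passes all three illustrative filters."""
    if not (looks_english(source) and looks_english(paraphrase)):
        return False
    if min(len(source.split()), len(paraphrase.split())) < MIN_WORDS:
        return False
    return similarity(source, paraphrase) >= MIN_SIMILARITY
```

A real pipeline would likely swap in a proper language identifier and an embedding-based similarity score; the structure of `keep_pair` would stay the same.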
Provider:
xwjzds
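Original information summary
Dataset overview
Dataset information
- Features:
  - input: string
  - output: string
- Splits:
  - train: 9,557,417 bytes, 38,151 samples
- Download size: 6,115,013 bytes
- Dataset size: 9,557,417 bytes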
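As a quick sanity check on the figures above, the following sketch derives the average record size and the ratio of download size to dataset size. It also includes a loader built on `load_dataset` from the Hugging Face `datasets` library; that function assumes the library is installed and the Hub repository is reachable, neither of which the source confirms.

```python
# Figures copied from the dataset information above.
TRAIN_BYTES = 9_557_417
TRAIN_SAMPLES = 38_151
DOWNLOAD_BYTES = 6_115_013


def average_bytes_per_sample() -> float:
    """Average size of one input/output record in the train split."""
    return TRAIN_BYTES / TRAIN_SAMPLES


def download_ratio() -> float:
    """Download size as a fraction of the dataset size."""
    return DOWNLOAD_BYTES / TRAIN_BYTES


def load_train_split():
    """Load the train split (assumes `datasets` is installed and the Hub is reachable)."""
    # Imported lazily so the arithmetic above runs without the library.
    from datasets import load_dataset
    return load_dataset("xwjzds/pretrain_sts_long", split="train")
```

With these numbers, a record averages roughly 250 bytes, and the download is about 64% of the dataset size.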
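Dataset summary
- Task type: sentence paraphrasing, with sources including ChatGPT paraphrases, PAWS, and the STS benchmark.
- Size: 223,241 entries in total for the paraphrase task (the train split listed above contains 38,151 samples).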
Data structure
- Data instance (JSON):

    {
      "input": "U.S. prosecutors have arrested more than 130 individuals and have seized more than $17 million in a continuing crackdown on Internet fraud and abuse.",
      "output": "More than 130 people have been arrested and $17 million worth of property seized in an Internet fraud sweep announced Friday by three U.S. government agencies."
    }

- Data fields: input and output each hold a sentence or paragraph; the two are paraphrases of each other.
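The record layout described above can be modelled as a small typed structure. The sketch below parses the data instance and checks that both fields are strings; the class name `ParaphrasePair` and its `from_json` helper are hypothetical names introduced for illustration and are not part of the dataset or any library.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ParaphrasePair:
    """One record: two strings that paraphrase each other (hypothetical helper)."""
    input: str
    output: str

    @classmethod
    def from_json(cls, line: str) -> "ParaphrasePair":
        """Parse one JSON object and verify that both fields are strings."""
        record = json.loads(line)
        for name in ("input", "output"):
            if not isinstance(record.get(name), str):
                raise ValueError(f"field {name!r} must be a string")
        return cls(input=record["input"], output=record["output"])


# The data instance shown above, as a single JSON string.
EXAMPLE = (
    '{"input": "U.S. prosecutors have arrested more than 130 individuals and '
    'have seized more than $17 million in a continuing crackdown on Internet '
    'fraud and abuse.", '
    '"output": "More than 130 people have been arrested and $17 million worth '
    'of property seized in an Internet fraud sweep announced Friday by three '
    'U.S. government agencies."}'
)

pair = ParaphrasePair.from_json(EXAMPLE)
```

For sequence-to-sequence training, `pair.input` would typically serve as the source text and `pair.output` as the target, though the source does not prescribe a direction.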
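Dataset creation
- Curation rationale: [More Information Needed]
- Initial data collection and normalization: [More Information Needed]
- Source language producers: [More Information Needed]
- Annotation process: [More Information Needed]
- Annotators: [More Information Needed]
- Personal and sensitive information: [More Information Needed]
Considerations for using the data
- Social impact: [More Information Needed]
- Discussion of biases: [More Information Needed]
- Other known limitations: [More Information Needed]
Additional information
- Dataset curators: [More Information Needed]
- Licensing information: the dataset is available under Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0).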
- Citation information:

    @misc{xu2023detime,
      title={DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM},
      author={Weijie Xu and Wenxiang Hu and Fanyou Wu and Srinivasan Sengamedu},
      year={2023},
      eprint={2310.15296},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }



