
PASTED

Source: arXiv. Indexed: 2025-09-30
Download link:
https://github.com/Linzwcs/PASTED
Overview:
PASTED is a dataset for detecting AI-paraphrased text spans. It comprises in-distribution training, validation, and test sets, plus a generalization test set for out-of-distribution evaluation. The dataset contains both human-written and machine-generated text and covers a variety of paraphrasing styles, including both context-free and context-dependent paraphrases. The data is split into training, validation, and test sets at a ratio of 80%, 10%, and 10%. The in-distribution portion contains 83,089 instances (28,473 original texts and 54,616 paraphrased texts); the out-of-distribution portion contains 9,372 instances.
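As a rough illustration of the 80%/10%/10% split described above, the following Python sketch partitions the 83,089 in-distribution instance indices into three disjoint subsets. The function name `split_indices`, the random seed, and the rounding rule (truncate, with the remainder going to the test set) are assumptions made for this example; the listing does not say how the authors actually performed the split, and the repository may ship predefined split files.

```python
import random

def split_indices(n, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle range(n) and cut it into train/validation/test index lists.

    Hypothetical helper: the rounding rule and seed are illustrative
    assumptions, not the dataset authors' documented procedure.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(n * ratios[0])  # truncate toward zero
    n_val = int(n * ratios[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Counts taken from the dataset description.
N_ORIGINAL = 28_473
N_PARAPHRASED = 54_616
N_IN_DISTRIBUTION = N_ORIGINAL + N_PARAPHRASED

train, val, test = split_indices(N_IN_DISTRIBUTION)
print(len(train), len(val), len(test))
```

The 9,372 out-of-distribution instances are not part of this split; per the description, they form a separate generalization test set.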