Fully Synthetic Dataset and Partial Text Replacement Dataset

Name: Fully Synthetic Dataset and Partial Text Replacement Dataset
Creator: Paris 13 University, LIPN, CNRS, UMR 7030, Villetaneuse, France
Published: 2022-04-29 20:04:33
License: No description available

Source: arXiv | Updated: 2022-04-29 | Indexed: 2024-06-21

Download link:

https://github.com/vijini/GeneratedTextDetection.git

Resource description:

This dataset was created by the LIPN laboratory at Paris 13 University as a benchmark for detecting automatically generated text in academic publications. It has two parts: a fully synthetic dataset and a partial text replacement dataset. In the first, the text is generated by the GPT-2 model from prompts taken from original papers. In the second, some sentences in the abstracts are replaced with content generated by the Arxiv-NLP model. The generated text is evaluated with BLEU and ROUGE, used here as fluency metrics. The dataset is designed to make automatically generated text harder to detect, and it is suitable for evaluating and improving generated-text detection techniques.
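The partial text replacement idea can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function name `partially_replace`, the `fraction` and `seed` parameters, the choice to condition on the preceding original sentences, and the stub generator are all assumptions made for illustration. A real setup would plug a language model in place of the stub.

```python
import random
from typing import Callable, List, Tuple


def partially_replace(
    sentences: List[str],
    generate: Callable[[str], str],
    fraction: float = 0.3,
    seed: int = 0,
) -> Tuple[List[str], List[int]]:
    """Replace a random subset of sentences with generated ones.

    Returns the mixed sentences and a parallel list of labels
    (1 = machine-generated, 0 = original). Hypothetical helper,
    not taken from the dataset's repository.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")

    rng = random.Random(seed)  # seeded for reproducible selection
    n_replace = round(len(sentences) * fraction)
    chosen = set(rng.sample(range(len(sentences)), n_replace))

    mixed: List[str] = []
    labels: List[int] = []
    for i, sentence in enumerate(sentences):
        if i in chosen:
            # Assumption: the generator sees the original preceding sentences.
            context = " ".join(sentences[:i])
            mixed.append(generate(context))
            labels.append(1)
        else:
            mixed.append(sentence)
            labels.append(0)
    return mixed, labels


def stub_generator(context: str) -> str:
    """Placeholder standing in for a text generation model."""
    return "[generated sentence]"


if __name__ == "__main__":
    abstract = [
        "We study generated text detection.",
        "Our benchmark has two parts.",
        "We evaluate several detectors.",
        "Results show the task is hard.",
    ]
    mixed, labels = partially_replace(abstract, stub_generator, fraction=0.5)
    for label, sentence in zip(labels, mixed):
        print(label, sentence)
```

Keeping a per-sentence label alongside the mixed abstract is one way to support sentence-level detection as well as document-level detection; whether the released files use this layout is not stated in the description.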

Provider:

Paris 13 University, LIPN, CNRS, UMR 7030, Villetaneuse, France

Created:

2022-02-04