bmbgsj/AIGC-text-bank

Name: bmbgsj/AIGC-text-bank
Creator: bmbgsj
Published: 2026-04-22 14:17:43
License: No description available

Source: Hugging Face. Updated: 2026-04-22. Indexed: 2026-04-26.

Download link:

https://hf-mirror.com/datasets/bmbgsj/AIGC-text-bank

Description:

AIGC-text-bank is a large-scale, multi-domain dataset designed for real-world AI-generated content (AIGC) detection. It is introduced in the paper "Reasoning-Aware AIGC Detection via Alignment and Reinforcement." The dataset features 1.49 million samples with over 423 million tokens, sourced from 10 distinct domains (Academic, News, Social Media, etc.) and generated by 12 state-of-the-art LLMs, including GPT-4o, DeepSeek-R1, Llama-3.3-70B, Grok 4.1, and Phi-4. It consists of three core categories: `Human` (authentic human writing strictly collected before the release of ChatGPT), `AI-Native` (fully machine-generated text, semantically aligned and length-matched with human references), and `AI-Polish` (human-authored text refined by AI to improve fluency and style while preserving original semantics).
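The three-category scheme described above can be illustrated with a minimal sketch. The category names come from the description; everything else is an assumption for illustration: the field names `text` and `category` are hypothetical (the listing does not document the dataset's schema), the records are placeholders rather than real samples, and mapping `AI-Polish` to the AI-involved class is one possible choice, not necessarily the one the paper makes.

```python
from collections import Counter

# Category names as given in the dataset description.
CATEGORIES = ("Human", "AI-Native", "AI-Polish")


def to_binary_label(category: str) -> int:
    """Collapse the three categories into human (0) vs. AI-involved (1).

    Treating AI-Polish as AI-involved is an assumption of this sketch.
    """
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return 0 if category == "Human" else 1


def category_counts(records, key="category"):
    """Count records per category; `key` is a hypothetical field name."""
    counts = Counter(r[key] for r in records)
    return {c: counts.get(c, 0) for c in CATEGORIES}


# Placeholder records for illustration only; not taken from the dataset.
toy_records = [
    {"text": "placeholder human text", "category": "Human"},
    {"text": "placeholder generated text", "category": "AI-Native"},
    {"text": "placeholder polished text", "category": "AI-Polish"},
    {"text": "another placeholder human text", "category": "Human"},
]

counts = category_counts(toy_records)
binary = [to_binary_label(r["category"]) for r in toy_records]
print(counts)
print(binary)
```

A detector that needs to separate fully generated text from AI-polished human text would keep all three labels instead of collapsing them; the binary mapping is shown only because it is the simplest framing of the detection task.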

Provider:

bmbgsj
