MGRD: A Multi-level Gradient Rewriting Dataset of Chinese Academic Paper Abstracts for AIGC Detection
Source: DataCite Commons. Updated: 2026-04-30. Indexed: 2026-05-05.
Download link:
https://www.scidb.cn/detail?dataSetId=b6ceb6d54d0e4ff98f024c6433bb4424
Description:
MGRD (A Multi-level Gradient Rewriting Dataset of Chinese Academic Paper Abstracts for AIGC Detection) was constructed to support research on AIGC detection, text source identification and academic integrity in Chinese academic writing. Its human-written texts are abstracts of Chinese core journal articles indexed by CNKI and Wanfang Data between 2010 and 2022, covering three disciplinary areas: computer technology, architectural theory and Chinese drama. Five large language models (glm-4.5-air, glm-4.6v, qwen3-14b, deepseek-R1 and gpt-4o-mini) were used to generate AIGC samples at three levels: light polishing, moderate rewriting and heavy rewriting. Light and moderate samples were derived from the original abstracts, whereas heavy rewriting samples were generated independently from the paper titles and keywords alone. After rule-based filtering, hierarchical constraints, removal of anomalous samples, semantic consistency verification, perplexity analysis and blind expert sampling, four data files were produced: light_paired.csv, medium_paired.csv, heavy_paired.csv and all_paired.csv. Together they contain 6,011 light pairs, 5,943 medium pairs and 6,343 heavy pairs, for a total of 18,297 pairs (36,594 text entries). The dataset retains two core fields, `text` and `label`, and provides auxiliary fields such as paper title, keywords, rewrite_level and change ratio, which support mixed-scenario model training, cross-dataset generalisation evaluation and text source analysis. Evaluation results indicate that MGRD can serve as a foundational data resource for research on AIGC detection in Chinese academic paper abstracts.
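The pair and text totals in the description can be checked with a few lines of Python, and the same sketch shows one way the paired files might be tallied by rewrite level and label. It is only an illustration: the column names `text`, `label` and `rewrite_level` come from the description, but the label encoding (0 for human, 1 for AIGC), the level values (`light`, `heavy`) and the sample rows are assumptions made here, so the header of the released CSV files should be checked before relying on it.

```python
import csv
import io
from collections import Counter

# Pair counts per rewrite level, as stated in the dataset description.
PAIR_COUNTS = {"light": 6011, "medium": 5943, "heavy": 6343}


def total_pairs_and_texts(pair_counts):
    """Return (total pairs, total text entries); each pair holds two texts."""
    pairs = sum(pair_counts.values())
    return pairs, 2 * pairs


def count_by_level_and_label(csv_file):
    """Count rows per (rewrite_level, label) in a CSV file-like object.

    Assumes the header contains `rewrite_level` and `label` columns;
    the real files may name or encode these fields differently.
    """
    reader = csv.DictReader(csv_file)
    return Counter((row["rewrite_level"], row["label"]) for row in reader)


# Synthetic placeholder rows standing in for all_paired.csv (not real data).
# The 0/1 label encoding is an assumption for illustration only.
SAMPLE = """text,label,rewrite_level
placeholder human abstract,0,light
placeholder polished abstract,1,light
placeholder human abstract,0,heavy
placeholder generated abstract,1,heavy
"""

if __name__ == "__main__":
    print(total_pairs_and_texts(PAIR_COUNTS))
    print(count_by_level_and_label(io.StringIO(SAMPLE)))
```

To run the tally on a downloaded file, the `io.StringIO(SAMPLE)` argument could be replaced with `open("all_paired.csv", newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded.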
Provider:
Science Data Bank
Created:
2026-04-14



