AI-Generated Scientific Text Dataset (AIGTxt)
Source: Mendeley Data. Indexed 2026-04-18.
Download link:
https://data.mendeley.com/datasets/y9bj7734vf
Description:
The AI-Generated Scientific Text dataset (AIGTxt) is a collection of scientific texts generated by ChatGPT, designed to support the improvement of AI-generated text detection methods. AIGTxt comprises texts from published academic articles across ten domains. The selected research articles met three main criteria:
(1) They are written in English.
(2) They were published before 2022, which ensures that ChatGPT did not generate them.
(3) They fall within at least one of the following ten domains: computer science and artificial intelligence, natural language processing, medical research and healthcare, materials science and engineering, neuroscience and psychology, genetics and genomics, climate science and environmental studies, mathematics and statistics, astrophysics and astronomy, and social sciences and humanities.
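The three criteria can be read as a single filter over candidate articles. The following is a minimal sketch only: the `Article` record, its field names, and the lowercase domain strings are hypothetical and are not part of the dataset or its documentation.

```python
from dataclasses import dataclass

# The ten domains named in criterion (3). The exact string spellings are an
# assumption made for this sketch.
DOMAINS = frozenset({
    "computer science and artificial intelligence",
    "natural language processing",
    "medical research and healthcare",
    "materials science and engineering",
    "neuroscience and psychology",
    "genetics and genomics",
    "climate science and environmental studies",
    "mathematics and statistics",
    "astrophysics and astronomy",
    "social sciences and humanities",
})


@dataclass(frozen=True)
class Article:
    """Hypothetical metadata record for a candidate article."""
    language: str
    year: int
    domains: frozenset


def meets_criteria(article: Article) -> bool:
    """Return True only if all three selection criteria hold."""
    written_in_english = article.language == "English"    # criterion (1)
    published_before_2022 = article.year < 2022           # criterion (2)
    in_listed_domain = bool(article.domains & DOMAINS)    # criterion (3)
    return written_in_english and published_before_2022 and in_listed_domain
```

An article published in 2022 or later fails the filter even when it satisfies the language and domain criteria.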
The current AIGTxt version contains 10,821 records spanning three classes (Human-written, ChatGPT-generated, and Mixed text), with 3,607 samples per class. Human-written texts were manually selected as one to three paragraphs from the introduction, background, or literature review sections of academic papers. The ChatGPT-generated texts were produced by rephrasing the original human-written content. Mixed texts were constructed by combining 50% human-written content with 50% ChatGPT-generated content, providing a balanced representation of blended authorship. The average paragraph length is approximately 168 words, and the dataset has a rich vocabulary of more than 57,398 unique words (computed after excluding stopwords and citations). Domain-specific statistics show that the Computer Science & Artificial Intelligence domain has the most records (1,644), while the Mathematics & Statistics domain has the fewest (645).
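The description does not say at what granularity the 50/50 mix was made (words, sentences, or paragraphs) or in what order the two parts appear. As an illustration only, the sketch below assumes a word-level split that keeps the first half of the human text and the second half of the ChatGPT rewrite; the function name `mix_half_and_half` is hypothetical and this is not necessarily the authors' procedure.

```python
def mix_half_and_half(human_text: str, chatgpt_text: str) -> str:
    """Join the first half of the human text to the second half of the rewrite.

    Assumption: the split is made on whitespace-separated words. The dataset
    description only states the 50%/50% proportion.
    """
    human_words = human_text.split()
    chatgpt_words = chatgpt_text.split()
    head = human_words[: len(human_words) // 2]       # first 50% of the human words
    tail = chatgpt_words[len(chatgpt_words) // 2:]    # last 50% of the ChatGPT words
    return " ".join(head + tail)
```

Because the ChatGPT text is a rephrasing of the same passage, taking complementary halves keeps the mixed paragraph roughly aligned with the original content, although a sentence-level split would give cleaner boundaries.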
Column Descriptions:
(1) Human text written: Paragraphs extracted from the scientific publications.
(2) ChatGPT text written: Rewrites of the extracted human paragraphs, produced through manual interaction with ChatGPT.
(3) Mixed text written: Paragraphs that combine content from Human text written and ChatGPT text written.
(4) Domain: The domain that identifies the paragraph's topic.
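For a three-class detection task, the columns above are typically flattened into one labeled sample per text. The sketch below assumes that each row holds the three versions of one passage and that rows arrive as dictionaries keyed by the column names listed above (for example from `csv.DictReader`, if the file is distributed as CSV, which this listing does not state). The label strings and the function name `to_labeled_samples` are hypothetical.

```python
# Map each text column (named as in the column descriptions) to a class label.
# The label strings are an assumption made for this sketch.
TEXT_COLUMNS = {
    "Human text written": "human",
    "ChatGPT text written": "chatgpt",
    "Mixed text written": "mixed",
}


def to_labeled_samples(rows):
    """Turn wide rows (one per passage) into long (text, label, domain) samples."""
    samples = []
    for row in rows:
        for column, label in TEXT_COLUMNS.items():
            samples.append({
                "text": row[column],
                "label": label,
                "domain": row["Domain"],
            })
    return samples
```

Under that row layout, 3,607 rows would yield 3 × 3,607 = 10,821 samples, which matches the record count given in the description.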
** This dataset is associated with the research article: Alhijawi, Bushra, Rawan Jarrar, Aseel AbuAlRub, and Arwa Bader. "Deep learning detection method for large language models-generated scientific content." Neural Computing and Applications 37, no. 1 (2025): 91-104.
Created:
2025-12-30



