Paper Mill Dataset
Source: DataCite Commons. Updated 2025-04-27; indexed 2025-05-18.
Download link:
https://www.scidb.cn/detail?dataSetId=429c3d22f4094aafb5e52a2dea914fc0
Description:
This study created a new dataset containing both normal papers and paper mill papers on similar topics. The authors hope to build a classification model on this dataset that is not affected by a paper's topic and can identify latent paper mill features in the writing or article structure. The study collected all papers explicitly marked as paper mill products in the Retraction Watch data; as of December 31, 2022, this yielded 1910 papers. Multiple open access platforms were used to supplement the papers' metadata, including titles, authors, publication dates, and original DOIs. The original PDFs of these papers (the texts as published before retraction) were then collected from open access platforms, and 1535 PDFs were successfully obtained. To collect normal papers similar to the paper mill papers, the study used the similar-paper functions provided by sites such as PubMed and ConnectPapers to gather at least one non-retracted paper on a similar topic for each paper mill paper, giving 2398 normal papers in total. The dataset is divided into training, validation, and test sets in a 7:1.5:1.5 ratio. The study also developed an automatic PDF parsing tool that handles the most common manuscript-submission PDF format: it analyzes the file's byte stream, reads the text in structural order, and identifies the paper's title, abstract, and main body.
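The 7:1.5:1.5 split mentioned in the description can be illustrated with a minimal Python sketch. The function name `split_dataset`, the fixed seed, and the plain random shuffle are assumptions made for illustration; the description does not say how the authors assigned papers to each subset (for example, whether the split was stratified by label).

```python
import random


def split_dataset(items, weights=(70, 15, 15), seed=0):
    """Shuffle items and split them into train/validation/test subsets.

    weights are integer percentages (70:15:15 equals 7:1.5:1.5).
    Hypothetical helper: not the dataset authors' actual procedure.
    """
    if sum(weights) != 100:
        raise ValueError("weights must sum to 100")

    shuffled = list(items)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = n * weights[0] // 100
    n_val = n * weights[1] // 100

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_dataset(range(100))
    print(len(train), len(val), len(test))
```

Because any rounding remainder falls into the test set, the three subsets always cover every item exactly once.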
Provider:
Science Data Bank
Created:
2024-07-26



