
MDPI Open Peer Review Corpus 2

Source: DataCite Commons. Updated: 2024-10-24. Indexed: 2025-04-16.
Download link:
https://repod.icm.edu.pl/citation?persistentId=doi:10.18150/SHKP7B
Resource description:

MDPI Open Peer Review Corpus 2

Section for Logic & Cognitive Science, Institute of Philosophy and Sociology, Polish Academy of Sciences
Cognitive Metascience Lab

Generated by Ksawery Jasieński, with some input from Remigiusz Depta, under the supervision of Marcin Miłkowski (2022-2023)

---

MDPI is committed to the idea of open peer review, but open reviews are voluntary. The reviews are not available for download in a single package, so they must be crawled from the MDPI website.

This dataset contains all peer reviews available on mdpi.com as of January 2023, covering over 135 thousand papers. The reviews are in plaintext format (look for TXT files). In addition, the corpus contains metadata in JSON format for individual reviews and author responses, as well as the original paper metadata. For reference, see the JSON `schema` files available in the GitHub repository associated with this project.

Additionally, this dataset contains the source HTML of each web page from which the text of reviews was extracted, as well as any supplementary materials attached to the reviews. The original files were not enriched with any linguistic annotation or converted to any other format (these are predominantly PDF and DOCX files, as uploaded through the MDPI editorial system by authors and reviewers).

We are making source code available for the dedicated crawler that was built to scrape the MDPI database. See the GitHub link below:
https://github.com/cognitive-metascience/review_crawler/tree/main/crawling

See this corpus on PubPeer:
https://pubpeer.com/publications/25353AAFD4FC52E2BEC8C7AD08B259#

---

This dataset is split into parts because of the upload limits of this repository. The archives are available in ZIP (33 parts, look for `.z[01-33]` files) and 7z (23 parts) formats. In addition, we provide the set of excluded articles with incomplete information (e.g., missing some reviews in the first round) in the `mdpi-dump-dir.zip` file.

The dataset after unpacking is a little over 170 GB in size.

The files are being made available under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Changelog

Current version - August 2023
- no new reviews added
- re-scraped (June 2023) the reviews that had some sections of text missing
- dataset now also includes the source HTML from mdpi.com for each reviewed article. The webpage content was cleaned before storing to an HTML file: specifically, all comments are removed from the document, as well as the following tags: 'script', 'style', 'noscript', 'link', 'rect'.

14th March 2023
- first submission of this dataset
- reviews were scraped in early January 2023
- dataset contains metadata for 135652 peer-reviewed articles from the MDPI database, along with full plain text for each review and any supplementary materials that were attached (PDF or DOC files containing e.g. author responses to comments)
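
The HTML cleaning step mentioned in the changelog (dropping comments and the 'script', 'style', 'noscript', 'link', and 'rect' tags) can be illustrated with a minimal sketch. This is not the crawler's actual code, which lives in the GitHub repository linked above; it is an assumption-laden approximation built on Python's standard `html.parser` module, and the names `HtmlCleaner` and `clean_html` are hypothetical.

```python
from html.parser import HTMLParser

# Tags listed in the changelog as removed from the stored HTML.
REMOVED_TAGS = {"script", "style", "noscript", "link", "rect"}
# Assumption: of the removed tags, only 'link' is a void element
# (it never has a closing tag), so it must not open a skipped region.
VOID_TAGS = {"link"}


class HtmlCleaner(HTMLParser):
    """Hypothetical helper: re-emits HTML without comments or removed tags."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a removed element

    def handle_starttag(self, tag, attrs):
        if tag in REMOVED_TAGS:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        if self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form, e.g. <rect ... />.
        if tag not in REMOVED_TAGS and self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in REMOVED_TAGS:
            if tag not in VOID_TAGS and self.skip_depth > 0:
                self.skip_depth -= 1
            return
        if self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        pass  # all comments are dropped

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")  # keep <!DOCTYPE ...>


def clean_html(html: str) -> str:
    """Return the HTML with comments and the removed tags stripped."""
    parser = HtmlCleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling `clean_html(page_source)` on a raw page string returns the cleaned markup. The sketch assumes well-formed input; a removed element that is never closed (for example a bare `<rect>` without `</rect>`) would suppress everything after it, so a real pipeline would likely use a more tolerant parser.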
Provider:
RepOD
Created:
2023-02-20