Workflow for detecting biomedical articles with openly available underlying datasets - Datasets and extraction forms

NIAID Data Ecosystem, indexed 2026-05-01

Download link:

https://zenodo.org/record/8249757

Resource description:

The open data screening datasets contain the Open Data statements automatically detected (TRUE) by ODDPub, together with their manual validation using the Numbat extraction tool. The extraction forms for both screenings (2020 and 2021) are also included, as is the manually processed dataset used to calculate the inter-rater reliability of the manual validation.

(i) Data from articles published in 2020 (file ‘charite_open_data_2020.csv’) were collected using a slightly different sequence of questions in the extraction workflow than data from articles published in 2021 (file ‘charite_open_data_2021.csv’). Both datasets were cleaned of any personal data and internal comments. They therefore do not contain the default columns that, in the raw Numbat export, hold comments on individual questions. The files also differ from raw Numbat output in another respect: they are a processed version. Articles validated by two or more raters were first reconciled in Numbat, resulting in one final decision per article (the output of extractions after reconciliation). Then, from the output of extractions before reconciliation, the articles validated by only one rater (and thus not part of the inter-rater reliability calculation) were selected and joined with the already reconciled dataset.

The decision about the openness of a validated dataset can be analysed in several ways. Column ‘open_data_assessment’/‘assessment’ shows a binary decision between Open Data TRUE and FALSE. If that column is ‘NULL’, the dataset was classified as not open, and the result can be found in one of the following ways:
Column ‘reference_to_data’ is ‘n_a’ for excluded articles, e.g. those not producing any data.
Column ‘data_access’ is ‘restricted’.
Column ‘own_or_reuse_data’ is ‘open_data_reuse’.

The original extraction form contains an option ‘unsure_open_data’ alongside ‘open_data’/‘no_open_data’. That option was resolved either during reconciliation between multiple raters or, in case of doubt, by case-by-case consultation with a second rater, and is not included here.

(ii) The inter-rater reliability calculation was based on 100 randomly selected articles screened by 2 raters. A third rater screened a sample of 20 articles drawn from those 100. The tables provided here include both article-level and dataset-level data.

(iii) The Numbat extraction forms used for the screenings in 2020 and 2021 are included in two formats: JSON and Markdown.

(iv) The table ‘data_dictionary_open_data.csv’ documents all variables of each data file contained here.
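The column logic described above for reading the openness decision can be sketched in Python as follows. This is only an illustration, not part of the deposited workflow: the function name classify_openness, the returned labels, and the sample rows are made up for the example, and it assumes the cells hold the literal strings TRUE, FALSE and NULL, which should be checked against ‘data_dictionary_open_data.csv’ before use.

```python
import csv
import io


def classify_openness(row):
    """Map one CSV row (a dict) to a coarse openness label.

    Assumption: the assessment column is named either
    'open_data_assessment' or 'assessment' and holds TRUE, FALSE or NULL.
    """
    assessment = (
        row.get("open_data_assessment") or row.get("assessment") or "NULL"
    ).strip().upper()
    if assessment == "TRUE":
        return "open_data"
    if assessment == "FALSE":
        return "no_open_data"
    # Assessment is NULL: look for the reason in the other columns.
    if (row.get("reference_to_data") or "").strip() == "n_a":
        return "excluded"
    if (row.get("data_access") or "").strip() == "restricted":
        return "restricted"
    if (row.get("own_or_reuse_data") or "").strip() == "open_data_reuse":
        return "open_data_reuse"
    return "unclassified"


# Made-up sample rows; the real files contain many more columns.
SAMPLE = """assessment,reference_to_data,data_access,own_or_reuse_data
TRUE,,,
FALSE,,,
NULL,n_a,,
NULL,,restricted,
NULL,,,open_data_reuse
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
labels = [classify_openness(r) for r in rows]
```

To apply the same function to the deposited files, the in-memory sample can be replaced by an open file handle for ‘charite_open_data_2020.csv’ or ‘charite_open_data_2021.csv’, again assuming the column names and cell values match those described above.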

Created:

2023-09-02
