CZ Software Mentions: A large dataset of software mentions in the biomedical literature - Expanded 2024
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.zgmsbccjk
Description:
We release a dataset of software mentions in open access biomedical papers published on bioRxiv or medRxiv, or stored in Europe PMC. The mentions are extracted with a trained BERT model. The dataset provides sources, context, metadata, software, and links.
This is a continuation of our previous dataset, https://datadryad.org/stash/dataset/doi:10.5061/dryad.6wwpzgn2c, based on an expanded set of papers.
Methods
We used the following corpus of papers:
Source        Last updated    Number of unique DOIs
bioRxiv       2023/08/23      198247
medRxiv       2023/08/24      43404
Europe PMC    2023/06/12      5289607
We ran a BERT-based named-entity recognition (NER) model over this corpus to extract plain-text software mentions. The model was trained on the SoftCite dataset, uses the SciBERT architecture, and has an F1 score of 0.922. More details about the model can be found at https://github.com/chanzuckerberg/software-mention-extraction. We performed linking and disambiguation following the procedures described in our paper, https://arxiv.org/abs/2209.00693.
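To illustrate how token-level NER output can become plain-text mentions, here is a minimal sketch of span decoding. It assumes a BIO tagging scheme with a hypothetical `SW` label and whitespace-joined tokens; the function name `bio_to_mentions` and the example sentence are illustrative only, and the tag set and post-processing in the released model may differ.

```python
from typing import List, Tuple


def bio_to_mentions(tokens: List[str], tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group BIO-tagged tokens into (start, end, text) mentions; end is exclusive.

    Assumes the hypothetical labels "B-SW", "I-SW", and "O".
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    mentions: List[Tuple[int, int, str]] = []
    start = None  # index of the first token of the currently open mention
    for i, tag in enumerate(tags):
        if tag == "B-SW":
            # A new mention begins; close the previous one if still open.
            if start is not None:
                mentions.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag == "I-SW" and start is not None:
            # Continuation of the open mention.
            continue
        else:
            # "O", or a stray "I-SW" with no open mention: close any open span.
            if start is not None:
                mentions.append((start, i, " ".join(tokens[start:i])))
            start = None
    if start is not None:
        # The sentence ended while a mention was still open.
        mentions.append((start, len(tokens), " ".join(tokens[start:])))
    return mentions


# Made-up example sentence, not taken from the dataset.
tokens = ["We", "used", "ImageJ", "and", "GraphPad", "Prism", "for", "analysis", "."]
tags = ["O", "O", "B-SW", "O", "B-SW", "I-SW", "O", "O", "O"]
mentions = bio_to_mentions(tokens, tags)
print(mentions)
```

In this sketch a stray `I-SW` tag with no preceding `B-SW` is dropped rather than promoted to a mention; that is a design choice for the example, not a documented behavior of the model.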
Created: 2024-11-12



