
CZ Software Mentions: A large dataset of software mentions in the biomedical literature - Expanded 2024

Indexed from NIAID Data Ecosystem on 2026-05-02
Download link:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.zgmsbccjk
Description:
We release a dataset of software mentions in open access biomedical papers published on bioRxiv or medRxiv, or stored in Europe PMC. The mentions are extracted with a trained BERT model. The dataset provides sources, context, metadata, software, and links. This is a continuation of our previous dataset, https://datadryad.org/stash/dataset/doi:10.5061/dryad.6wwpzgn2c, based on an expanded set of papers.

Methods

We used the following corpus of papers:

Source        Last updated    Number of unique DOIs
bioRxiv       2023/08/23      198247
medRxiv       2023/08/24      43404
Europe PMC    2023/06/12      5289607

We ran a BERT-based NER model on this corpus to extract plain-text software mentions. The model was trained on the SoftCite dataset, uses the SciBERT model architecture, and has an F1 score of 0.922. More details about the model can be found at https://github.com/chanzuckerberg/software-mention-extraction. We performed linking and disambiguation following the procedures described in our paper: https://arxiv.org/abs/2209.00693
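
To illustrate what "extracting plain-text software mentions" from NER output can look like, here is a minimal sketch in Python. It is not the authors' code. It assumes the model emits one BIO-style label per token, and the label names `B-software` / `I-software`, the function `decode_bio`, and the example sentence are all hypothetical, chosen only for illustration.

```python
from typing import List


def decode_bio(tokens: List[str], tags: List[str], entity: str = "software") -> List[str]:
    """Merge contiguous B-/I- tagged tokens into plain-text mentions.

    Assumes a BIO tagging scheme; the label names are hypothetical and
    may differ from those used by the actual model.
    """
    mentions: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == f"B-{entity}":
            # A new mention starts; flush any mention still in progress.
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif tag == f"I-{entity}" and current:
            # Continue the mention that is currently open.
            current.append(token)
        else:
            # Outside tag (or a stray I- tag with no open mention).
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


# Made-up example sentence with made-up labels.
tokens = ["We", "aligned", "reads", "with", "Bowtie", "2", "and",
          "plotted", "in", "ggplot2", "."]
tags = ["O", "O", "O", "O", "B-software", "I-software", "O",
        "O", "O", "B-software", "O"]

print(decode_bio(tokens, tags))  # ['Bowtie 2', 'ggplot2']
```

The extracted strings would then still need the linking and disambiguation step mentioned above, which this sketch does not attempt.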
Created:
2024-11-12