SoMeSci (Software Mentions in Scientific Articles)
Source: OpenDataLab | Updated: 2026-05-24 | Indexed: 2024-05-09
Download link:
https://opendatalab.org.cn/OpenDataLab/SoMeSci
Description:
Knowledge of the software used in scientific research is important for several reasons; for instance, it helps readers understand the provenance and methods involved in data processing. However, software is usually not formally cited but only mentioned informally in the scholarly description of a study, which raises the need for automatic information extraction and disambiguation. Given the lack of reliable ground-truth data, we present SoMeSci (Software Mentions in Science), a gold-standard knowledge graph of software mentions in scientific articles. It contains high-quality annotations (inter-rater reliability, IRR: κ = .82) of 3756 software mentions in 1367 PubMed Central articles. Beyond the plain mention of the software, we also provide relation labels for additional information such as the version, developer, URL, or citation. Furthermore, we distinguish between different software types (e.g., application, plugin, or programming environment) and different types of mention (e.g., usage or creation). To the best of our knowledge, SoMeSci is the most comprehensive corpus of software mentions in scientific articles, providing training samples for named entity recognition, relation extraction, entity disambiguation, and entity linking. Finally, we outline potential use cases and present baseline results for different tasks.
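To make the annotation layers described above more concrete, the following is a minimal sketch of how a single annotated mention could be held in memory: the mention span, its software type, its mention type, and its relation labels. The class name, field names, helper function, and example sentence are all hypothetical illustrations; they are not the official SoMeSci schema or file format, and the sentence is not taken from the corpus.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SoftwareMention:
    """Hypothetical in-memory record; NOT the official SoMeSci schema."""
    text: str            # surface form as written in the article
    start: int           # character offset where the mention begins
    end: int             # character offset just past the mention
    software_type: str   # e.g. "application", "plugin", "programming environment"
    mention_type: str    # e.g. "usage", "creation"
    # Relation labels attached to the mention, e.g. version, developer, URL, citation
    relations: Dict[str, str] = field(default_factory=dict)


def annotate(sentence: str, name: str, software_type: str,
             mention_type: str, **relations: str) -> SoftwareMention:
    """Build a record for the first occurrence of `name` in `sentence`."""
    start = sentence.find(name)
    if start < 0:
        raise ValueError(f"{name!r} does not occur in the sentence")
    # Every related value must also be found in the sentence, since it is
    # meant to be information stated alongside the mention.
    for label, value in relations.items():
        if value not in sentence:
            raise ValueError(f"{label} value {value!r} not found in the sentence")
    return SoftwareMention(
        text=name,
        start=start,
        end=start + len(name),
        software_type=software_type,
        mention_type=mention_type,
        relations=dict(relations),
    )


# Made-up example sentence for illustration only.
sentence = "Statistical analyses were performed using SPSS version 22 (IBM)."
mention = annotate(sentence, "SPSS", "application", "usage",
                   version="22", developer="IBM")
print(mention)
```

Keeping character offsets alongside the surface text is one common choice for span annotations, because it lets the same record serve both named entity recognition (the span) and relation extraction (the attached labels).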
Provider:
OpenDataLab
Created:
2022-09-01
Collected summary
Dataset introduction

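Background and challenges
Background overview
SoMeSci is a gold-standard knowledge graph of software mentions in scientific articles. It contains high-quality annotations of 3756 software mentions in 1367 PubMed Central articles and provides relation labels such as version and developer. The dataset supplies training samples for tasks such as named entity recognition and relation extraction, is described as the most comprehensive corpus of software mentions in the scientific domain, and was released in 2021.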
The summary above was collected and generated by 遇见数据集.



