MilkOligoCorpus, a rich semantic annotated resource for milk oligosaccharide complex information extraction

Name: MilkOligoCorpus, a rich semantic annotated resource for milk oligosaccharide complex information extraction
Creator: Recherche Data Gouv
Published: 2025-05-16 23:23:38
License: No description available

DataCite Commons. Updated: 2025-05-16. Indexed: 2025-04-16.

Download link:

https://entrepot.recherche.data.gouv.fr/citation?persistentId=doi:10.57745/LFXGFO


Resource description:

The MilkOligoCorpus is a dataset of 30 PubMed abstracts and full-text extracts from scientific articles on the composition of milk oligosaccharides in mammalian species, manually annotated for training and evaluating information extraction tools. The corpus is designed to support the development and assessment of tools for named entity recognition, entity linking and relation extraction that capture the variability of milk oligosaccharide profiles. Named entity linking is essential for integrating information from diverse sources: it maps entity mentions to standard categories and associates them with unique identifiers. Because no ontologies existed for several entity types, we developed four semantic resources alongside the corpus annotation: (i) the Female parity thesaurus, (ii) the sample thesaurus, (iii) the MO methods thesaurus, (iv) the Oligo type thesaurus available at https://doi.org/10.57745/RA5DAC. We also developed an annotation schema that identifies the entities of interest and establishes the relations between them. This schema is the foundation for the manual annotations, together with the guidelines, a 66-page document with instructions on how to perform the annotations, available in the repository Z. This archive includes: (i) the HoloOligo corpus dataset, (ii) the list of the documents annotated in the HoloOligo corpus, (iii) the three thesauri required for the manual annotation, which are not available elsewhere, (iv) the annotation schema. An article detailing the development of the annotation schema and the creation of the gold standard corpus will be submitted to PLOS One.
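To make the entity linking step concrete, here is a minimal Python sketch of dictionary-based linking: it finds thesaurus terms in a sentence and attaches an identifier to each mention. The thesaurus entries, the identifiers (`PARITY:0001`, `PARITY:0002`), the example sentence and the function name `link_mentions` are all hypothetical and chosen for illustration; they are not taken from the MilkOligoCorpus thesauri, and this is not the method used to build the corpus.

```python
# Hypothetical toy thesaurus: surface form -> (identifier, preferred label).
# Entries and identifiers are invented for illustration only.
TOY_THESAURUS = {
    "primiparous": ("PARITY:0001", "primiparous"),
    "first parity": ("PARITY:0001", "primiparous"),
    "multiparous": ("PARITY:0002", "multiparous"),
}


def link_mentions(text, thesaurus):
    """Return sorted (start, end, mention, identifier) tuples for each term found."""
    lowered = text.lower()  # case-insensitive match; offsets stay valid for ASCII text
    links = []
    for surface, (identifier, _label) in thesaurus.items():
        start = lowered.find(surface)
        while start != -1:
            end = start + len(surface)
            links.append((start, end, text[start:end], identifier))
            start = lowered.find(surface, end)
    return sorted(links)


sentence = "Milk was collected from primiparous and multiparous sows."
for start, end, mention, identifier in link_mentions(sentence, TOY_THESAURUS):
    print(f"{mention!r} [{start}:{end}] -> {identifier}")
```

Two surface forms ("primiparous", "first parity") share one identifier here, which is the point of linking: different wordings in different articles resolve to the same concept. A real system would also need to handle inflection, abbreviations and ambiguous mentions, which plain substring matching does not.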


Provider:

Recherche Data Gouv

Created:

2024-11-14

Collected summary

Dataset introduction

Background and challenges

Background overview

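MilkOligoCorpus is a semantically annotated dataset of 30 scientific documents, focused on extracting information about the oligosaccharide composition of mammalian milk. It provides a detailed annotation schema and four dedicated thesauri, and is intended to support the development and evaluation of tools for named entity recognition, entity linking and relation extraction.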

The above content was collected and summarized by 遇见数据集.