bigbio/mantra_gsc

Name: bigbio/mantra_gsc
Creator: bigbio
Published: 2024-07-18 14:49:51
License: GPL-3.0

Source: Hugging Face, updated 2024-07-18, indexed 2024-07-22

Download link:

https://hf-mirror.com/datasets/bigbio/mantra_gsc

Resource overview:

The Mantra GSC dataset consists of text units selected from different parallel corpora (Medline abstract titles, drug labels, and biomedical patent claims) in English, French, German, Spanish, and Dutch. For each language, three annotators independently annotated biomedical concepts based on a subset of the Unified Medical Language System (UMLS), covering a wide range of semantic groups. Automatically generated pre-annotations were provided to reduce the annotation workload. The individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final set of 5530 annotations. The dataset supports two tasks: named entity recognition (NER) and named entity disambiguation (NED).
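The description does not say which rule was used to harmonize the three annotators' output. As a minimal sketch only, assuming a simple majority vote over identical (start, end, concept) triples, harmonization could look like the following. The function name `harmonize`, the `min_votes` parameter, the offsets, and the concept identifiers are all hypothetical and are not taken from the corpus.

```python
from collections import Counter


def harmonize(annotator_sets, min_votes=2):
    """Keep each (start, end, concept) triple marked by at least min_votes annotators.

    Assumption: majority voting on exact matches. The actual Mantra GSC
    harmonization procedure may differ.
    """
    votes = Counter()
    for annotations in annotator_sets:
        # set() ensures each annotator contributes one vote per triple
        votes.update(set(annotations))
    return sorted(a for a, n in votes.items() if n >= min_votes)


# Made-up annotations for the text "Aspirin for headache"
annotator_a = [(0, 7, "C0000001"), (12, 20, "C0000002")]
annotator_b = [(0, 7, "C0000001"), (12, 20, "C0000003")]
annotator_c = [(0, 7, "C0000001"), (12, 20, "C0000002")]

harmonized = harmonize([annotator_a, annotator_b, annotator_c])
print(harmonized)
```

In this toy case the concept chosen by only one annotator for the second span is dropped, which is the kind of disagreement that a later manual adjudication step would review.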

Provider:

bigbio

Summary of original information

Dataset overview

Dataset description

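Name: MantraGSC
Languages:
- Original languages: English, French, German, Dutch, Spanish
- Processed languages: English, French, German, Dutch, Spanish
License: GPL-3.0
Multilinguality: multilingual
Homepage: https://github.com/mi-erasmusmc/Mantra-Gold-Standard-Corpus
Publicly available: yes
Contains PubMed data: yes
Task types:
- Named entity recognition (NER)
- Named entity disambiguation (NED)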
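To make the two task types concrete, here is a minimal sketch of how a system's output might be scored against gold annotations. It assumes exact-match scoring: NER counts a prediction as correct when its character span matches, and NED additionally requires the concept identifier to match. The `Annotation` class, the `prf` function, and the concept identifiers are hypothetical placeholders, not the dataset's actual schema or official evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One concept mention: a character span plus a concept identifier."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    cui: str    # concept identifier (placeholder values below)


def prf(gold, pred, use_concept=True):
    """Exact-match precision, recall, and F1.

    use_concept=False compares spans only (NER);
    use_concept=True also compares the concept identifier (NED).
    """
    def key(a):
        return (a.start, a.end, a.cui) if use_concept else (a.start, a.end)

    gold_keys = {key(a) for a in gold}
    pred_keys = {key(a) for a in pred}
    tp = len(gold_keys & pred_keys)
    precision = tp / len(pred_keys) if pred_keys else 0.0
    recall = tp / len(gold_keys) if gold_keys else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Made-up example for the text "Aspirin for headache"
gold = [Annotation(0, 7, "C0000001"), Annotation(12, 20, "C0000002")]
pred = [Annotation(0, 7, "C0000001"), Annotation(12, 20, "C0000003")]

ner_scores = prf(gold, pred, use_concept=False)  # both spans found
ned_scores = prf(gold, pred, use_concept=True)   # one concept is wrong
print(ner_scores, ned_scores)
```

Here both spans are found, so the NER scores are perfect, while the wrong concept on the second mention halves the NED scores.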

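Data source

The text units were selected from different parallel corpora, including Medline abstract titles, drug labels, and biomedical patent claims.
For each language, three annotators independently annotated biomedical concepts based on a subset of the Unified Medical Language System, covering a wide range of semantic groups.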

Citation

@article{10.1093/jamia/ocv037,
  author = {Kors, Jan A and Clematide, Simon and Akhondi, Saber A and van Mulligen, Erik M and Rebholz-Schuhmann, Dietrich},
  title = "{A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC}",
  journal = {Journal of the American Medical Informatics Association},
  volume = {22},
  number = {5},
  pages = {948-956},
  year = {2015},
  month = {05},
  abstract = "{Objective: To create a multilingual gold-standard corpus for biomedical concept recognition. Materials and methods: We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations. Results: The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79), and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed equally well as the best annotator for that language. Discussion: The use of automatic preannotations, harmonized annotations, and parallel corpora helped to keep the manual annotation efforts manageable. The inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques. Conclusion: To our knowledge, this is the first gold-standard corpus for biomedical concept recognition in languages other than English. Other distinguishing features are the wide variety of semantic groups that are being covered, and the diversity of text genres that were annotated.}",
  issn = {1067-5027},
  doi = {10.1093/jamia/ocv037},
  url = {https://doi.org/10.1093/jamia/ocv037},
  eprint = {https://academic.oup.com/jamia/article-pdf/22/5/948/34146393/ocv037.pdf}
}
