
bigbio/progene

Source: Hugging Face | Updated: 2022-12-22 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/bigbio/progene
Resource description:
---
language:
- en
bigbio_language:
- English
license: cc-by-4.0
multilinguality: monolingual
bigbio_license_shortname: CC_BY_4p0
pretty_name: ProGene
homepage: https://zenodo.org/record/3698568#.YlVHqdNBxeg
bigbio_pubmed: True
bigbio_public: True
bigbio_tasks:
- NAMED_ENTITY_RECOGNITION
---

# Dataset Card for ProGene

## Dataset Description

- **Homepage:** https://zenodo.org/record/3698568#.YlVHqdNBxeg
- **Pubmed:** True
- **Public:** True
- **Tasks:** NER

The Protein/Gene corpus was developed at the JULIE Lab Jena under supervision of Prof. Udo Hahn. The executing scientist was Dr. Joachim Wermter. The main annotator was Dr. Rico Pusch who is an expert in biology. The corpus was developed in the context of the StemNet project (http://www.stemnet.de/).

## Citation Information

```
@inproceedings{faessler-etal-2020-progene,
    title = "{P}ro{G}ene - A Large-scale, High-Quality Protein-Gene Annotated Benchmark Corpus",
    author = "Faessler, Erik and Modersohn, Luise and Lohr, Christina and Hahn, Udo",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.564",
    pages = "4585--4596",
    abstract = "Genes and proteins constitute the fundamental entities of molecular genetics. We here introduce ProGene (formerly called FSU-PRGE), a corpus that reflects our efforts to cope with this important class of named entities within the framework of a long-lasting large-scale annotation campaign at the Jena University Language {\&} Information Engineering (JULIE) Lab. We assembled the entire corpus from 11 subcorpora covering various biological domains to achieve an overall subdomain-independent corpus. It consists of 3,308 MEDLINE abstracts with over 36k sentences and more than 960k tokens annotated with nearly 60k named entity mentions. Two annotators strove for carefully assigning entity mentions to classes of genes/proteins as well as families/groups, complexes, variants and enumerations of those where genes and proteins are represented by a single class. The main purpose of the corpus is to provide a large body of consistent and reliable annotations for supervised training and evaluation of machine learning algorithms in this relevant domain. Furthermore, we provide an evaluation of two state-of-the-art baseline systems {---} BioBert and flair {---} on the ProGene corpus. We make the evaluation datasets and the trained models available to encourage comparable evaluations of new methods in the future.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
```
Provided by:
bigbio
Summary of original information

Dataset overview

Basic information

  • Name: ProGene
  • Language: English
  • License: CC-BY-4.0
  • Multilinguality: monolingual
  • PubMed availability: yes
  • Public: yes

Task type

  • Primary task: named entity recognition (NER)

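Dataset details

  • Developing institution: JULIE Lab Jena
  • Executing scientist: Dr. Joachim Wermter
  • Main annotator: Dr. Rico Pusch
  • Development context: StemNet project
  • Data composition: 3,308 MEDLINE abstracts with over 36,000 sentences and more than 960,000 tokens, annotated with nearly 60,000 named entity mentions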
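
The composition figures above can be turned into rough density estimates. This is a minimal sketch, assuming the rounded counts from the description are used as-is; the source states them only as bounds ("over", "more than", "nearly"), so the ratios are approximations. The constant names and the `ratio` helper are illustrative and not part of the dataset.

```python
# Approximate corpus counts as stated in the dataset description.
# The source gives them as bounds ("over", "more than", "nearly"),
# so every ratio below is a rough estimate, not an exact statistic.
ABSTRACTS = 3_308
SENTENCES = 36_000   # "over 36k"
TOKENS = 960_000     # "more than 960k"
MENTIONS = 60_000    # "nearly 60k"


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator rounded to one decimal place."""
    return round(numerator / denominator, 1)


density = {
    "sentences_per_abstract": ratio(SENTENCES, ABSTRACTS),
    "tokens_per_sentence": ratio(TOKENS, SENTENCES),
    "mentions_per_abstract": ratio(MENTIONS, ABSTRACTS),
    "mentions_per_sentence": ratio(MENTIONS, SENTENCES),
}

for name, value in density.items():
    print(f"{name}: {value}")
```

Under these rounded inputs, the corpus averages roughly 11 sentences per abstract, about 27 tokens per sentence, and about 18 entity mentions per abstract.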

Citation information

@inproceedings{faessler-etal-2020-progene,
    title = "{P}ro{G}ene - A Large-scale, High-Quality Protein-Gene Annotated Benchmark Corpus",
    author = "Faessler, Erik and Modersohn, Luise and Lohr, Christina and Hahn, Udo",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.564",
    pages = "4585--4596",
    abstract = "...",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
