Europe PMC Full Text Corpus

Name: Europe PMC Full Text Corpus
Creator: figshare
Published: 2025-05-01 07:07:38
License: No description provided

DataCite Commons. Updated: 2025-05-01. Indexed: 2024-08-18.

Download link:

https://figshare.com/articles/dataset/Europe_PMC_Full_Text_Corpus/22848380/1


Resource description:

This repository contains the Europe PMC full text corpus, a collection of 300 articles from the Europe PMC Open Access subset. Each article is manually annotated by curators with 3 core entity types: Gene/Protein, Disease and Organism.

Corpus Directory Structure

- <code>annotations/</code>: contains annotations of the 300 full-text articles in the Europe PMC corpus. Annotations are provided in 3 different formats.
  - <code>hypothesis/csv/</code>: contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format.
    - <code>GROUP0/</code>: contains raw manual annotations made by curator GROUP0.
    - <code>GROUP1/</code>: contains raw manual annotations made by curator GROUP1.
    - <code>GROUP2/</code>: contains raw manual annotations made by curator GROUP2.
  - <code>IOB/</code>: contains annotations in Inside–Outside–Beginning (IOB) tagging format, automatically extracted from the raw manual annotations in <code>hypothesis/csv/</code>.
    - <code>dev/</code>: contains IOB format annotations of 45 articles, intended for use as a development set in machine learning tasks.
    - <code>test/</code>: contains IOB format annotations of 45 articles, intended for use as a test set in machine learning tasks.
    - <code>train/</code>: contains IOB format annotations of 210 articles, intended for use as a training set in machine learning tasks.
  - <code>JSON/</code>: contains annotations in JSON format, automatically extracted from the raw manual annotations in <code>hypothesis/csv/</code>.
  - <code>README.md</code>: a detailed description of all the annotation formats.
- <code>articles/</code>: contains the full-text articles annotated in the Europe PMC corpus.
  - <code>Sentencised/</code>: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser.
  - <code>XML/</code>: contains XML articles fetched directly using the Europe PMC Article RESTful API.
  - <code>README.md</code>: a detailed description of the sentencising and fetching of XML articles.
- <code>docs/</code>: contains related documents that were used for generating the corpus.
  - <code>Annotation guideline.pdf</code>: annotation guideline provided to curators to assist the manual annotation.
  - <code>demo to molecular conenctions.pdf</code>: annotation platform guideline provided to curators to help them get familiar with the Hypothes.is platform.
  - <code>Training set development.pdf</code>: initial document that details the paper selection procedures.
- <code>pilot/</code>: contains annotations and articles that were used in a pilot study.
  - <code>annotations/csv/</code>: contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format.
  - <code>articles/</code>: contains the full-text articles annotated in the pilot study.
    - <code>Sentencised/</code>: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser.
    - <code>XML/</code>: contains XML articles fetched directly using the Europe PMC Article RESTful API.
    - <code>README.md</code>: a detailed description of the sentencising and fetching of XML articles.
- <code>src/</code>: source code for cleaning annotations and generating IOB files.
  - <code>metrics/ner_metrics.py</code>: Python script containing SemEval evaluation metrics.
  - <code>annotations.py</code>: Python script used to extract annotations from raw Hypothes.is annotations.
  - <code>generate_IOB_dataset.py</code>: Python script used to convert JSON format annotations to IOB tagging format.
  - <code>generate_json_dataset.py</code>: Python script used to extract annotations to JSON format.
  - <code>hypothesis.py</code>: Python script used to fetch raw Hypothes.is annotations.

License

<code>CCBY</code>

Feedback

For any comments, questions, or suggestions, please contact us through helpdesk@europepmc.org or the Europe PMC contact page.
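To make the IOB tagging format concrete, here is a minimal sketch of reading IOB-tagged text and collapsing the tags back into entity spans. It is not the corpus's own loader. The file layout (one token and one tag per line, separated by whitespace, with a blank line between sentences) and the tag names `GP`, `DS` and `OG` are assumptions made for illustration; the corpus's `annotations/README.md` is the authority on the real layout.

```python
from typing import Iterable, List, Optional, Tuple

Sentence = List[Tuple[str, str]]  # (token, IOB tag) pairs


def read_iob(lines: Iterable[str]) -> List[Sentence]:
    """Group (token, tag) pairs into sentences; a blank line ends a sentence.

    Assumes the token is the first whitespace-separated column and the
    tag is the last one (an assumption, not a documented guarantee).
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def iob_to_spans(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collapse B-/I- runs into (entity_text, entity_type) spans.

    An I- tag that does not continue an open span of the same type is
    treated like O and dropped.
    """
    spans: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label: Optional[str] = None
    for token, tag in sentence:
        if tag.startswith("B-"):
            if tokens:
                spans.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-") and tokens and tag[2:] == label:
            tokens.append(token)
        else:
            if tokens:
                spans.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        spans.append((" ".join(tokens), label))
    return spans


# Hypothetical sample; the tag names are illustrative only.
sample = """\
Mutations O
in O
BRCA1 B-GP
cause O
breast B-DS
cancer I-DS
in O
humans B-OG

""".splitlines()

sentences = read_iob(sample)
spans = iob_to_spans(sentences[0])
```

For a real file, the same functions can be applied to an open file handle, for example `read_iob(open(path, encoding="utf-8"))`, where `path` points at a file under `annotations/IOB/train/`.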

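The <code>XML/</code> directories are described as holding articles fetched through the Europe PMC Article RESTful API. A minimal sketch of such a fetch is below. It assumes the public `fullTextXML` endpoint of the Europe PMC RESTful web service; whether the corpus authors used exactly this endpoint is not stated in the description, and the helper names are hypothetical.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base URL of the Europe PMC RESTful web service (assumed endpoint).
BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest"


def full_text_xml_url(pmcid: str) -> str:
    """Build the full-text XML URL for an identifier such as 'PMC1234567'."""
    pmcid = pmcid.strip().upper()
    if not pmcid.startswith("PMC") or not pmcid[3:].isdigit():
        raise ValueError(f"expected an identifier like 'PMC1234567', got {pmcid!r}")
    return f"{BASE_URL}/{quote(pmcid)}/fullTextXML"


def fetch_full_text_xml(pmcid: str, timeout: float = 30.0) -> str:
    """Download the article XML as text (requires network access)."""
    with urlopen(full_text_xml_url(pmcid), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_full_text_xml("PMC1234567")` (a placeholder identifier) would return the article XML as a string if that article is in the Open Access subset; articles outside it are not served by this endpoint.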

Provider:

figshare

Created:

2023-05-25
