Europe PMC Full Text Corpus
Source: DataCite Commons | Updated: 2025-05-01 | Indexed: 2024-08-18
Download link:
https://figshare.com/articles/dataset/Europe_PMC_Full_Text_Corpus/22848380/1
Description:
This repository contains the Europe PMC full text corpus, a collection of 300 articles from the Europe PMC Open Access subset. Each article is manually annotated by curators with three core entity types: Gene/Protein, Disease, and Organism.
### Corpus Directory Structure
1. **annotations/**: contains annotations of the 300 full-text articles in the Europe PMC corpus, provided in three formats.
   - *hypothesis/csv/*: contains raw annotations fetched from the annotation platform Hypothes.is, in comma-separated values (CSV) format.
     - `GROUP0/`: raw manual annotations made by curator GROUP0.
     - `GROUP1/`: raw manual annotations made by curator GROUP1.
     - `GROUP2/`: raw manual annotations made by curator GROUP2.
   - *IOB/*: contains annotations in Inside–Outside–Beginning (IOB) tagging format, automatically extracted from the raw manual annotations in `hypothesis/csv/`.
     - `dev/`: IOB-format annotations of 45 articles, intended as the development set for machine learning tasks.
     - `test/`: IOB-format annotations of 45 articles, intended as the test set for machine learning tasks.
     - `train/`: IOB-format annotations of 210 articles, intended as the training set for machine learning tasks.
   - *JSON/*: contains annotations in JSON format, automatically extracted from the raw manual annotations in `hypothesis/csv/`.
   - `README.md`: a detailed description of all the annotation formats.
2. **articles/**: contains the full-text articles annotated in the Europe PMC corpus.
   - `Sentencised/`: XML articles whose text has been split into sentences using the Europe PMC sentenciser.
   - `XML/`: XML articles fetched directly through the Europe PMC Article RESTful API.
   - `README.md`: a detailed description of the sentencising and fetching of XML articles.
3. **docs/**: contains related documents that were used for generating the corpus.
   - `Annotation guideline.pdf`: annotation guideline provided to curators to assist the manual annotation.
   - `demo to molecular conenctions.pdf`: annotation platform guideline provided to curators to help them get familiar with the Hypothes.is platform.
   - `Training set development.pdf`: initial document that details the paper selection procedures.
4. **pilot/**: contains annotations and articles that were used in a pilot study.
   - `annotations/csv/`: raw annotations fetched from the annotation platform Hypothes.is, in CSV format.
   - `articles/`: the full-text articles annotated in the pilot study.
   - `Sentencised/`: XML articles whose text has been split into sentences using the Europe PMC sentenciser.
   - `XML/`: XML articles fetched directly through the Europe PMC Article RESTful API.
   - `README.md`: a detailed description of the sentencising and fetching of XML articles.
5. **src/**: source code for cleaning annotations and generating IOB files.
   - `metrics/ner_metrics.py`: Python script containing SemEval evaluation metrics.
   - `annotations.py`: Python script used to extract annotations from raw Hypothes.is annotations.
   - `generate_IOB_dataset.py`: Python script used to convert JSON format annotations to IOB tagging format.
   - `generate_json_dataset.py`: Python script used to extract annotations to JSON format.
   - `hypothesis.py`: Python script used to fetch raw Hypothes.is annotations.
### License
CC BY
### Feedback
For comments, questions, or suggestions, please contact us through helpdesk@europepmc.org or the Europe PMC contact page.
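To make the IOB tagging format used under `annotations/IOB/` more concrete, here is a minimal Python sketch of how such files are commonly read. It is not taken from the repository: the exact file layout is described in the corpus's own `annotations/README.md`, so this assumes one token per line with the tag in the last whitespace-separated column and a blank line between sentences. The tag names `GP`, `DS`, and `OG` and the function names `read_iob` and `extract_entities` are hypothetical.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, tag) pairs

# Hypothetical sample; the real files may use different columns and tag names.
SAMPLE = (
    "BRCA1\tB-GP\nmutations\tO\ncause\tO\nbreast\tB-DS\ncancer\tI-DS\n.\tO\n"
    "\n"
    "Mice\tB-OG\nwere\tO\nused\tO\n.\tO\n"
)


def read_iob(lines: Iterable[str]) -> List[Sentence]:
    """Group token/tag lines into sentences; a blank line ends a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumption: the tag is the last whitespace-separated column.
        token, tag = line.rsplit(None, 1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collapse B-/I- runs into (entity text, entity type) pairs."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label = None
    # The trailing ("", "O") sentinel flushes an entity that ends the sentence.
    for token, tag in sentence + [("", "O")]:
        continues = tag.startswith("I-") and label == tag[2:]
        if tokens and not continues:
            entities.append((" ".join(tokens), label))
            tokens, label = [], None
        if tag.startswith("B-") or (tag.startswith("I-") and not continues):
            # A stray I- tag is treated leniently as the start of an entity.
            tokens, label = [token], tag[2:]
        elif continues:
            tokens.append(token)
    return entities


sentences = read_iob(SAMPLE.splitlines())
entities = [extract_entities(s) for s in sentences]
```

For the hypothetical sample, `entities` holds one list per sentence, for example `("breast cancer", "DS")` in the first sentence. Real files should be checked against the README before relying on the column layout assumed here.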
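The description lists `src/metrics/ner_metrics.py` as containing SemEval evaluation metrics without giving details. As a rough illustration only, the sketch below computes entity-level precision, recall, and F1 under two matching schemes often associated with SemEval-style NER evaluation: "strict" (boundaries and type must match) and "exact" (boundaries must match, type is ignored). It is not the repository's implementation; the `Span` representation, the function name `span_scores`, and the tag names are assumptions.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity type)


def span_scores(gold: Set[Span], predicted: Set[Span], mode: str = "strict") -> Dict[str, float]:
    """Entity-level precision/recall/F1 for one matching scheme.

    mode="strict": start, end, and type must all match.
    mode="exact":  start and end must match; the type is ignored.
    """
    if mode == "strict":
        gold_keys, pred_keys = set(gold), set(predicted)
    elif mode == "exact":
        gold_keys = {(start, end) for start, end, _ in gold}
        pred_keys = {(start, end) for start, end, _ in predicted}
    else:
        raise ValueError(f"unknown mode: {mode}")

    correct = len(gold_keys & pred_keys)
    precision = correct / len(pred_keys) if pred_keys else 0.0
    recall = correct / len(gold_keys) if gold_keys else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical gold and predicted spans for one document.
gold = {(0, 5, "GP"), (22, 35, "DS"), (40, 45, "OG")}
predicted = {(0, 5, "GP"), (22, 35, "GP"), (50, 55, "OG")}

strict = span_scores(gold, predicted, mode="strict")
exact = span_scores(gold, predicted, mode="exact")
```

In this hypothetical case the second prediction has the right boundaries but the wrong type, so it counts as correct under "exact" and as an error under "strict". Partial-overlap and type-only schemes are left out to keep the sketch short.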
Provider:
figshare
Created:
2023-05-25