Europe PMC Full Text Corpus
Source: DataCite Commons. Updated: 2025-06-01. Indexed: 2024-08-18.
Download link:
https://figshare.com/articles/dataset/Europe_PMC_Full_Text_Corpus/22848380/2
Description:
This repository contains the Europe PMC full text corpus, a collection of 300 articles from the Europe PMC Open Access subset. Each article is manually annotated by curators with three core entity types: Gene/Protein, Disease, and Organism. <br> Corpus Directory Structure <br> <code><strong>annotations/</strong></code><strong>:</strong> contains annotations of the 300 full-text articles in the Europe PMC corpus. Annotations are provided in three formats.<br> <code><em>hypothesis/csv/</em></code><em>:</em> contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format. <code>GROUP0/</code>: contains raw manual annotations made by curator GROUP0. <code>GROUP1/</code>: contains raw manual annotations made by curator GROUP1. <code>GROUP2/</code>: contains raw manual annotations made by curator GROUP2. <br> <code><em>IOB/</em></code><em>: </em>contains annotations automatically extracted from the raw manual annotations in <code>hypothesis/csv/</code>, in Inside–Outside–Beginning (IOB) tagging format. <code>dev/</code>: contains IOB-format annotations of 45 articles, intended for use as a development set in machine learning tasks. <code>test/</code>: contains IOB-format annotations of 45 articles, intended for use as a test set in machine learning tasks. <code>train/</code>: contains IOB-format annotations of 210 articles, intended for use as a training set in machine learning tasks. <br> <code><em>JSON/</em></code><em>: </em>contains annotations automatically extracted from the raw manual annotations in <code>hypothesis/csv/</code>, in JSON format. <code>README.md</code>: a detailed description of all the annotation formats. <br> <code><strong>articles/</strong></code><strong>: </strong>contains the full-text articles annotated in the Europe PMC corpus.<br> <code>Sentencised/</code>: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser. 
<code>XML/</code>: contains XML articles fetched directly using the Europe PMC Article RESTful API. <code>README.md</code>: a detailed description of the sentencising and fetching of XML articles. <br> <code><strong>docs/</strong></code><strong>: </strong>contains related documents that were used for generating the corpus.<br> <code>Annotation guideline.pdf</code>: the annotation guideline provided to curators to assist with manual annotation. <code>demo to molecular conenctions.pdf</code>: the annotation platform guideline provided to curators to help them become familiar with the Hypothes.is platform. <code>Training set development.pdf</code>: the initial document detailing the paper selection procedure. <br> <code><strong>pilot/</strong></code><strong>: </strong>contains annotations and articles that were used in a pilot study.<br> <code>annotations/csv/</code>: contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format. <code>articles/</code>: contains the full-text articles annotated in the pilot study. <code>Sentencised/</code>: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser. <code>XML/</code>: contains XML articles fetched directly using the Europe PMC Article RESTful API. <code>README.md</code>: a detailed description of the sentencising and fetching of XML articles. <br> <code><strong>src/</strong></code><strong>: </strong>source code for cleaning annotations and generating IOB files.<br> <code>metrics/ner_metrics.py</code>: Python script containing SemEval evaluation metrics. <code>annotations.py</code>: Python script used to extract annotations from raw Hypothes.is annotations. <code>generate_IOB_dataset.py</code>: Python script used to convert JSON-format annotations to IOB tagging format. <code>generate_json_dataset.py</code>: Python script used to extract annotations to JSON format. 
<code>hypothesis.py</code>: Python script used to fetch raw Hypothes.is annotations. <br> License <br> <code>CCBY</code> <br> Feedback <br> For any comments, questions, or suggestions, please contact us at helpdesk@europepmc.org or via the Europe PMC contact page.
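As a rough illustration of the Inside–Outside–Beginning tagging mentioned above, the sketch below collapses a tagged token sequence back into entity mentions. It is not taken from the corpus's own <code>src/</code> scripts: the function name <code>iob_to_spans</code>, the tag labels (<code>GENE</code>, <code>DISEASE</code>, <code>ORGANISM</code>) and the example sentence are all hypothetical, and the actual file layout and label names are described in the corpus's <code>annotations/README.md</code>.

```python
from typing import List, Tuple


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse parallel token/tag lists into (entity_text, entity_type) pairs.

    Assumes tags of the form "B-<TYPE>", "I-<TYPE>" or "O"; the type names
    used below are hypothetical and may differ from the corpus's labels.
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None

    def flush() -> None:
        # Close the entity that is currently open, if any.
        nonlocal current_tokens, current_type
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            # (dropped here; a stricter reader might raise an error instead).
            flush()
    flush()
    return spans


# Made-up example sentence, not drawn from the corpus.
example_tokens = ["Mutations", "in", "BRCA1", "cause", "breast", "cancer",
                  "in", "Homo", "sapiens", "."]
example_tags = ["O", "O", "B-GENE", "O", "B-DISEASE", "I-DISEASE",
                "O", "B-ORGANISM", "I-ORGANISM", "O"]
example_spans = iob_to_spans(example_tokens, example_tags)
```

Joining tokens with a single space is a simplification; recovering exact character offsets would require the original sentence text.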
Provider:
figshare
Created:
2023-05-25



