Europe PMC Full Text Corpus

NIAID Data Ecosystem, indexed 2026-05-01
Download link:
https://figshare.com/articles/dataset/Europe_PMC_Full_Text_Corpus/22848380
Description:
This repository contains the Europe PMC full text corpus, a collection of 300 articles from the Europe PMC Open Access subset. Each article is manually annotated by curators with 3 core entity types: Gene/Protein, Disease and Organism.

Corpus Directory Structure

annotations/: contains annotations of the 300 full-text articles in the Europe PMC corpus. Annotations are provided in 3 different formats.
    hypothesis/csv/: contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format.
        GROUP0/: contains raw manual annotations made by curator GROUP0.
        GROUP1/: contains raw manual annotations made by curator GROUP1.
        GROUP2/: contains raw manual annotations made by curator GROUP2.
    IOB/: contains annotations in Inside–Outside–Beginning (IOB) tagging format, automatically extracted from the raw manual annotations in hypothesis/csv/.
        dev/: contains IOB format annotations of 45 articles, intended for use as a development set in machine learning tasks.
        test/: contains IOB format annotations of 45 articles, intended for use as a test set in machine learning tasks.
        train/: contains IOB format annotations of 210 articles, intended for use as a training set in machine learning tasks.
    JSON/: contains annotations in JSON format, automatically extracted from the raw manual annotations in hypothesis/csv/.
    README.md: a detailed description of all the annotation formats.
articles/: contains the full-text articles annotated in the Europe PMC corpus.
    Sentencised/: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser.
    XML/: contains XML articles fetched directly using the Europe PMC Article RESTful API.
    README.md: a detailed description of the sentencising and fetching of XML articles.
docs/: contains related documents that were used for generating the corpus.
    Annotation guideline.pdf: annotation guideline provided to curators to assist the manual annotation.
    demo to molecular conenctions.pdf: annotation platform guideline provided to curators to help them get familiar with the Hypothes.is platform.
    Training set development.pdf: initial document that details the paper selection procedures.
pilot/: contains annotations and articles that were used in a pilot study.
    annotations/csv/: contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format.
    articles/: contains the full-text articles annotated in the pilot study.
        Sentencised/: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser.
        XML/: contains XML articles fetched directly using the Europe PMC Article RESTful API.
        README.md: a detailed description of the sentencising and fetching of XML articles.
src/: source code for cleaning annotations and generating IOB files.
    metrics/ner_metrics.py: Python script containing SemEval evaluation metrics.
    annotations.py: Python script used to extract annotations from raw Hypothes.is annotations.
    generate_IOB_dataset.py: Python script used to convert JSON format annotations to IOB tagging format.
    generate_json_dataset.py: Python script used to extract annotations to JSON format.
    hypothesis.py: Python script used to fetch raw Hypothes.is annotations.

License

CC BY

Feedback

For any comments, questions, or suggestions, please contact us through helpdesk@europepmc.org or the Europe PMC contact page.
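As an illustration of how the IOB-format annotations described above might be consumed, here is a minimal Python sketch that groups tagged tokens into sentences and collapses B-/I- runs into entity mentions. It is not taken from the corpus's own src/ scripts. It assumes one token and one tag per line separated by whitespace, with blank lines between sentences. The tag names (B-Gene, I-Disease, B-Organism) and the helper names read_iob and extract_entities are hypothetical; the README.md in annotations/ is the authoritative description of the actual layout and labels.

```python
from typing import Iterable, List, Optional, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def read_iob(lines: Iterable[str]) -> List[Sentence]:
    """Group 'token<whitespace>tag' lines into sentences (assumed layout)."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:  # a blank line is assumed to end the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.rsplit(None, 1)  # the tag is the last field
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collapse B-/I- runs into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    etype: Optional[str] = None
    for token, tag in sentence:
        if tag.startswith("B-"):
            if tokens:  # close the entity that was open
                entities.append((" ".join(tokens), etype))
            tokens, etype = [token], tag[2:]
        elif tag.startswith("I-") and tokens and tag[2:] == etype:
            tokens.append(token)  # continue the open entity
        else:
            # "O", or an I- tag with no matching B-, ends any open entity
            if tokens:
                entities.append((" ".join(tokens), etype))
            tokens, etype = [], None
    if tokens:
        entities.append((" ".join(tokens), etype))
    return entities


if __name__ == "__main__":
    # Hypothetical sample; the tag names are illustrative only.
    sample = [
        "BRCA1\tB-Gene", "mutations\tO", "",
        "breast\tB-Disease", "cancer\tI-Disease", "in\tO", "humans\tB-Organism",
    ]
    for sent in read_iob(sample):
        print(extract_entities(sent))
```

To apply the sketch to a real file, pass an open file handle from train/, dev/ or test/ to read_iob, after checking the column layout against the README.md.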
Created:
2023-05-25