
qanastek/HoC

Source: Hugging Face. Updated 2022-11-01; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/qanastek/HoC
Resource description:
---
annotations_creators:
- machine-generated
- expert-generated
language_creators:
- found
language:
- en
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- multi-class-classification
pretty_name: HoC
language_bcp47:
- en-US
---

# HoC : Hallmarks of Cancer Corpus

## Table of Contents

- [Dataset Card for [Needs More Information]](#dataset-card-for-needs-more-information)
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
- [Who are the source language producers?](#who-are-the-source-language-producers)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [No Warranty](#no-warranty)
- [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** https://s-baker.net/resource/hoc/
- **Repository:** https://github.com/sb895/Hallmarks-of-Cancer
- **Paper:** https://academic.oup.com/bioinformatics/article/32/3/432/1743783
- **Leaderboard:** https://paperswithcode.com/dataset/hoc-1
- **Point of Contact:** [Yanis Labrak](mailto:yanis.labrak@univ-avignon.fr)

### Dataset Summary

The Hallmarks of Cancer Corpus for text classification.

The Hallmarks of Cancer (HoC) Corpus consists of 1852 PubMed publication abstracts manually annotated by experts according to a taxonomy. The taxonomy consists of 37 classes in a hierarchy. Zero or more class labels are assigned to each sentence in the corpus. The labels are found under the "labels" directory, while the tokenized text can be found under the "text" directory. The filenames are the corresponding PubMed IDs (PMID).

In addition to the HoC corpus, we also have the [Cancer Hallmarks Analytics Tool](http://chat.lionproject.net/), which classifies all of PubMed according to the HoC taxonomy.

### Supported Tasks and Leaderboards

The dataset can be used to train a model for `multi-class-classification`. Because each sentence can carry zero or more labels, the task is multi-label in practice.

### Languages

The corpus consists only of English-language PubMed articles:

- `English - United States (en-US)`

## Load the dataset with Hugging Face

```python
from datasets import load_dataset

# Download the corpus from the Hugging Face Hub
dataset = load_dataset("qanastek/HoC")

# Pick the validation split and inspect its first record
validation = dataset["validation"]
print("First element of the validation set : ", validation[0])
```

## Dataset Structure

### Data Instances

```json
{
    "document_id": "12634122_5",
    "text": "Genes that were overexpressed in OM3 included oncogenes , cell cycle regulators , and those involved in signal transduction , whereas genes for DNA repair enzymes and inhibitors of transformation and metastasis were suppressed .",
    "label": [9, 5, 0, 6]
}
```

### Data Fields

`document_id`: Unique identifier of the document.

`text`: Raw text taken from the PubMed abstracts (a single sentence in the instance above).

`label`: List of class indices for the hallmarks of cancer assigned to the text. There are 10 currently known hallmarks.

| Hallmark | Search term |
|:-------------------------------------------:|:-------------------------------------------:|
| 1. Sustaining proliferative signaling (PS) | Proliferation Receptor Cancer |
| | 'Growth factor' Cancer |
| | 'Cell cycle' Cancer |
| 2. Evading growth suppressors (GS) | 'Cell cycle' Cancer |
| | 'Contact inhibition' |
| 3. Resisting cell death (CD) | Apoptosis Cancer |
| | Necrosis Cancer |
| | Autophagy Cancer |
| 4. Enabling replicative immortality (RI) | Senescence Cancer |
| | Immortalization Cancer |
| 5. Inducing angiogenesis (A) | Angiogenesis Cancer |
| | 'Angiogenic factor' |
| 6. Activating invasion & metastasis (IM) | Metastasis Invasion Cancer |
| 7. Genome instability & mutation (GI) | Mutation Cancer |
| | 'DNA repair' Cancer |
| | Adducts Cancer |
| | 'Strand breaks' Cancer |
| | 'DNA damage' Cancer |
| 8. Tumor-promoting inflammation (TPI) | Inflammation Cancer |
| | 'Oxidative stress' Cancer |
| | Inflammation 'Immune response' Cancer |
| 9. Deregulating cellular energetics (CE) | Glycolysis Cancer; 'Warburg effect' Cancer |
| 10. Avoiding immune destruction (ID) | 'Immune system' Cancer |
| | Immunosuppression Cancer |

### Data Splits

Distribution of data for the 10 hallmarks:

| **Hallmark** | **No. abstracts** | **No. sentences** |
|:------------:|:-----------------:|:-----------------:|
| 1. PS | 462 | 993 |
| 2. GS | 242 | 468 |
| 3. CD | 430 | 883 |
| 4. RI | 115 | 295 |
| 5. A | 143 | 357 |
| 6. IM | 291 | 667 |
| 7. GI | 333 | 771 |
| 8. TPI | 194 | 437 |
| 9. CE | 105 | 213 |
| 10. ID | 108 | 226 |

## Dataset Creation

### Source Data

#### Who are the source language producers?

The corpus was produced and uploaded by Simon Baker, Ilona Silins, Yufan Guo, Imran Ali, Johan Högberg, Ulla Stenius, and Anna Korhonen.

### Personal and Sensitive Information

The corpus is free of personal or sensitive information.

## Additional Information

### Dataset Curators

__HoC__: Simon Baker, Ilona Silins, Yufan Guo, Imran Ali, Johan Högberg, Ulla Stenius, and Anna Korhonen

__Hugging Face__: Yanis Labrak (not affiliated with the original corpus)

### Licensing Information

```plain
GNU General Public License v3.0
```

```plain
Permissions
- Commercial use
- Modification
- Distribution
- Patent use
- Private use

Limitations
- Liability
- Warranty

Conditions
- License and copyright notice
- State changes
- Disclose source
- Same license
```

### Citation Information

We would very much appreciate it if you cite our publications:

[Automatic semantic classification of scientific literature according to the hallmarks of cancer](https://academic.oup.com/bioinformatics/article/32/3/432/1743783)

```bibtex
@article{baker2015automatic,
  title={Automatic semantic classification of scientific literature according to the hallmarks of cancer},
  author={Baker, Simon and Silins, Ilona and Guo, Yufan and Ali, Imran and H{\"o}gberg, Johan and Stenius, Ulla and Korhonen, Anna},
  journal={Bioinformatics},
  volume={32},
  number={3},
  pages={432--440},
  year={2015},
  publisher={Oxford University Press}
}
```

[Cancer Hallmarks Analytics Tool (CHAT): a text mining approach to organize and evaluate scientific literature on cancer](https://www.repository.cam.ac.uk/bitstream/handle/1810/265268/btx454.pdf?sequence=8&isAllowed=y)

```bibtex
@article{baker2017cancer,
  title={Cancer Hallmarks Analytics Tool (CHAT): a text mining approach to organize and evaluate scientific literature on cancer},
  author={Baker, Simon and Ali, Imran and Silins, Ilona and Pyysalo, Sampo and Guo, Yufan and H{\"o}gberg, Johan and Stenius, Ulla and Korhonen, Anna},
  journal={Bioinformatics},
  volume={33},
  number={24},
  pages={3973--3981},
  year={2017},
  publisher={Oxford University Press}
}
```

[Cancer hallmark text classification using convolutional neural networks](https://www.repository.cam.ac.uk/bitstream/handle/1810/270037/BIOTXTM2016.pdf?sequence=1&isAllowed=y)

```bibtex
@article{baker2016cancer,
  title={Cancer hallmark text classification using convolutional neural networks},
  author={Baker, Simon and Korhonen, Anna-Leena and Pyysalo, Sampo},
  year={2016}
}
```

[Initializing neural networks for hierarchical multi-label text classification](http://www.aclweb.org/anthology/W17-2339)

```bibtex
@article{baker2017initializing,
  title={Initializing neural networks for hierarchical multi-label text classification},
  author={Baker, Simon and Korhonen, Anna},
  journal={BioNLP 2017},
  pages={307--315},
  year={2017}
}
```

Provider:
qanastek

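Collected summary
Dataset introduction
Construction
The HoC dataset is built from 1852 abstracts of cancer research papers published on PubMed, which experts annotated manually against a hierarchical taxonomy of 37 classes. Each sentence is assigned zero or more labels tied to the ten hallmarks of cancer, which makes the corpus suitable for text classification.
Characteristics
The dataset is domain-specific, expert-annotated, and multi-label. All of the text comes from PubMed abstracts on cancer research, the labels were assigned manually by experts rather than derived automatically, and a single sentence can belong to several hallmark classes at once.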
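Because a sentence can carry several hallmark labels at once, a common preprocessing step is to turn each `label` list into a fixed-length 0/1 vector. The sketch below illustrates that step on the sample record from the dataset card. The class count of 10 is an assumption taken from the ten hallmarks the card lists; the actual number of label indices should be checked against the dataset's own metadata.

```python
from typing import List

# Assumption: 10 label indices (0-9), one per hallmark listed in the card.
# Verify against the dataset's label metadata before relying on this value.
NUM_CLASSES = 10


def to_multi_hot(labels: List[int], num_classes: int = NUM_CLASSES) -> List[int]:
    """Turn a list of class indices into a fixed-length 0/1 vector."""
    vector = [0] * num_classes
    for index in labels:
        if not 0 <= index < num_classes:
            raise ValueError(f"label index {index} is outside 0..{num_classes - 1}")
        vector[index] = 1
    return vector


# The `label` field of the sample record shown in the dataset card.
example_labels = [9, 5, 0, 6]
encoded = to_multi_hot(example_labels)
print(encoded)  # [1, 0, 0, 0, 0, 1, 1, 0, 0, 1]
```

A sentence with no labels maps to an all-zero vector, which keeps unlabeled sentences in the same representation as labeled ones.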
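Usage
To use the HoC dataset, first load it with the `load_dataset` function from the Hugging Face `datasets` library. Once it is loaded, choose the training, validation, or test split for model training or evaluation as needed. Each record contains a document ID, a text field, and a label field, in the JSON-like structure shown in the dataset card, which keeps preprocessing and model building simple.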
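Once records are loaded, it can be useful to regroup sentences by the abstract they came from. The sketch below does this from `document_id` alone. It assumes that an ID such as `12634122_5` is a PMID followed by a sentence position; the card does not state this explicitly, so treat the parsing rule as a guess. Apart from the first record, the sample records are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_document_id(document_id: str) -> Tuple[str, int]:
    """Split an ID such as "12634122_5" into ("12634122", 5).

    Assumption: the part before the last underscore is the PMID and the
    part after it is the sentence position within the abstract.
    """
    pmid, _, position = document_id.rpartition("_")
    return pmid, int(position)


def group_by_abstract(records: List[dict]) -> Dict[str, List[dict]]:
    """Group sentence records by PMID, ordered by sentence position."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        pmid, _ = split_document_id(record["document_id"])
        grouped[pmid].append(record)
    for sentences in grouped.values():
        sentences.sort(key=lambda r: split_document_id(r["document_id"])[1])
    return dict(grouped)


records = [
    # First record: IDs and labels from the dataset card (text shortened).
    {"document_id": "12634122_5", "text": "Genes that were overexpressed ...", "label": [9, 5, 0, 6]},
    # Hypothetical records, invented only to exercise the grouping.
    {"document_id": "12634122_2", "text": "Hypothetical earlier sentence .", "label": []},
    {"document_id": "00000001_0", "text": "Hypothetical other abstract .", "label": [3]},
]

by_abstract = group_by_abstract(records)
for pmid, sentences in by_abstract.items():
    print(pmid, [s["document_id"] for s in sentences])
```

Grouping by abstract matters when building train and test partitions by hand, since sentences from one abstract are best kept on the same side of the split.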
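Background and challenges
Background
The Hallmarks of Cancer Corpus (HoC) is a text classification dataset created by Simon Baker and colleagues for automatic semantic classification of scientific literature according to the ten hallmarks of cancer. Created in 2015, it contains 1852 PubMed publication abstracts that experts annotated manually against a hierarchical taxonomy of 37 classes. The corpus has been influential in cancer research, giving researchers an efficient way to organize and evaluate the scientific literature on cancer.
Current challenges
The main challenges in building the dataset were: 1) obtaining accurate expert annotation for a large number of PubMed abstracts, so that classification quality is reliable; 2) handling the multi-label nature of the text, since a sentence may correspond to several hallmarks; 3) building neural network models that can handle hierarchical multi-label classification. At the application level, the challenge is how to integrate the dataset into real research to address domain problems, for example improving the accuracy of cancer diagnosis and the prediction of treatment outcomes.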
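The multi-label issue mentioned above also affects evaluation, because a prediction can be partly right. One common choice is micro-averaged F1, which pools true positives, false positives, and false negatives over all sentences and classes. The sketch below is a plain-Python illustration of that metric; it is not claimed to be the evaluation used by the corpus authors, and the gold and predicted label sets are invented apart from the first gold set, which reuses the sample record's labels.

```python
from typing import List, Set


def micro_f1(gold: List[Set[int]], predicted: List[Set[int]]) -> float:
    """Micro-averaged F1 over per-sentence label sets."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    true_pos = false_pos = false_neg = 0
    for gold_labels, pred_labels in zip(gold, predicted):
        true_pos += len(gold_labels & pred_labels)   # labels found correctly
        false_pos += len(pred_labels - gold_labels)  # labels predicted in error
        false_neg += len(gold_labels - pred_labels)  # labels that were missed
    denominator = 2 * true_pos + false_pos + false_neg
    # With no labels and no predictions at all, report 0.0 by convention here.
    return 2 * true_pos / denominator if denominator else 0.0


# First gold set: the sample record's labels. Everything else is hypothetical.
gold = [{9, 5, 0, 6}, set(), {2}]
predicted = [{9, 5}, {3}, {2}]

print(round(micro_f1(gold, predicted), 4))  # 0.6667
```

Micro-averaging weights frequent hallmarks more heavily; macro-averaging per class would be the alternative when rare hallmarks such as CE or ID matter as much as common ones.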
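Common scenarios
Typical use
Within text classification, the typical use of the HoC dataset is automatic semantic classification of scientific literature, specifically multi-label classification of PubMed abstracts by cancer hallmark. The dataset lets researchers link abstracts to the ten hallmarks of cancer, which supports fast retrieval and analysis of related studies.
Derived work
Work derived from the HoC dataset includes new text classification algorithms, analytics tools for the cancer hallmarks, and studies of neural network initialization strategies. This work has advanced the use of text classification in biomedicine and has given cancer research new methods and tools.
Recent research on the dataset
Latest research directions
In cancer research, the HoC dataset's hierarchical multi-label structure has made it an important resource for automatic semantic classification of scholarly literature. Recent work concentrates on deep learning models, such as convolutional neural networks (CNNs) for text classification, and on neural network initialization strategies that raise classification accuracy. Research also looks at how text mining methods, such as the Cancer Hallmarks Analytics Tool (CHAT), can organize and evaluate the scientific literature on cancer. These studies deepen the understanding of the cancer hallmarks and contribute methods and tools to bioinformatics and medical text mining.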
The content above was collected and summarized by 遇见数据集.