---
annotations_creators:
- machine-generated
- expert-generated
language_creators:
- found
language:
- en
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- multi-class-classification
pretty_name: HoC
language_bcp47:
- en-US
---
# HoC: Hallmarks of Cancer Corpus
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Load the dataset with HuggingFace](#load-the-dataset-with-huggingface)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Source Data](#source-data)
    - [Who are the source language producers?](#who-are-the-source-language-producers)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
## Dataset Description
- **Homepage:** https://s-baker.net/resource/hoc/
- **Repository:** https://github.com/sb895/Hallmarks-of-Cancer
- **Paper:** https://academic.oup.com/bioinformatics/article/32/3/432/1743783
- **Leaderboard:** https://paperswithcode.com/dataset/hoc-1
- **Point of Contact:** [Yanis Labrak](mailto:yanis.labrak@univ-avignon.fr)
### Dataset Summary
The Hallmarks of Cancer (HoC) Corpus is a text classification dataset. It consists of 1,852 PubMed abstracts manually annotated by experts according to a hierarchical taxonomy of 37 classes. Each sentence in the corpus is assigned zero or more class labels. In the original distribution, the labels are stored under the "labels" directory and the tokenized text under the "text" directory; the filenames are the corresponding PubMed IDs (PMIDs).

In addition to the HoC corpus, the corpus authors also provide the [Cancer Hallmarks Analytics Tool](http://chat.lionproject.net/), which classifies all of PubMed according to the HoC taxonomy.
### Supported Tasks and Leaderboards
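The dataset can be used to train a text classification model. The card metadata tags the task as `multi-class-classification`, but because each sentence carries zero or more labels, the task is in practice a multi-label classification problem.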
### Languages
The corpus consists only of English-language PubMed abstracts:
- `English - United States (en-US)`
## Load the dataset with HuggingFace
```python
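from datasets import load_dataset

# Download the corpus from the Hugging Face Hub (requires network access)
dataset = load_dataset("qanastek/HoC")

# Select the validation split and print its first example
validation = dataset["validation"]
print("First element of the validation set:", validation[0])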
```
## Dataset Structure
### Data Instances
```json
{
"document_id": "12634122_5",
"text": "Genes that were overexpressed in OM3 included oncogenes , cell cycle regulators , and those involved in signal transduction , whereas genes for DNA repair enzymes and inhibitors of transformation and metastasis were suppressed .",
"label": [9, 5, 0, 6]
}
```
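The `document_id` in the instance above looks like a PubMed ID followed by an underscore and a number. The following is a minimal sketch for splitting it, assuming the suffix is the index of the sentence within the abstract. That interpretation is not stated in this card, and `parse_document_id` and `SentenceRef` are hypothetical helpers, not part of the dataset or of any library.

```python
from typing import NamedTuple


class SentenceRef(NamedTuple):
    pmid: str            # PubMed ID of the source abstract
    sentence_index: int  # assumed: position of the sentence in the abstract


def parse_document_id(document_id: str) -> SentenceRef:
    """Split an ID such as "12634122_5" into its two parts.

    Assumes a "<PMID>_<sentence index>" layout (an assumption, see above).
    """
    # rpartition splits on the LAST underscore, so the numeric suffix is isolated
    pmid, _, index = document_id.rpartition("_")
    if not pmid or not index.isdigit():
        raise ValueError(f"unexpected document_id format: {document_id!r}")
    return SentenceRef(pmid=pmid, sentence_index=int(index))


ref = parse_document_id("12634122_5")
print(ref.pmid, ref.sentence_index)
```

Grouping examples by `pmid` would then allow sentences from the same abstract to be kept together, for instance when building custom splits.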
### Data Fields
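- `document_id`: Unique identifier of the instance.
- `text`: Tokenized text of a sentence taken from a PubMed abstract.
- `label`: List of integer class IDs, each referring to one of the 10 currently known hallmarks of cancer. A sentence can carry several labels.

The hallmarks and their associated search terms:
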
| Hallmark | Search term |
|:-------------------------------------------:|:-------------------------------------------:|
| 1. Sustaining proliferative signaling (PS) | Proliferation Receptor Cancer |
| | 'Growth factor' Cancer |
| | 'Cell cycle' Cancer |
| 2. Evading growth suppressors (GS) | 'Cell cycle' Cancer |
| | 'Contact inhibition' |
| 3. Resisting cell death (CD) | Apoptosis Cancer |
| | Necrosis Cancer |
| | Autophagy Cancer |
| 4. Enabling replicative immortality (RI) | Senescence Cancer |
| | Immortalization Cancer |
| 5. Inducing angiogenesis (A) | Angiogenesis Cancer |
| | 'Angiogenic factor' |
| 6. Activating invasion & metastasis (IM) | Metastasis Invasion Cancer |
| 7. Genome instability & mutation (GI) | Mutation Cancer |
| | 'DNA repair' Cancer |
| | Adducts Cancer |
| | 'Strand breaks' Cancer |
| | 'DNA damage' Cancer |
| 8. Tumor-promoting inflammation (TPI) | Inflammation Cancer |
| | 'Oxidative stress' Cancer |
| | Inflammation 'Immune response' Cancer |
| 9. Deregulating cellular energetics (CE) | Glycolysis Cancer; 'Warburg effect' Cancer |
| 10. Avoiding immune destruction (ID) | 'Immune system' Cancer |
| | Immunosuppression Cancer |
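Because `label` is a list of class IDs, most multi-label training setups first convert it to a fixed-length multi-hot vector. The sketch below shows one way to do that. The helper name `to_multi_hot` is hypothetical, and the number of classes (10 here) is an assumption for illustration only; the actual value should be read from the loaded dataset's features.

```python
from typing import List, Sequence


def to_multi_hot(label_ids: Sequence[int], num_classes: int) -> List[int]:
    """Convert a list of class IDs into a 0/1 vector of length num_classes."""
    vector = [0] * num_classes
    for class_id in label_ids:
        if not 0 <= class_id < num_classes:
            raise ValueError(f"class id {class_id} outside [0, {num_classes})")
        vector[class_id] = 1
    return vector


# Label list from the data instance shown in this card.
# num_classes=10 is an assumption, not a value confirmed by the card.
print(to_multi_hot([9, 5, 0, 6], num_classes=10))
# → [1, 0, 0, 0, 0, 1, 1, 0, 0, 1]
```

A sentence with no labels maps to an all-zero vector, which matches the "zero or more labels" annotation scheme.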
### Data Splits
Distribution of data for the 10 hallmarks:
| **Hallmark** | **No. abstracts** | **No. sentences** |
|:------------:|:-----------------:|:-----------------:|
| 1. PS | 462 | 993 |
| 2. GS | 242 | 468 |
| 3. CD | 430 | 883 |
| 4. RI | 115 | 295 |
| 5. A | 143 | 357 |
| 6. IM | 291 | 667 |
| 7. GI | 333 | 771 |
| 8. TPI | 194 | 437 |
| 9. CE | 105 | 213 |
| 10. ID | 108 | 226 |
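A per-label sentence count like the one in the table above can be recomputed from any split by tallying the `label` lists. This is a minimal sketch on toy records; `count_label_ids` is a hypothetical helper, and the `toy_*` records are invented for illustration (only the first record comes from this card).

```python
from collections import Counter
from typing import Iterable, Mapping


def count_label_ids(examples: Iterable[Mapping]) -> Counter:
    """Count how many examples carry each class ID."""
    counts: Counter = Counter()
    for example in examples:
        # set() guards against an ID being listed twice in one example
        counts.update(set(example["label"]))
    return counts


toy_examples = [
    {"document_id": "12634122_5", "label": [9, 5, 0, 6]},  # from this card
    {"document_id": "toy_0", "label": [5]},                # invented
    {"document_id": "toy_1", "label": []},                 # invented, unlabeled
]

counts = count_label_ids(toy_examples)
for class_id, n in sorted(counts.items()):
    print(class_id, n)
```

The same function should work on a loaded split, since iterating over it yields dictionaries with a `label` key.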
## Dataset Creation
### Source Data
#### Who are the source language producers?
The corpus was produced and released by Simon Baker, Ilona Silins, Yufan Guo, Imran Ali, Johan Högberg, Ulla Stenius, and Anna Korhonen.
### Personal and Sensitive Information
The corpus contains no personal or sensitive information.
## Additional Information
### Dataset Curators
__HoC__: Simon Baker, Ilona Silins, Yufan Guo, Imran Ali, Johan Högberg, Ulla Stenius, and Anna Korhonen

__Hugging Face__: Yanis Labrak (not affiliated with the original corpus)
### Licensing Information
```plain
GNU General Public License v3.0
```
```plain
Permissions
- Commercial use
- Modification
- Distribution
- Patent use
- Private use
Limitations
- Liability
- Warranty
Conditions
- License and copyright notice
- State changes
- Disclose source
- Same license
```
### Citation Information
We would very much appreciate it if you cite our publications:
[Automatic semantic classification of scientific literature according to the hallmarks of cancer](https://academic.oup.com/bioinformatics/article/32/3/432/1743783)
```bibtex
@article{baker2015automatic,
title={Automatic semantic classification of scientific literature according to the hallmarks of cancer},
author={Baker, Simon and Silins, Ilona and Guo, Yufan and Ali, Imran and H{\"o}gberg, Johan and Stenius, Ulla and Korhonen, Anna},
journal={Bioinformatics},
volume={32},
number={3},
pages={432--440},
year={2015},
publisher={Oxford University Press}
}
```
[Cancer Hallmarks Analytics Tool (CHAT): a text mining approach to organize and evaluate scientific literature on cancer](https://www.repository.cam.ac.uk/bitstream/handle/1810/265268/btx454.pdf?sequence=8&isAllowed=y)
```bibtex
@article{baker2017cancer,
title={Cancer Hallmarks Analytics Tool (CHAT): a text mining approach to organize and evaluate scientific literature on cancer},
author={Baker, Simon and Ali, Imran and Silins, Ilona and Pyysalo, Sampo and Guo, Yufan and H{\"o}gberg, Johan and Stenius, Ulla and Korhonen, Anna},
journal={Bioinformatics},
volume={33},
number={24},
pages={3973--3981},
year={2017},
publisher={Oxford University Press}
}
```
[Cancer hallmark text classification using convolutional neural networks](https://www.repository.cam.ac.uk/bitstream/handle/1810/270037/BIOTXTM2016.pdf?sequence=1&isAllowed=y)
```bibtex
@article{baker2016cancer,
title={Cancer hallmark text classification using convolutional neural networks},
author={Baker, Simon and Korhonen, Anna-Leena and Pyysalo, Sampo},
year={2016}
}
```
[Initializing neural networks for hierarchical multi-label text classification](http://www.aclweb.org/anthology/W17-2339)
```bibtex
@article{baker2017initializing,
title={Initializing neural networks for hierarchical multi-label text classification},
author={Baker, Simon and Korhonen, Anna},
journal={BioNLP 2017},
pages={307--315},
year={2017}
}
```