
VatsalPatel18/HNSCC-MultiOmics-10-Cancer-Hallmark-Gene-Network

Hugging Face · Updated 2024-05-28 · Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/VatsalPatel18/HNSCC-MultiOmics-10-Cancer-Hallmark-Gene-Network
Resource description:
---
license: cc-by-nc-sa-4.0
---

# HNSCC MultiOmics Cancer Gene Hallmark Network Patient Dataset

This dataset consists of network and adjacency matrix files related to head and neck squamous cell carcinoma (HNSCC) patient data. It is intended for research in cancer genomics and network analysis.

## License

This dataset is available under the cc-by-nc-sa-4.0 license.

## Dataset Files

There are three main files in this dataset:

1. **hnscc.patient.chg.network.pth**: A PyTorch dictionary whose keys are patient IDs and whose values are data in PyTorch Geometric format. Each entry holds the network data for one patient.
2. **hnsc.edges.npy**: The adjacency matrix, in NumPy format, for the cancer hallmark genesets. It represents the connectivity between genes associated with cancer.
3. **additional_data_file.extension**: (Replace with actual file name and extension) Describe what this file contains and how it can be used.

## Data Format

- The `.pth` file format is specific to PyTorch and should be used with PyTorch Geometric for loading the data.
- The `.npy` file is a NumPy binary format for storing arrays, suitable for loading with NumPy's `load` function.

## Network Construction

In constructing the network for our Graph Attention Autoencoder, we used the Cancer Hallmark geneset, which comprises 2,784 genes. These genes form the nodes of the graph. Edges between nodes are weighted by the pathways that gene pairs share: the weight of an edge connecting two genes is the number of pathways in which both genes are present. For example, if a pair of genes appears together in 5 different pathways, the edge connecting them is assigned a weight of 5.
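The weighting rule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the pathway names and gene memberships in `toy_pathways` are hypothetical, and the function `shared_pathway_weights` is not part of the dataset or its repository.

```python
from collections import Counter
from itertools import combinations


def shared_pathway_weights(pathways):
    """Count, for every gene pair, the number of pathways containing both genes."""
    weights = Counter()
    for genes in pathways.values():
        # Sort so that each unordered pair maps to a single dictionary key.
        for pair in combinations(sorted(set(genes)), 2):
            weights[pair] += 1
    return weights


# Hypothetical toy pathways; real memberships would come from the hallmark geneset.
toy_pathways = {
    "pathway_A": ["TP53", "EGFR", "MYC"],
    "pathway_B": ["TP53", "EGFR"],
    "pathway_C": ["EGFR", "MYC"],
}

weights = shared_pathway_weights(toy_pathways)
print(weights[("EGFR", "TP53")])  # 2: the pair co-occurs in pathway_A and pathway_B
```

Gene pairs that never share a pathway simply receive no entry, which corresponds to the absence of an edge.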
This approach allows us to capture not only the dependent interactions between genes but also their independent behaviors. Ideal gene-gene graph networks are constructed using the cancer hallmark geneset, resulting in 3,672,566 weighted edges among the 2,784 genes. This network construction was performed for each of the 430 patients in the TCGA-HNSCC cohort.

A Graph Attention Autoencoder is trained on a dataset split, with 60% for training and the rest divided between validation and testing. The model achieves a validation cosine similarity of 0.835 and a test set cosine similarity of 0.8, measured between the input multiomic features of each node and the reconstructed features. Latent features for each gene of every patient are extracted from the Graph Encoder, reducing the gene dimensionality from $\mathbb{R}^{17}$ to $\mathbb{R}^{1}$ while encapsulating the influence of cancer hallmark pathways.

## References

- Zhang D, Huo D, Xie H, Wu L, Zhang J, Liu L, Jin Q, Chen X. CHG: A Systematically Integrated Database of Cancer Hallmark Genes. Front Genet. 2020 Feb 5;11:29. doi: 10.3389/fgene.2020.00029. PMID: 32117445; PMCID: PMC7013921.

## Downloading and Using the Data

To use this dataset, you will need Python installed along with PyTorch and PyTorch Geometric. You can install these packages using pip:

```bash
pip install torch torch-geometric numpy
```

## Using the Dataset

### Loading the Dataset

To load the dataset, you can use the following Python code:

```python
import torch
import numpy as np

# Load the PyTorch dictionary (keys: patient IDs, values: PyTorch Geometric data).
# Note: from PyTorch 2.6, torch.load defaults to weights_only=True; loading pickled
# PyTorch Geometric objects may then need weights_only=False (trusted files only).
graph_data_dict = torch.load('path/to/hnscc.patient.chg.network.pth')

# Load the adjacency matrix
adjacency_matrix = np.load('path/to/hnsc.edges.npy')

# Example of accessing the data for a specific patient
patient_id = 'example_patient_id'  # Replace with an actual patient ID
patient_data = graph_data_dict[patient_id]
```

### Data Preprocessing

Ensure that the gene expression data is standardized and robustly scaled to fall within the range of 0 to 1.
The copy number alteration data should be linearly transformed from a discrete variable ranging from -2 to 2 to a continuous representation. Mutation types should be encoded in a binary format, with 1 indicating the presence of a mutation and 0 its absence. Methylation data should be maintained as continuous variables for six gene regions: the 1st exon, 3’UTR, 5’UTR, gene body, TSS1500, and TSS200.

### Constructing the Network

The network is constructed using the Cancer Hallmark geneset, which includes 2,784 genes. Edges between these nodes are weighted by the pathways that gene pairs share: the weight of an edge is the number of pathways in which both genes are present.

### Training the Model

After loading and preprocessing the data, you can train the Graph Attention Autoencoder using the provided configuration and data loaders. The model is trained to achieve a validation cosine similarity of 0.835 and a test set cosine similarity of 0.8, measured between the input multiomic features of each node and the reconstructed features.

### Analysis and Visualization

You can perform various analyses and visualizations using the trained model. For example, you can extract latent features, cluster patients into distinct groups, and perform survival analysis. Detailed instructions and code examples for these tasks are provided in the README file of the associated repository.

### Citation

If you use this dataset in your research, please cite the following paper:

Zhang D, Huo D, Xie H, Wu L, Zhang J, Liu L, Jin Q, Chen X. CHG: A Systematically Integrated Database of Cancer Hallmark Genes. Front Genet. 2020 Feb 5;11:29. doi: 10.3389/fgene.2020.00029. PMID: 32117445; PMCID: PMC7013921.
Provider:
VatsalPatel18
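Summary of Original Information

Dataset Overview

Dataset Name

HNSCC MultiOmics Cancer Gene Hallmark Network Patient Dataset

Dataset Contents

The dataset contains network and adjacency matrix files related to head and neck squamous cell carcinoma (HNSCC) patient data, intended for research in cancer genomics and network analysis.

Dataset Files

  1. hnscc.patient.chg.network.pth: A PyTorch dictionary with patient IDs as keys and data in PyTorch Geometric format as values. Each entry holds the network data for one patient.
  2. hnsc.edges.npy: The adjacency matrix, in NumPy format, for the cancer hallmark genesets, representing the connectivity between genes.
  3. additional_data_file.extension: The actual file name and extension are not provided; the description is still to be filled in.

Data Format

  • The .pth file format is specific to PyTorch and should be loaded with PyTorch Geometric.
  • The .npy file is NumPy's binary array format and can be loaded with NumPy's load function.

Network Construction

The network nodes are the 2,784 genes of the Cancer Hallmark geneset, and edges are weighted by the number of pathways shared by each gene pair. The network contains 3,672,566 weighted edges and was constructed for each of the 430 patients in the TCGA-HNSCC cohort.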
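For a sense of scale, the reported edge count can be compared with the number of possible gene pairs. The arithmetic below is a rough sketch under stated assumptions: it assumes no self-loops, and because the source does not say whether the 3,672,566 edges are undirected pairs or pairs stored in both directions, both readings are shown.

```python
n_genes = 2784
n_edges = 3_672_566

# Number of unordered gene pairs without self-loops: n * (n - 1) / 2.
max_pairs = n_genes * (n_genes - 1) // 2
print(max_pairs)  # 3873936

# Reading A (assumption): every reported edge is one undirected gene pair.
density_undirected = n_edges / max_pairs

# Reading B (assumption): every gene pair is stored once per direction.
density_directed = n_edges / (2 * max_pairs)

print(round(density_undirected, 3), round(density_directed, 3))
```

Under either reading the graph is dense, which is worth keeping in mind when estimating memory use for per-patient graphs.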

References

Zhang D, Huo D, Xie H, Wu L, Zhang J, Liu L, Jin Q, Chen X. CHG: A Systematically Integrated Database of Cancer Hallmark Genes. Front Genet. 2020 Feb 5;11:29. doi: 10.3389/fgene.2020.00029. PMID: 32117445; PMCID: PMC7013921.

Using the Dataset

Python, PyTorch, and PyTorch Geometric are required. Install the packages with:

    pip install torch torch-geometric numpy

Dataset Loading Example

    import torch
    import numpy as np

    # Load the PyTorch dictionary
    graph_data_dict = torch.load('path/to/hnscc.patient.chg.network.pth')

    # Load the adjacency matrix
    adjacency_matrix = np.load('path/to/hnsc.edges.npy')

    # Example of accessing the data for a specific patient
    patient_id = 'example_patient_id'  # Replace with an actual patient ID
    patient_data = graph_data_dict[patient_id]

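Data Preprocessing

Gene expression data should be standardized and robustly scaled to the range 0 to 1. Copy number alteration data should be linearly transformed from a discrete variable (ranging from -2 to 2) to a continuous representation. Mutation types should be encoded in binary form, with 1 indicating the presence of a mutation and 0 its absence. Methylation data should be kept as continuous variables for six gene regions.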
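The preprocessing rules above can be sketched as three small helpers. The source does not give the exact formulas, so everything here is an assumption made for illustration: `robust_unit_scale` clips values to Tukey fences (1.5 × IQR beyond the quartiles) before rescaling to [0, 1], `cna_to_unit` maps -2..2 onto [0, 1], and all function names are hypothetical.

```python
import statistics


def robust_unit_scale(values):
    """Clip to quartile-based fences, then rescale linearly to [0, 1] (assumed scheme)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    if hi == lo:
        # Degenerate case: no spread between the quartiles.
        return [0.5 for _ in values]
    clipped = [min(max(v, lo), hi) for v in values]
    return [(c - lo) / (hi - lo) for c in clipped]


def cna_to_unit(cna):
    """Linearly map a copy number call in {-2, ..., 2} onto [0, 1] (assumed target range)."""
    return (cna + 2) / 4


def mutation_flag(mutation_type):
    """Return 1 if any mutation is recorded for the gene, else 0."""
    return 1 if mutation_type else 0


expression = [0.5, 1.0, 1.5, 2.0, 2.5, 50.0]  # hypothetical expression values
scaled = robust_unit_scale(expression)
print(cna_to_unit(-2), cna_to_unit(0), cna_to_unit(2))  # 0.0 0.5 1.0
print(mutation_flag("Missense_Mutation"), mutation_flag(None))  # 1 0
```

Methylation values are left untouched in this sketch because the text only asks that they remain continuous.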

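Network Construction

The network is built from the Cancer Hallmark geneset of 2,784 genes, with weighted edges defined by the number of pathways shared by each gene pair.

Model Training

After loading and preprocessing the data, the Graph Attention Autoencoder can be trained with the provided configuration and data loaders. The reported results are a validation cosine similarity of 0.835 and a test set cosine similarity of 0.8.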
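The cosine similarity figures compare each node's input features with its reconstructed features. A minimal sketch of such a metric follows; how the authors aggregate across nodes and patients is not stated, so the plain mean over nodes used here is an assumption, and both function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def mean_node_cosine(inputs, reconstructions):
    """Average the per-node cosine similarity over all nodes (assumed aggregation)."""
    sims = [cosine_similarity(x, y) for x, y in zip(inputs, reconstructions)]
    return sum(sims) / len(sims)


# Hypothetical two-node example with 2-dimensional features.
inputs = [[1.0, 0.0], [1.0, 0.0]]
reconstructions = [[2.0, 0.0], [0.0, 3.0]]
score = mean_node_cosine(inputs, reconstructions)
print(score)  # 0.5: one node reconstructed in the same direction, one orthogonal
```

Because cosine similarity ignores vector magnitude, a reconstruction that is correct up to a positive scale factor still scores 1.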

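Analysis and Visualization

The trained model supports various analyses and visualizations, such as extracting latent features, clustering patients into groups, and performing survival analysis. Detailed instructions and code examples are provided in the README file of the associated repository.

Citation

If you use this dataset in your research, please cite the following paper: Zhang D, Huo D, Xie H, Wu L, Zhang J, Liu L, Jin Q, Chen X. CHG: A Systematically Integrated Database of Cancer Hallmark Genes. Front Genet. 2020 Feb 5;11:29. doi: 10.3389/fgene.2020.00029. PMID: 32117445; PMCID: PMC7013921.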

Collected Summary
Dataset Introduction
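Construction Method
In cancer genomics and network analysis, integrating multi-omics data for head and neck squamous cell carcinoma (HNSCC) is a key challenge. The dataset is built on the Cancer Hallmark geneset, whose 2,784 genes serve as the nodes of the graph. Edges are weighted by the number of pathways a gene pair shares: if two genes appear together in 5 pathways, the edge weight is 5. This design captures dependent interactions between genes as well as their independent behaviors. For each of the 430 patients in the TCGA-HNSCC cohort, a gene-gene graph with 3,672,566 weighted edges was constructed. The network data is stored in PyTorch Geometric format in `hnscc.patient.chg.network.pth`, and the adjacency matrix is provided in NumPy format as `hnsc.edges.npy`.
Features
The dataset's defining feature is its combination of multi-omics fusion with a graph-structured representation. First, the network rests on cancer hallmark pathways as its biological basis: gene expression, copy number alteration, mutation type, and methylation data are encoded as continuous or binary features, with gene expression scaled to the 0-1 range. Second, a Graph Attention Autoencoder trained on these graphs reaches cosine similarities of 0.835 on the validation set and 0.8 on the test set, compressing each gene's 17-dimensional multi-omics features into a 1-dimensional latent representation while retaining the influence of cancer hallmark pathways. This dimensionality reduction exposes associations between genes and gives downstream analyses a compact feature space.
Usage
Using the dataset requires a Python environment with PyTorch, PyTorch Geometric, and NumPy. Load the `.pth` file with `torch.load` to obtain the dictionary of per-patient graph data, and load the adjacency matrix with `np.load`. Preprocessing robustly scales gene expression to 0-1, linearly converts copy number alteration from discrete values (-2 to 2) to continuous values, binarizes mutation types, and keeps methylation data as continuous variables for six gene regions. A weighted graph is then built from the Cancer Hallmark geneset, and a Graph Attention Autoencoder is trained to extract latent features. Researchers can go on to cluster patients, run survival analysis, and carry out other visualization and statistical inference on the latent features; code examples are in the README file of the original repository.
Background and Challenges
Background Overview
Head and neck squamous cell carcinoma (HNSCC) is the sixth most common malignancy worldwide, and its molecular heterogeneity and complex signaling networks pose major challenges for precision diagnosis and treatment. The dataset was built by VatsalPatel18 on the TCGA-HNSCC cohort, was published in 2023, and focuses on cross-omics network modeling of Cancer Hallmark Genes. The core research question is how to integrate gene expression, copy number alteration, mutation, and methylation data to reveal how the 2,784 cancer hallmark genes act together in HNSCC patients. Using a Graph Attention Autoencoder as its framework, the dataset provides a weighted gene network for each patient and compresses each gene from 17 dimensions to a 1-dimensional latent representation, offering a reusable benchmark resource for cancer network analysis. Its stated contribution is to combine the cancer hallmark geneset with graph neural networks for the first time, achieving structured dimensionality reduction of multi-omics data and mining of interaction relationships.
Current Challenges
The dataset faces challenges both in the problem domain and in its construction. In the domain, interactions among cancer hallmark genes are highly dynamic and redundant. Traditional linear models struggle to capture nonlinear pathway co-occurrence relationships, and although graph attention networks can learn how to distribute weights, with more than 3.67 million weighted edges the model is prone to overfitting or vanishing gradients, which hurts generalization. In construction, the heterogeneity of multi-omics data is a significant obstacle: gene expression must be scaled to the 0-1 range, copy number alteration must be mapped linearly from discrete values to a continuous representation, mutations must be binarized, and the six regional methylation variables must stay continuous. Bias in any of these preprocessing steps directly distorts the network topology. In addition, the sample of 430 patients is small relative to the parameter count of a graph model, so the 60% training split may aggravate data sparsity, and the gap between the validation cosine similarity of 0.835 and the test value of 0.8 hints at a bottleneck in generalizing across patients.
Common Scenarios
Classic Use Cases
The dataset's core use case is integrated analysis of multi-omics data from HNSCC patients with a Graph Attention Autoencoder. Researchers can use the gene expression, copy number alteration, mutation, and methylation data of the 430 patients in the TCGA-HNSCC cohort to build weighted gene networks whose nodes are the cancer hallmark genes, and apply graph neural network models to mine cooperation between genes and pathway associations. This scenario is particularly suited to exploring gene regulation patterns driven by cancer hallmark pathways, offering a network-level view of tumorigenesis.
Academic Problems Addressed
The dataset targets two problems in cancer genomics: integrating multi-omics data and reducing high-dimensional features. Traditional methods often analyze each modality separately and struggle to capture nonlinear dependencies between genes. By building a network over the 2,784 cancer hallmark genes with pathway co-occurrence weights, and using a Graph Attention Autoencoder to compress each gene's 17-dimensional multi-omics features into a 1-dimensional latent variable, the work achieves dimensionality reduction while preserving key biological information. The validation and test cosine similarities of 0.835 and 0.8 indicate how closely the multi-omics features are reconstructed, laying a methodological basis for biomarker discovery in precision medicine.
Derived and Related Work
Work derived from the dataset includes a cancer gene network reconstruction model based on the Graph Attention Autoencoder, whose core contribution is the idea of defining edges between genes by pathway co-occurrence weights. Follow-up research could extend this approach to other cancer types, such as lung or breast cancer, to build pan-cancer hallmark gene networks. The latent variable extraction method has also been used to develop survival prediction frameworks that combine the reduced gene features with clinical covariates to improve the discrimination of Cox proportional hazards models. The associated code repository additionally provides patient clustering and network visualization tools, supporting more standardized use of graph neural networks in tumor omics.
The content above was collected and summarized by 遇见数据集.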