Phenotype Driven Data Augmentation Methods for Transcriptomic Data

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/8383202

Description:

This repository contains the data and associated results of all experiments conducted in our work "Phenotype Driven Data Augmentation Methods for Transcriptomic Data". In this work, we introduce two classes of phenotype driven data augmentation approaches: signature-dependent and signature-independent. The signature-dependent methods are novel, simple, non-parametric data augmentation methods that assume the existence of distinct gene signatures describing a phenotype. The signature-independent methods are modifications of the established Gamma-Poisson and Poisson sampling methods for gene expression data. We benchmark our proposed methods against random oversampling, SMOTE, the unmodified versions of Gamma-Poisson and Poisson sampling, and unaugmented data.

The repository contains the data used for all our experiments. This includes the original data on which augmentation was performed, the cross-validation (CV) split indices as a JSON file, the training and validation data augmented by the various augmentation methods mentioned in our study, and a test set (containing only real samples) and an external test set, standardised with respect to each augmentation method and the training data of each CV split. The compressed files 5x5stratified_{x}percent.zip contain data augmented using x% of the available real data. brca_public.zip contains the data used for the breast cancer experiments. distribution_size_effect.zip contains the data used for tuning the reference set size, a hyperparameter of the modified Poisson and Gamma-Poisson methods.

The compressed file results.zip contains the results of all experiments. This includes the parameter files used to train the various models, the computed metrics (balanced accuracy and AUC-ROC) including p-values, and the latent-space representations of the training, validation and test sets (for the (N)VAE) for all 25 (5x5) CV splits.
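The description refers to 25 (5x5) stratified CV splits whose indices are stored as JSON. As a rough illustration of how such a file could be produced, the following is a minimal sketch, assuming that "5x5" means 5 repeats of 5-fold stratified cross-validation. The function name `repeated_stratified_splits`, the toy labels, and the JSON keys (`repeat`, `fold`, `train`, `valid`) are hypothetical and are not taken from the repository, whose actual file layout may differ.

```python
import json
import random
from collections import defaultdict


def repeated_stratified_splits(labels, n_splits=5, n_repeats=5, seed=0):
    """Return n_repeats * n_splits dicts of train/validation index lists.

    Within each repeat, the samples of every class are shuffled and dealt
    round-robin into n_splits folds, so each fold roughly preserves the
    class proportions of the full label list.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    splits = []
    for repeat in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for label in sorted(by_class):
            members = by_class[label][:]
            rng.shuffle(members)
            for pos, idx in enumerate(members):
                folds[pos % n_splits].append(idx)
        # Each fold serves once as the validation set of this repeat.
        for k in range(n_splits):
            valid = sorted(folds[k])
            train = sorted(
                i for j, fold in enumerate(folds) if j != k for i in fold
            )
            splits.append(
                {"repeat": repeat, "fold": k, "train": train, "valid": valid}
            )
    return splits


# Toy labels (hypothetical): 20 samples of phenotype "A", 10 of phenotype "B".
labels = ["A"] * 20 + ["B"] * 10
splits = repeated_stratified_splits(labels)

# Serialise the indices; this JSON layout is an assumption for illustration.
serialized = json.dumps(splits)
```

Fixing the seed makes the splits reproducible, so augmented training data and the untouched validation indices can be regenerated consistently for each of the 25 splits.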
PLEASE NOTE: If any part of this repository is used in any form for your work, please cite the following, in addition to citing the original data sources (TCGA, CPTAC, GSE20713 and METABRIC) as appropriate:

@article{janakarajan2023signature,
  title={Phenotype Driven Data Augmentation Methods for Transcriptomic Data},
  author={Janakarajan, Nikita and Graziani, Mara and Martinez, Maria Rodriguez},
  journal={bioRxiv},
  pages={2023--10},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}

Created:

2025-03-06
