Genentech/decima-data

Hugging Face | Updated 2026-02-23 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Genentech/decima-data
Resource description:
---
license: mit
task_categories:
- tabular-regression
tags:
- biology
- genomics
- single-cell
pretty_name: "Decima Dataset"
size_categories:
- 1M<n<10M
---

# decima-data

## Dataset Summary

This dataset contains gene expression predictions and associated genomic features formatted as an `AnnData` object. It is designed for use with the **Decima** framework to support tasks such as gene expression prediction and genomic sequence modeling. The data provides a comprehensive view of expression across various tissues, organs, and disease states, primarily centered on human brain atlas data.

For more details, please refer to the original paper: https://www.biorxiv.org/content/10.1101/2024.10.09.617507v3.

## Dataset Structure

The dataset is an `AnnData` object with dimensions: **8,856 observations (pseudobulks) × 18,457 variables (genes)**.

### Data Fields

**In `.obs` (Observation metadata):**

| Column | Description |
| :--- | :--- |
| `cell_type` | Specific cell type label |
| `tissue` | Tissue of origin |
| `organ` | Organ of origin |
| `disease` | Clinical status or condition (e.g., healthy) |
| `study` | Source study identifier |
| `dataset` | Source dataset identifier |
| `region` | Anatomical region |
| `subregion` | Specific anatomical subregion |
| `celltype_coarse` | Broad cell type classification |
| `n_cells` | Number of cells aggregated into the pseudobulk |
| `total_counts` | Total read count |
| `n_genes` | Number of genes detected |
| `size_factor` | Sum after normalization |
| `train_pearson` | Pearson correlation on training set |
| `val_pearson` | Pearson correlation on validation set |
| `test_pearson` | Pearson correlation on test set |

**In `.var` (Metadata for variables/genes):**

| Column | Description |
| :--- | :--- |
| `chrom` | Chromosome |
| `start` | Genomic start coordinate (hg38) |
| `end` | Genomic end coordinate (hg38) |
| `strand` | Genomic strand (+/-) |
| `gene_type` | Gene biotype (e.g., protein coding) |
| `frac_nan` | Fraction of missing values |
| `mean_counts` | Average expression counts |
| `n_tracks` | Number of pseudobulks expressing the gene |
| `gene_start` | Gene start position |
| `gene_end` | Gene end position |
| `gene_length` | Total length of the gene |
| `gene_mask_start` | Start of the gene mask in the input sequence |
| `gene_mask_end` | End of the gene mask in the input sequence |
| `frac_N` | Fraction of ambiguous bases (N) in the input |
| `fold` | Borzoi fold assignment |
| `dataset` | Split assignment (e.g., train, test) |
| `gene_id` | Ensembl gene identifier |
| `pearson` | Overall Pearson correlation |
| `size_factor_pearson` | Pearson correlation using size factor |
| `ensembl_canonical_tss` | Canonical Transcription Start Site |

### Data Layers

* **`.layers['preds']`**: Predicted values from the Decima model.
* **`.layers['v1_rep0']` through `.layers['v1_rep3']`**: Predictions from four model replicates.

## Usage

```python
import anndata
from huggingface_hub import hf_hub_download

file_path = hf_hub_download(
    repo_id="Genentech/decima-data",
    repo_type="dataset",
    filename="metadata.h5ad"
)

adata = anndata.read_h5ad(file_path)
```
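
The `pearson` column in `.var` and the `preds` layer suggest a per-gene comparison between predicted and measured expression across pseudobulks. The sketch below shows one way such a correlation could be computed with NumPy. It is an illustration, not the Decima implementation: `per_gene_pearson` is a hypothetical helper, and the card does not state that `adata.X` holds the measured values or that `pearson` was computed this way.

```python
import numpy as np


def per_gene_pearson(measured, predicted):
    """Pearson correlation across pseudobulks for each gene (column).

    Hypothetical helper for illustration only. Both inputs are 2-D arrays
    shaped (pseudobulks, genes). Positions where either value is NaN are
    skipped; genes with fewer than two valid pairs or with zero variance
    in either input yield NaN.
    """
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if measured.shape != predicted.shape:
        raise ValueError("measured and predicted must have the same shape")

    # Keep only positions where both matrices have a value.
    valid = ~(np.isnan(measured) | np.isnan(predicted))
    n = valid.sum(axis=0)

    with np.errstate(invalid="ignore", divide="ignore"):
        # Column means over the valid positions only.
        m_mean = np.where(valid, measured, 0.0).sum(axis=0) / n
        p_mean = np.where(valid, predicted, 0.0).sum(axis=0) / n

        # Centered values, with skipped positions contributing zero.
        m_c = np.where(valid, measured - m_mean, 0.0)
        p_c = np.where(valid, predicted - p_mean, 0.0)

        cov = (m_c * p_c).sum(axis=0)
        denom = np.sqrt((m_c ** 2).sum(axis=0) * (p_c ** 2).sum(axis=0))
        r = cov / denom

    r[n < 2] = np.nan
    return r
```

With the loaded object, a call might look like `per_gene_pearson(adata.X, adata.layers["preds"])`, assuming `adata.X` holds measured expression as a dense array with the same shape as the layer; a sparse matrix would first need `.toarray()`.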
Provider: Genentech