Name: nds029/Tahoe-100M
Creator: nds029
Published: 2026-02-01 00:31:46
License: No description available

Download link:

https://hf-mirror.com/datasets/nds029/Tahoe-100M

Resource description:

---
license: cc0-1.0
tags:
- biology
- single-cell
- RNA
- chemistry
size_categories:
- 100M<n<1B
configs:
- config_name: expression_data
  data_files: data/train-*
  default: true
- config_name: sample_metadata
  data_files: metadata/sample_metadata.parquet
- config_name: gene_metadata
  data_files: metadata/gene_metadata.parquet
- config_name: drug_metadata
  data_files: metadata/drug_metadata.parquet
- config_name: cell_line_metadata
  data_files: metadata/cell_line_metadata.parquet
- config_name: obs_metadata
  data_files: metadata/obs_metadata.parquet
- config_name: pseudobulk_differential_expression
  data_files: metadata/pseudobulk_differential_expression/train-*
dataset_info:
  features:
  - name: genes
    sequence: int64
  - name: expressions
    sequence: float32
  - name: drug
    dtype: string
  - name: sample
    dtype: string
  - name: BARCODE_SUB_LIB_ID
    dtype: string
  - name: cell_line_id
    dtype: string
  - name: moa-fine
    dtype: string
  - name: canonical_smiles
    dtype: string
  - name: pubchem_cid
    dtype: string
  - name: plate
    dtype: string
  splits:
  - name: train
    num_bytes: 1693653078843
    num_examples: 95624334
  download_size: 337644770670
  dataset_size: 1693653078843
---

# Tahoe-100M

Tahoe-100M is a giga-scale single-cell perturbation atlas consisting of over 100 million transcriptomic profiles from 50 cancer cell lines exposed to 1,100 small-molecule perturbations. Generated using Vevo Therapeutics' Mosaic high-throughput platform, Tahoe-100M enables deep, context-aware exploration of gene function, cellular states, and drug responses at unprecedented scale and resolution. This dataset is designed to power the development of next-generation AI models of cell biology, offering broad applications across systems biology, drug discovery, and precision medicine.

**Preprint**: [Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling](https://www.biorxiv.org/content/10.1101/2025.02.20.639398v1)

<img src="https://pbs.twimg.com/media/Gkpp8RObkAM-fxe?format=jpg&name=4096x4096" width="1024" height="1024">

## Quickstart

```python
from datasets import load_dataset

# Load the dataset in streaming mode
ds = load_dataset("tahoebio/Tahoe-100M", streaming=True, split="train")

# View the first record
next(iter(ds))
```

### Tutorials

Please refer to our tutorials for examples of using the data, accessing the metadata tables, and converting to and from the anndata format. The [Data Loading Tutorial](tutorials/loading_data.ipynb) provides a walkthrough.

<table>
  <thead>
    <tr>
      <th>Notebook</th>
      <th>URL</th>
      <th>Colab</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Loading the dataset from Hugging Face, accessing metadata, mapping to anndata</td>
      <td>
        <a href="https://huggingface.co/datasets/tahoebio/Tahoe-100M/blob/main/tutorials/loading_data.ipynb" target="_blank">Link</a>
      </td>
      <td>
        <a href="https://colab.research.google.com/#fileId=https://huggingface.co/datasets/tahoebio/Tahoe-100M/blob/main/tutorials/loading_data.ipynb" target="_blank">
          <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/>
        </a>
      </td>
    </tr>
  </tbody>
</table>

### Community Resources

Here are links to a few resources created by the community. We would love to feature additional tutorials: if you have built something on top of Tahoe-100M, please let us know.

<table>
  <thead>
    <tr>
      <th>Resource</th>
      <th>Contributor</th>
      <th>URL</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Analysis guide for Tahoe-100M using rapids-single-cell, scanpy and dask</td>
      <td><a href="https://github.com/scverse" target="_blank">SCVERSE</a></td>
      <td><a href="https://github.com/theislab/vevo_Tahoe_100m_analysis/tree/tahoe-DGX-fix" target="_blank">Link</a></td>
    </tr>
    <tr>
      <td>Tutorial for accessing Tahoe-100M h5ad files hosted by the Arc Institute</td>
      <td><a href="https://github.com/ArcInstitute" target="_blank">Arc Institute</a></td>
      <td><a href="https://github.com/ArcInstitute/arc-virtual-cell-atlas/blob/main/tahoe-100M/tutorial-py.ipynb" target="_blank">Link</a></td>
    </tr>
  </tbody>
</table>

## Dataset Features

We provide multiple tables with the dataset: the main data (raw counts) in the `expression_data` table, and metadata in the `gene_metadata`, `sample_metadata`, `drug_metadata`, `cell_line_metadata`, and `obs_metadata` tables.

The main data can be loaded as follows:

```python
from datasets import load_dataset

tahoe_100m_ds = load_dataset("tahoebio/Tahoe-100M", streaming=True, split="train")
```

Setting `streaming=True` returns an `IterableDataset`, so the full dataset does not need to be downloaded first. See the [tutorial](tutorials/loading_data.ipynb) for an end-to-end example.

The `expression_data` table has the following fields:

| **Field Name** | **Type** | **Description** |
|---|---|---|
| `genes` | `sequence<int64>` | Gene identifiers (integer token IDs) for each gene with non-zero expression in the cell. This sequence aligns with the `expressions` field. The `gene_metadata` table can be used to map token IDs to gene symbols or Ensembl IDs. The first entry of each row is a marker token and should be ignored (see the [data-loading tutorial](tutorials/loading_data.ipynb)). |
| `expressions` | `sequence<float32>` | Raw count values for each gene, aligned with the `genes` field. The first entry marks a CLS token and should be ignored when parsing. |
| `drug` | `string` | Name of the treatment. `DMSO_TF` marks vehicle controls; use `DMSO_TF` together with `plate` to get plate-matched controls. |
| `sample` | `string` | Unique identifier for the sample from which the cell was derived. Can be used to merge information from the `sample_metadata` table. Distinguishes replicate treatments. |
| `BARCODE_SUB_LIB_ID` | `string` | Combination of barcode and sublibrary identifiers. Unique for each cell in the dataset. Can be used as an index key when referring to the `obs_metadata` table. |
| `cell_line_id` | `string` | Unique identifier for the cancer cell line from which the cell originated. We use Cellosaurus IDs; additional identifiers such as DepMap IDs are provided in the `cell_line_metadata` table. |
| `moa-fine` | `string` | Fine-grained mechanism of action (MOA) annotation for the drug, specifying the biological process or molecular target affected. Derived from MedChemExpress and curated with GPT-based annotations. |
| `canonical_smiles` | `string` | Canonical SMILES (Simplified Molecular Input Line Entry System) string representing the molecular structure of the perturbing compound. |
| `pubchem_cid` | `string` | PubChem Compound Identifier for the drug, allowing cross-referencing with public chemical databases. An empty string is used for DMSO controls. Please cast to int before querying PubChem. |
| `plate` | `string` | Identifier for the 96-well plate (1–14) in which the mixed-cell spheroid was seeded and treated. |

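
The field descriptions above say that `genes` and `expressions` are aligned and that the first entry of each is a marker to be skipped. A minimal sketch of that decoding step follows. The helper name `decode_record`, the toy token IDs, and the toy counts are illustrative assumptions, not part of the dataset or its tutorial; real token IDs come from the `gene_metadata` table.

```python
from typing import Dict, List, Tuple


def decode_record(record: dict, token_to_symbol: Dict[int, str]) -> List[Tuple[str, float]]:
    """Return (gene_symbol, raw_count) pairs for one expression_data record."""
    # Drop the leading marker entry from both aligned sequences.
    genes = record["genes"][1:]
    counts = record["expressions"][1:]
    if len(genes) != len(counts):
        raise ValueError("genes and expressions must have the same length")
    # Unknown token IDs stay visible instead of being silently dropped.
    return [
        (token_to_symbol.get(g, f"<unknown:{g}>"), float(c))
        for g, c in zip(genes, counts)
    ]


# Toy values for illustration only (hypothetical token IDs and counts).
toy_lookup = {11: "TP53", 12: "BRCA1"}
toy_record = {"genes": [1, 11, 12], "expressions": [-2.0, 3.0, 1.0]}
print(decode_record(toy_record, toy_lookup))
```

With the real data, the lookup could plausibly be built from the `token_id` and `gene_symbol` columns of `gene_metadata`, for example `{row["token_id"]: row["gene_symbol"] for row in gene_metadata}`.
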
## Additional metadata

### Gene Metadata

```python
gene_metadata = load_dataset("tahoebio/Tahoe-100M", "gene_metadata", split="train")
```

| Column Name | Description |
|---|---|
| `gene_symbol` | The HGNC-approved gene symbol corresponding to each gene (e.g., *TP53*, *BRCA1*). |
| `ensembl_id` | The Ensembl gene identifier (e.g., *ENSG00000000003*) based on Ensembl release 109 and genome build 38. |
| `token_id` | An integer token ID used to represent each gene. This is the ID used in the `genes` field in the main data. |

### Sample Metadata

```python
sample_metadata = load_dataset("tahoebio/Tahoe-100M", "sample_metadata", split="train")
```

The `sample_metadata` table provides aggregate quality metrics for each sample, as well as the treatment concentration.

| Column Name | Description |
|---|---|
| `sample` | Unique identifier for the sample from which the cell was derived. Unique key for this table. |
| `plate` | Identifier (1–14) for the 96-well plate for the sample. |
| `mean_gene_count` | Average number of unique genes detected per cell for the given sample. |
| `mean_tscp_count` | Average number of transcripts (UMIs) detected per cell in the sample. |
| `mean_mread_count` | Average number of reads per cell. |
| `mean_pcnt_mito` | Mean percentage of total reads that map to mitochondrial genes, across cells in the sample. |
| `drug` | Name of the treatment used to perturb the cells in the sample. |
| `drugname_drugconc` | String combining the compound name, concentration and concentration unit (e.g., `[('8-Hydroxyquinoline',0.05,'uM')]`), used to uniquely label each treatment condition. |

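
Two steps described in the tables can be sketched in plain Python: reading the concentration out of `drugname_drugconc`, and finding plate-matched `DMSO_TF` controls for a sample. This is only a sketch under stated assumptions: it assumes `drugname_drugconc` always holds a Python-literal string shaped like the example `[('8-Hydroxyquinoline',0.05,'uM')]`, and the helper names and toy rows (sample IDs, plate labels) are hypothetical.

```python
import ast
from typing import Dict, List


def parse_conditions(drugname_drugconc: str) -> List[dict]:
    """Parse a string such as "[('8-Hydroxyquinoline',0.05,'uM')]" into dicts."""
    # literal_eval only evaluates Python literals, never arbitrary code.
    parsed = ast.literal_eval(drugname_drugconc)
    return [{"name": n, "conc": float(c), "unit": u} for n, c, u in parsed]


def plate_matched_controls(rows: List[dict], sample_id: str,
                           control: str = "DMSO_TF") -> List[str]:
    """Return control sample IDs that share a plate with `sample_id`."""
    plate_of: Dict[str, str] = {r["sample"]: r["plate"] for r in rows}
    target_plate = plate_of[sample_id]
    return [r["sample"] for r in rows
            if r["drug"] == control and r["plate"] == target_plate]


# Hypothetical rows shaped like sample_metadata; IDs and plate labels are made up.
toy_rows = [
    {"sample": "s1", "plate": "plate1", "drug": "DMSO_TF"},
    {"sample": "s2", "plate": "plate1", "drug": "8-Hydroxyquinoline"},
    {"sample": "s3", "plate": "plate2", "drug": "DMSO_TF"},
]
print(parse_conditions("[('8-Hydroxyquinoline',0.05,'uM')]"))
print(plate_matched_controls(toy_rows, "s2"))
```

The same plate-matching idea applies at the cell level, since each `expression_data` record carries its own `drug` and `plate` fields.
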
### Drug Metadata

```python
drug_metadata = load_dataset("tahoebio/Tahoe-100M", "drug_metadata", split="train")
```

The `drug_metadata` table has additional information about each treatment.

| Column Name | Description |
|---|---|
| `drug` | Name of the treatment used to perturb the cells in the sample. Unique key for this table. |
| `targets` | List of gene symbols representing the known molecular targets of the compound. Targets were proposed by GPT-4o based on compound names and then validated against MedChemExpress information. |
| `moa-broad` | Broad classification of the compound’s mechanism of action (MOA), typically categorized as "inhibitor/antagonist," "activator/agonist," or "unclear." GPT-4o inferred this using compound target data and curated descriptions from MedChemExpress. |
| `moa-fine` | Specific functional annotation of the compound's MOA (e.g., "Proteasome inhibitor" or "MEK inhibitor"). These fine-grained labels were selected from a curated list of 25 MOA categories and assigned by GPT-4o with validation against compound descriptions. |
| `human-approved` | Indicates whether the compound is approved for human use ("yes" or "no"). GPT-4o provided these labels using prior knowledge and validation from public sources such as clinicaltrials.gov. |
| `clinical-trials` | Indicates whether the compound has been evaluated in any registered clinical trials ("yes" or "no"). Determined using GPT-4o and corroborated using clinicaltrials.gov searches. |
| `gpt-notes-approval` | Contextual notes generated by GPT-4o summarizing the compound’s approval status, common clinical usage, or nuances such as formulation-specific approvals. |
| `canonical_smiles` | The compound's SMILES (Simplified Molecular Input Line Entry System) representation, capturing its molecular structure as a text string. |
| `pubchem_cid` | The PubChem Compound Identifier (CID), a unique numerical ID linking the compound to its entry in the PubChem database. |

### Cell Line Metadata

```python
cell_line_metadata = load_dataset("tahoebio/Tahoe-100M", "cell_line_metadata", split="train")
```

The `cell_line_metadata` table has additional information about the key driver mutations for each cell line.

| Column Name | Description |
|---|---|
| `cell_name` | Standard name of the cancer cell line (e.g., *A549*). |
| `Cell_ID_DepMap` | Unique identifier for the cell line in the DepMap project (e.g., *ACH-000681*). |
| `Cell_ID_Cellosaur` | Cellosaurus accession ID (e.g., *CVCL_0023*). This is the ID used in the main dataset. |
| `Organ` | Tissue or organ of origin for the cell line (e.g., *Lung*), used to interpret lineage-specific responses and biological context. |
| `Driver_Gene_Symbol` | HGNC-approved symbol of a known or putative driver gene with functional alterations in this cell line (e.g., *KRAS*, *CDKN2A*). We report a curated list of driver mutations per cell line. |
| `Driver_VarZyg` | Zygosity of the driver variant (e.g., *Hom* for homozygous, *Het* for heterozygous). |
| `Driver_VarType` | Type of genetic alteration (e.g., *Missense*, *Frameshift*, *Stopgain*, *Deletion*). |
| `Driver_ProtEffect_or_CdnaEffect` | Specific protein or cDNA-level annotation of the mutation (e.g., *p.G12S*, *p.Q37*), providing precise information on the variant’s consequence. |
| `Driver_Mech_InferDM` | Inferred functional mechanism of the mutation (e.g., *LoF* for loss-of-function, *GoF* for gain-of-function). |
| `Driver_GeneType_DM` | Classification of the driver gene as an *Oncogene* or *Suppressor*. |

## Citation

Please cite:

```
@article{zhang2025tahoe,
  title={Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling},
  author={Zhang, Jesse and Ubas, Airol A and de Borja, Richard and Svensson, Valentine and Thomas, Nicole and Thakar, Neha and Lai, Ian and Winters, Aidan and Khan, Umair and Jones, Matthew G and others},
  journal={bioRxiv},
  pages={2025--02},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}
```

Use cases: