Download link:

https://modelscope.cn/datasets/tahoebio/Tahoe-100M

Resource overview:

# Tahoe-100M

Tahoe-100M is a giga-scale single-cell perturbation atlas consisting of over 100 million transcriptomic profiles from 50 cancer cell lines exposed to 1,100 small-molecule perturbations. Generated using Vevo Therapeutics' Mosaic high-throughput platform, Tahoe-100M enables deep, context-aware exploration of gene function, cellular states, and drug responses at unprecedented scale and resolution. This dataset is designed to power the development of next-generation AI models of cell biology, offering broad applications across systems biology, drug discovery, and precision medicine.

**Preprint**: [Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling](https://www.biorxiv.org/content/10.1101/2025.02.20.639398v1)

<img src="https://pbs.twimg.com/media/Gkpp8RObkAM-fxe?format=jpg&name=4096x4096" width="1024" height="1024">

## Quickstart

```python
from datasets import load_dataset

# Load the dataset in streaming mode
ds = load_dataset("tahoebio/Tahoe-100M", streaming=True, split="train")

# View the first record (returned as a batch of size 1)
next(ds.iter(1))
```

### Tutorials

Please refer to our tutorials for examples on using the data, accessing metadata tables, and converting to/from the anndata format. See the [Data Loading Tutorial](tutorials/loading_data.ipynb) for a walkthrough.

<table>
  <thead>
    <tr>
      <th>Notebook</th>
      <th>URL</th>
      <th>Colab</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Loading the dataset from Hugging Face, accessing metadata, mapping to anndata</td>
      <td>
        <a href="https://huggingface.co/datasets/tahoebio/Tahoe-100M/blob/main/tutorials/loading_data.ipynb" target="_blank">
          Link
        </a>
      </td>
      <td>
        <a href="https://colab.research.google.com/#fileId=https://huggingface.co/datasets/tahoebio/Tahoe-100M/blob/main/tutorials/loading_data.ipynb" target="_blank">
          <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/>
        </a>
      </td>
    </tr>
  </tbody>
</table>

### Community Resources

Here are links to a few resources created by the community. If you have built something on top of Tahoe-100M, please let us know; we would love to feature your work.

<table>
  <thead>
    <tr>
      <th>Resource</th>
      <th>Contributor</th>
      <th>URL</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Analysis guide for Tahoe-100M using rapids-single-cell, scanpy and dask</td>
      <td><a href="https://github.com/scverse" target="_blank">SCVERSE</a></td>
      <td><a href="https://github.com/theislab/vevo_Tahoe_100m_analysis/tree/tahoe-DGX-fix" target="_blank">Link</a></td>
    </tr>
    <tr>
      <td>Tutorial for accessing Tahoe-100M h5ad files hosted by the Arc Institute</td>
      <td><a href="https://github.com/ArcInstitute" target="_blank">Arc Institute</a></td>
      <td><a href="https://github.com/ArcInstitute/arc-virtual-cell-atlas/blob/main/tahoe-100M/tutorial-py.ipynb" target="_blank">Link</a></td>
    </tr>
  </tbody>
</table>

## Dataset Features

The dataset ships as multiple tables: the main data (raw counts) in the `expression_data` table, and metadata in the `gene_metadata`, `sample_metadata`, `drug_metadata`, `cell_line_metadata`, and `obs_metadata` tables.

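Each record in `expression_data` stores two parallel sequences, `genes` (integer token IDs) and `expressions` (raw counts), whose first entry is a marker to be skipped; the `gene_metadata` table maps `token_id` to `gene_symbol`. The following is a minimal sketch of that decoding step, not the authors' reference implementation. The helper name `decode_record`, the toy token IDs, and the toy marker values are hypothetical and chosen only for illustration; real values come from the dataset itself.

```python
from typing import Dict, List, Mapping


def decode_record(
    record: Mapping[str, List],
    token_to_symbol: Mapping[int, str],
) -> Dict[str, float]:
    """Map one record's token IDs to gene symbols, skipping the marker entry."""
    # The first entry of both sequences is a marker, not a gene measurement.
    genes = record["genes"][1:]
    counts = record["expressions"][1:]
    if len(genes) != len(counts):
        raise ValueError("`genes` and `expressions` must be aligned")
    return {token_to_symbol[t]: float(c) for t, c in zip(genes, counts)}


# Hypothetical lookup; in practice, build it from the `token_id` and
# `gene_symbol` columns of the `gene_metadata` table.
toy_token_to_symbol = {101: "TP53", 102: "BRCA1", 103: "KRAS"}

# Hypothetical record; the leading values (1 and -2.0) stand in for the marker.
toy_record = {"genes": [1, 101, 103], "expressions": [-2.0, 4.0, 1.0]}

decoded = decode_record(toy_record, toy_token_to_symbol)
print(decoded)
```

Genes with zero counts are not stored in a record, so a gene missing from the returned dictionary can be read as a count of zero for that cell.
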
The main data can be downloaded as follows:

```python
from datasets import load_dataset

tahoe_100m_ds = load_dataset("tahoebio/Tahoe-100M", streaming=True, split="train")
```

Setting `streaming=True` instantiates an `IterableDataset`, so the full dataset does not need to be downloaded first. See the [tutorial](tutorials/loading_data.ipynb) for an end-to-end example.

The `expression_data` table has the following fields:

| **Field Name** | **Type** | **Description** |
|----------------|----------|-----------------|
| `genes` | `sequence<int64>` | Gene identifiers (integer token IDs) for each gene with non-zero expression in the cell. This sequence aligns with the `expressions` field. The `gene_metadata` table can be used to map token IDs to gene symbols or Ensembl IDs. The first entry in each row is a marker token and should be ignored (see the [data-loading tutorial](tutorials/loading_data.ipynb)). |
| `expressions` | `sequence<float32>` | Raw count values for each gene, aligned with the `genes` field. The first entry marks a CLS token and should be ignored when parsing. |
| `drug` | `string` | Name of the treatment. `DMSO_TF` marks vehicle controls; use `DMSO_TF` together with `plate` to get plate-matched controls. |
| `sample` | `string` | Unique identifier for the sample from which the cell was derived. Can be used to merge information from the `sample_metadata` table. Distinguishes replicate treatments. |
| `BARCODE_SUB_LIB_ID` | `string` | Combination of barcode and sublibrary identifiers. Unique for each cell in the dataset. Can be used as an index key when referencing the `obs_metadata` table. |
| `cell_line_id` | `string` | Unique identifier for the cancer cell line from which the cell originated. We use Cellosaurus IDs here; additional identifiers such as DepMap IDs are provided in the `cell_line_metadata` table. |
| `moa-fine` | `string` | Fine-grained mechanism of action (MOA) annotation for the drug, specifying the biological process or molecular target affected. Derived from MedChemExpress and curated with GPT-based annotations. |
| `canonical_smiles` | `string` | Canonical SMILES (Simplified Molecular Input Line Entry System) string representing the molecular structure of the perturbing compound. |
| `pubchem_cid` | `string` | PubChem Compound Identifier for the drug, allowing cross-referencing with public chemical databases. An empty string is used for DMSO controls. Cast to `int` before querying PubChem. |
| `plate` | `string` | Identifier for the 96-well plate (1–14) in which the mixed-cell spheroid was seeded and treated. |

## Additional metadata

### Gene Metadata

```python
from datasets import load_dataset

gene_metadata = load_dataset("tahoebio/Tahoe-100M", "gene_metadata", split="train")
```

| Column Name | Description |
|-------------|-------------|
| `gene_symbol` | The HGNC-approved gene symbol corresponding to each gene (e.g., *TP53*, *BRCA1*). |
| `ensembl_id` | The Ensembl gene identifier (e.g., *ENSG00000000003*) based on Ensembl release 109 and genome build 38. |
| `token_id` | An integer token ID used to represent each gene. This is the ID used in the `genes` field in the main data. |

### Sample Metadata

```python
from datasets import load_dataset

sample_metadata = load_dataset("tahoebio/Tahoe-100M", "sample_metadata", split="train")
```

The `sample_metadata` table provides aggregate quality metrics for each sample, as well as the treatment concentration.

| Column Name | Description |
|-------------|-------------|
| `sample` | Unique identifier for the sample from which the cell was derived. Unique key for this table. |
| `plate` | Identifier (1–14) for the 96-well plate containing the sample. |
| `mean_gene_count` | Average number of unique genes detected per cell for the given sample. |
| `mean_tscp_count` | Average number of transcripts (UMIs) detected per cell in the sample. |
| `mean_mread_count` | Average number of reads per cell. |
| `mean_pcnt_mito` | Mean percentage of total reads that map to mitochondrial genes, across cells in the sample. |
| `drug` | Name of the treatment used to perturb the cells in the sample. |
| `drugname_drugconc` | String combining the compound name, concentration and concentration unit (e.g., `[('8-Hydroxyquinoline',0.05,'uM')]`), used to uniquely label each treatment condition. |

### Drug Metadata

```python
from datasets import load_dataset

drug_metadata = load_dataset("tahoebio/Tahoe-100M", "drug_metadata", split="train")
```

The `drug_metadata` table has additional information about each treatment.

| Column Name | Description |
|-------------|-------------|
| `drug` | Name of the treatment used to perturb the cells in the sample. Unique key for this table. |
| `targets` | List of gene symbols representing the known molecular targets of the compound. Targets were proposed by GPT-4o based on compound names and then validated against MedChemExpress information. |
| `moa-broad` | Broad classification of the compound's mechanism of action (MOA), typically categorized as "inhibitor/antagonist," "activator/agonist," or "unclear." GPT-4o inferred this using compound target data and curated descriptions from MedChemExpress. |
| `moa-fine` | Specific functional annotation of the compound's MOA (e.g., "Proteasome inhibitor" or "MEK inhibitor"). These fine-grained labels were selected from a curated list of 25 MOA categories and assigned by GPT-4o with validation against compound descriptions. |
| `human-approved` | Indicates whether the compound is approved for human use ("yes" or "no"). GPT-4o provided these labels using prior knowledge and validation from public sources such as clinicaltrials.gov. |
| `clinical-trials` | Indicates whether the compound has been evaluated in any registered clinical trials ("yes" or "no"). Determined using GPT-4o and corroborated using clinicaltrials.gov searches. |
| `gpt-notes-approval` | Contextual notes generated by GPT-4o summarizing the compound's approval status, common clinical usage, or nuances such as formulation-specific approvals. |
| `canonical_smiles` | The compound's SMILES (Simplified Molecular Input Line Entry System) representation, capturing its molecular structure as a text string. |
| `pubchem_cid` | The PubChem Compound Identifier (CID), a unique numerical ID linking the compound to its entry in the PubChem database. |

### Cell Line Metadata

```python
from datasets import load_dataset

cell_line_metadata = load_dataset("tahoebio/Tahoe-100M", "cell_line_metadata", split="train")
```

The `cell_line_metadata` table has additional information about the key driver mutations for each cell line.

| Column Name | Description |
|-------------|-------------|
| `cell_name` | Standard name of the cancer cell line (e.g., *A549*). |
| `Cell_ID_DepMap` | Unique identifier for the cell line in the DepMap project (e.g., *ACH-000681*). |
| `Cell_ID_Cellosaur` | Cellosaurus accession ID (e.g., *CVCL_0023*). This is the ID used in the main dataset. |
| `Organ` | Tissue or organ of origin for the cell line (e.g., *Lung*), used to interpret lineage-specific responses and biological context. |
| `Driver_Gene_Symbol` | HGNC-approved symbol of a known or putative driver gene with functional alterations in this cell line (e.g., *KRAS*, *CDKN2A*). We report a curated list of driver mutations per cell line. |
| `Driver_VarZyg` | Zygosity of the driver variant (e.g., *Hom* for homozygous, *Het* for heterozygous). |
| `Driver_VarType` | Type of genetic alteration (e.g., *Missense*, *Frameshift*, *Stopgain*, *Deletion*). |
| `Driver_ProtEffect_or_CdnaEffect` | Specific protein or cDNA-level annotation of the mutation (e.g., *p.G12S*, *p.Q37*), providing precise information on the variant's consequence. |
| `Driver_Mech_InferDM` | Inferred functional mechanism of the mutation (e.g., *LoF* for loss-of-function, *GoF* for gain-of-function). |
| `Driver_GeneType_DM` | Classification of the driver gene as an *Oncogene* or *Suppressor*. |

## Citation

Please cite:

```
@article{zhang2025tahoe,
  title={Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling},
  author={Zhang, Jesse and Ubas, Airol A and de Borja, Richard and Svensson, Valentine and Thomas, Nicole and Thakar, Neha and Lai, Ian and Winters, Aidan and Khan, Umair and Jones, Matthew G and others},
  journal={bioRxiv},
  pages={2025--02},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}
```


Use cases: