tahoebio/Tahoe-x1-embeddings

Name: tahoebio/Tahoe-x1-embeddings
Creator: tahoebio
Published: 2025-12-03 20:52:09
License: Apache-2.0

Source: Hugging Face · Updated: 2025-12-03 · Indexed: 2025-12-20

Download link:

https://hf-mirror.com/datasets/tahoebio/Tahoe-x1-embeddings


Description:

---
license: apache-2.0
task_categories:
- feature-extraction
tags:
- biology
- single-cell
- transcriptomics
- embeddings
- drug-discovery
- cancer
- perturbation
- foundation-model
size_categories:
- 10M<n<100M
language:
- en
pretty_name: Tahoe-x1 Embeddings on Tahoe-100M
configs:
- config_name: default
  data_files:
  - split: train
    path: data/*.parquet
---

# Tahoe-x1 Embeddings on Tahoe-100M

Precomputed embeddings from the [Tahoe-x1](https://huggingface.co/tahoebio/Tahoe-x1) foundation model applied to the [Tahoe-100M](https://huggingface.co/datasets/tahoebio/Tahoe-100M) dataset. This dataset provides high-dimensional representations of single-cell transcriptomic profiles from cancer cell lines under small-molecule perturbations.

## Overview

This dataset contains cell embeddings generated using the **Tahoe-x1-3B** model, a 3 billion parameter perturbation-trained single-cell foundation model. The embeddings capture cellular states across:

- **50 cancer cell lines** spanning multiple tissue types
- **~1,100 small-molecule compounds** with diverse mechanisms of action
- **100+ million single-cell profiles** from the original [Tahoe-100M dataset](https://huggingface.co/datasets/tahoebio/Tahoe-100M)

These embeddings enable downstream applications such as drug response prediction, cell state classification, and perturbation effect analysis without requiring re-computation from raw expression data.

For detailed information about the model architecture and training, see the [Tahoe-x1 model card](https://huggingface.co/tahoebio/Tahoe-x1). For information about the source data, see the [Tahoe-100M dataset card](https://huggingface.co/datasets/tahoebio/Tahoe-100M).

## Dataset Structure

Each row in the dataset represents a single-cell profile with its corresponding embedding:

| Column | Type | Description |
|--------|------|-------------|
| `drug` | `string` | Drug compound name (e.g., "8-Hydroxyquinoline") |
| `sample` | `string` | Sample identifier from Tahoe-100M (e.g., "smp_1783") |
| `cell_line` | `string` | Cellosaurus cell line identifier (e.g., "CVCL_1717", "CVCL_0480") |
| `BARCODE_SUB_LIB_ID` | `string` | Unique barcode identifier for the sub-library (19 characters) |
| `mosaicfm-3b-prod-cont-MFMv2` | `list[float]` | Cell embedding vector from Tahoe-x1-3B |

**Note**: The embedding column name reflects the internal model version used during generation.

Data files are stored in the `data/` directory in Parquet format for efficient streaming and loading.

## Quickstart

```python
from datasets import load_dataset

# Stream the dataset without downloading
ds = load_dataset("tahoebio/Tahoe-x1-embeddings", streaming=True, split="train")

# Get first example
example = next(iter(ds))
print(example)
```

**Note**: If you encounter schema parsing errors, use this alternative:

```python
from datasets import load_dataset

# Load using parquet directly
ds = load_dataset(
    "parquet",
    data_files="hf://datasets/tahoebio/Tahoe-x1-embeddings/data/*.parquet",
    streaming=True,
    split="train"
)
```

## Source Information

Embeddings generated using the [Tahoe-x1-3B](https://huggingface.co/tahoebio/Tahoe-x1) model on the [Tahoe-100M](https://huggingface.co/datasets/tahoebio/Tahoe-100M) dataset.

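Returning to the streamed rows from the Quickstart: each row arrives as a plain dictionary, so standard-library code is enough to take a bounded sample and summarize it. The sketch below averages embedding vectors per cell line over the first `n` rows. The helper name `mean_embedding_by_cell_line` and the toy rows are illustrative assumptions, not part of the dataset or of the `datasets` API; only the column names are taken from the Dataset Structure table.

```python
from collections import defaultdict
from itertools import islice

# Embedding column name as listed in the Dataset Structure table.
EMBEDDING_COLUMN = "mosaicfm-3b-prod-cont-MFMv2"


def mean_embedding_by_cell_line(rows, n, column=EMBEDDING_COLUMN):
    """Average the embedding vectors of the first `n` rows, grouped by cell line."""
    sums = {}
    counts = defaultdict(int)
    for row in islice(rows, n):
        key = row["cell_line"]
        vector = row[column]
        if key not in sums:
            sums[key] = list(vector)
        else:
            sums[key] = [a + b for a, b in zip(sums[key], vector)]
        counts[key] += 1
    return {key: [v / counts[key] for v in total] for key, total in sums.items()}


# Toy rows standing in for the streamed dataset (values are made up).
toy_rows = [
    {"cell_line": "CVCL_1717", EMBEDDING_COLUMN: [1.0, 2.0]},
    {"cell_line": "CVCL_1717", EMBEDDING_COLUMN: [3.0, 4.0]},
    {"cell_line": "CVCL_0480", EMBEDDING_COLUMN: [0.5, 0.5]},
]
means = mean_embedding_by_cell_line(toy_rows, n=3)
print(means)
```

With the real data, the streamed `ds` object from the Quickstart could be passed in place of `toy_rows`; because `islice` stops after `n` rows, only that many rows are pulled from the stream.
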
## Linking to Tahoe-100M Metadata

To enrich these embeddings with additional metadata from Tahoe-100M:

```python
import pandas as pd
from datasets import load_dataset

# Load embeddings
embeddings = load_dataset("tahoebio/Tahoe-x1-embeddings", split="train")

# Load drug metadata
drug_metadata = load_dataset("tahoebio/Tahoe-100M", "drug_metadata", split="train")

# Load cell line metadata
cell_line_metadata = load_dataset("tahoebio/Tahoe-100M", "cell_line_metadata", split="train")

# Merge using pandas
df_emb = embeddings.to_pandas()
df_drugs = drug_metadata.to_pandas()
df_cells = cell_line_metadata.to_pandas()

# Join on drug name and cell line
df_enriched = df_emb.merge(df_drugs, on='drug').merge(df_cells, on='cell_line')
print(f"Enriched dataset shape: {df_enriched.shape}")
```

## License

Apache 2.0 (inherited from Tahoe-x1 model)

## Resources

- 🤗 [Tahoe-x1 Model Card](https://huggingface.co/tahoebio/Tahoe-x1)
- 🤗 [Tahoe-100M Dataset Card](https://huggingface.co/datasets/tahoebio/Tahoe-100M)
- 🚀 [Tahoe-x1 Interactive Demo](https://huggingface.co/spaces/tahoebio/Tahoe-x1)
- 📧 Contact: admin@tahoebio.ai

## Acknowledgments

This dataset builds upon the foundational work of Tahoe Therapeutics and Vevo Therapeutics in creating large-scale single-cell perturbation atlases and state-of-the-art foundation models for cellular biology.
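A note on the join in "Linking to Tahoe-100M Metadata": that example converts the full embedding table to pandas, which may be impractical at this dataset's scale. As one possible alternative, the sketch below keeps the embeddings streamed and attaches metadata through a dictionary lookup. The function `enrich_rows`, the `moa` field, and the toy records are hypothetical; the only thing carried over from the linking example is the assumption that the metadata table shares a `drug` key with the embeddings.

```python
def enrich_rows(rows, metadata_records, key):
    """Lazily attach metadata to each row by matching on `key` (a left join)."""
    lookup = {record[key]: record for record in metadata_records}
    for row in rows:
        extra = lookup.get(row[key], {})
        # Row fields win on name clashes, so original columns are never overwritten.
        yield {**extra, **row}


# Toy stand-ins; the "moa" field is a made-up metadata column.
toy_embeddings = [
    {"drug": "8-Hydroxyquinoline", "cell_line": "CVCL_1717"},
    {"drug": "unlisted-compound", "cell_line": "CVCL_0480"},
]
toy_drug_metadata = [{"drug": "8-Hydroxyquinoline", "moa": "example"}]

enriched = list(enrich_rows(toy_embeddings, toy_drug_metadata, key="drug"))
print(enriched)
```

Unlike the default inner join of `DataFrame.merge`, this sketch keeps rows whose key has no metadata match and simply leaves them unenriched. The small metadata table is held in memory while the embedding rows are processed one at a time.
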

