tahoebio/Tahoe-x1-embeddings
Source: Hugging Face. Updated 2025-12-03; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/tahoebio/Tahoe-x1-embeddings
---
license: apache-2.0
task_categories:
- feature-extraction
tags:
- biology
- single-cell
- transcriptomics
- embeddings
- drug-discovery
- cancer
- perturbation
- foundation-model
size_categories:
- 10M<n<100M
language:
- en
pretty_name: Tahoe-x1 Embeddings on Tahoe-100M
configs:
- config_name: default
  data_files:
  - split: train
    path: data/*.parquet
---
# Tahoe-x1 Embeddings on Tahoe-100M
Precomputed embeddings from the [Tahoe-x1](https://huggingface.co/tahoebio/Tahoe-x1) foundation model applied to the [Tahoe-100M](https://huggingface.co/datasets/tahoebio/Tahoe-100M) dataset. This dataset provides high-dimensional representations of single-cell transcriptomic profiles from cancer cell lines under small-molecule perturbations.
## Overview
This dataset contains cell embeddings generated using the **Tahoe-x1-3B** model, a 3-billion-parameter, perturbation-trained single-cell foundation model. The embeddings capture cellular states across:
- **50 cancer cell lines** spanning multiple tissue types
- **~1,100 small-molecule compounds** with diverse mechanisms of action
- **100+ million single-cell profiles** from the original [Tahoe-100M dataset](https://huggingface.co/datasets/tahoebio/Tahoe-100M)
These embeddings enable downstream applications such as drug response prediction, cell state classification, and perturbation effect analysis without requiring re-computation from raw expression data.
For detailed information about the model architecture and training, see the [Tahoe-x1 model card](https://huggingface.co/tahoebio/Tahoe-x1). For information about the source data, see the [Tahoe-100M dataset card](https://huggingface.co/datasets/tahoebio/Tahoe-100M).
## Dataset Structure
Each row in the dataset represents a single-cell profile with its corresponding embedding:
| Column | Type | Description |
|--------|------|-------------|
| `drug` | `string` | Drug compound name (e.g., "8-Hydroxyquinoline") |
| `sample` | `string` | Sample identifier from Tahoe-100M (e.g., "smp_1783") |
| `cell_line` | `string` | Cellosaurus cell line identifier (e.g., "CVCL_1717", "CVCL_0480") |
| `BARCODE_SUB_LIB_ID` | `string` | Unique barcode identifier for the sub-library (19 characters) |
| `mosaicfm-3b-prod-cont-MFMv2` | `list[float]` | Cell embedding vector from Tahoe-x1-3B |
**Note**: The embedding column name reflects the internal model version used during generation.
Data files are stored in the `data/` directory in Parquet format for efficient streaming and loading.
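As a rough illustration of the schema above, the sketch below separates each row into its identifying fields and its embedding vector. `split_row` and `collect` are hypothetical helper names, and the sample rows are made-up stand-ins (with 3-element vectors for brevity) rather than real records. In practice `rows` would be any iterable of row dicts, such as a streamed dataset.

```python
from itertools import islice

# Column names taken from the schema table above.
EMBEDDING_COLUMN = "mosaicfm-3b-prod-cont-MFMv2"
KEY_COLUMNS = ("drug", "sample", "cell_line", "BARCODE_SUB_LIB_ID")


def split_row(row):
    """Return (keys, vector) for one row dict."""
    keys = {name: row[name] for name in KEY_COLUMNS}
    vector = [float(x) for x in row[EMBEDDING_COLUMN]]
    return keys, vector


def collect(rows, limit):
    """Split the first `limit` rows of any iterable of row dicts."""
    keys, vectors = [], []
    for row in islice(rows, limit):
        k, v = split_row(row)
        keys.append(k)
        vectors.append(v)
    return keys, vectors


# Made-up rows for illustration only; the barcode values are placeholders.
fake_rows = [
    {"drug": "8-Hydroxyquinoline", "sample": "smp_1783", "cell_line": "CVCL_1717",
     "BARCODE_SUB_LIB_ID": "hypothetical-id-001", EMBEDDING_COLUMN: [0.1, 0.2, 0.3]},
    {"drug": "8-Hydroxyquinoline", "sample": "smp_1783", "cell_line": "CVCL_0480",
     "BARCODE_SUB_LIB_ID": "hypothetical-id-002", EMBEDDING_COLUMN: [0.4, 0.5, 0.6]},
]

keys, vectors = collect(fake_rows, limit=2)
print(keys[0]["cell_line"], len(vectors[0]))
```

Keeping the keys and vectors in parallel lists makes it straightforward to hand the vectors to a numerical library later while retaining the labels for each cell.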
## Quickstart
```python
from datasets import load_dataset
# Stream the dataset without downloading
ds = load_dataset("tahoebio/Tahoe-x1-embeddings", streaming=True, split="train")
# Get first example
example = next(iter(ds))
print(example)
```
**Note**: If you encounter schema parsing errors, use this alternative:
```python
from datasets import load_dataset
# Load using parquet directly
ds = load_dataset(
    "parquet",
    data_files="hf://datasets/tahoebio/Tahoe-x1-embeddings/data/*.parquet",
    streaming=True,
    split="train",
)
```
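Once rows are streaming, one possible next step toward the perturbation analyses mentioned in the Overview is to average embeddings per drug and cell line in a single pass, without holding every row in memory. The following is a minimal sketch under that assumption: `group_means` is a hypothetical helper, not part of any Tahoe tooling, and the rows shown are synthetic 2-element examples.

```python
EMBEDDING_COLUMN = "mosaicfm-3b-prod-cont-MFMv2"


def group_means(rows, group_columns=("drug", "cell_line")):
    """Average embedding per group, computed in one pass over `rows`."""
    sums = {}
    counts = {}
    for row in rows:
        group = tuple(row[c] for c in group_columns)
        vector = row[EMBEDDING_COLUMN]
        if group not in sums:
            sums[group] = [0.0] * len(vector)
            counts[group] = 0
        total = sums[group]
        for i, value in enumerate(vector):
            total[i] += value
        counts[group] += 1
    # Divide each running sum by its group's row count.
    return {g: [s / counts[g] for s in total] for g, total in sums.items()}


# Synthetic rows for illustration only.
fake_rows = [
    {"drug": "8-Hydroxyquinoline", "cell_line": "CVCL_1717", EMBEDDING_COLUMN: [1.0, 2.0]},
    {"drug": "8-Hydroxyquinoline", "cell_line": "CVCL_1717", EMBEDDING_COLUMN: [3.0, 4.0]},
    {"drug": "8-Hydroxyquinoline", "cell_line": "CVCL_0480", EMBEDDING_COLUMN: [5.0, 6.0]},
]

means = group_means(fake_rows)
print(means[("8-Hydroxyquinoline", "CVCL_1717")])
```

With the real dataset, `rows` could be a bounded slice of the stream, for example `itertools.islice(ds, 10_000)`. The ordering of rows in the stream is not documented here, so a prefix of the stream should not be assumed to be a representative sample.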
## Source Information
The embeddings were generated by running the [Tahoe-x1-3B](https://huggingface.co/tahoebio/Tahoe-x1) model on the [Tahoe-100M](https://huggingface.co/datasets/tahoebio/Tahoe-100M) dataset.
## Linking to Tahoe-100M Metadata
To enrich these embeddings with additional metadata from Tahoe-100M:
```python
import pandas as pd  # pandas must be installed for .to_pandas()
from datasets import load_dataset

# Load embeddings (non-streaming: this downloads and materializes the full split)
embeddings = load_dataset("tahoebio/Tahoe-x1-embeddings", split="train")

# Load drug metadata
drug_metadata = load_dataset("tahoebio/Tahoe-100M", "drug_metadata", split="train")

# Load cell line metadata
cell_line_metadata = load_dataset("tahoebio/Tahoe-100M", "cell_line_metadata", split="train")

# Convert each dataset to a pandas DataFrame
df_emb = embeddings.to_pandas()
df_drugs = drug_metadata.to_pandas()
df_cells = cell_line_metadata.to_pandas()

# Inner-join on drug name, then on cell line; this assumes both metadata
# tables expose matching `drug` and `cell_line` columns
df_enriched = df_emb.merge(df_drugs, on="drug").merge(df_cells, on="cell_line")
print(f"Enriched dataset shape: {df_enriched.shape}")
```
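Because the pandas merge above materializes the whole embeddings table, a lighter-weight alternative is to index the small metadata tables in dictionaries and attach them to rows as they stream past. The sketch below illustrates that idea. `build_lookup` and `enrich` are hypothetical helpers, the `moa` and `organ` fields are placeholder names rather than confirmed metadata columns, and the join keys `drug` and `cell_line` are assumed to exist in the metadata, as in the merge above.

```python
def build_lookup(records, key):
    """Index metadata records (dicts) by the value of `key`.

    If several records share a key, the last one wins. This differs from a
    pandas merge, which would produce one output row per match.
    """
    return {record[key]: record for record in records}


def enrich(rows, drug_lookup, cell_lookup):
    """Yield each embedding row merged with its drug and cell-line metadata.

    Rows lacking a match in either lookup are skipped, like an inner join.
    """
    for row in rows:
        drug_info = drug_lookup.get(row["drug"])
        cell_info = cell_lookup.get(row["cell_line"])
        if drug_info is None or cell_info is None:
            continue
        # Row fields come last so they take precedence on name clashes.
        yield {**drug_info, **cell_info, **row}


# Synthetic records for illustration only; field names are placeholders.
drug_records = [{"drug": "8-Hydroxyquinoline", "moa": "placeholder"}]
cell_records = [{"cell_line": "CVCL_1717", "organ": "placeholder"}]
fake_rows = [
    {"drug": "8-Hydroxyquinoline", "cell_line": "CVCL_1717", "sample": "smp_1783"},
    {"drug": "unlisted-drug", "cell_line": "CVCL_1717", "sample": "smp_0000"},
]

enriched = list(enrich(
    fake_rows,
    build_lookup(drug_records, "drug"),
    build_lookup(cell_records, "cell_line"),
))
print(len(enriched), sorted(enriched[0]))
```

Iterating a `datasets.Dataset` yields one dict per row, so the loaded metadata objects could be passed to `build_lookup` directly, and a streamed embeddings dataset could be passed as `rows`.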
## License
Apache 2.0 (inherited from Tahoe-x1 model)
## Resources
- 🤗 [Tahoe-x1 Model Card](https://huggingface.co/tahoebio/Tahoe-x1)
- 🤗 [Tahoe-100M Dataset Card](https://huggingface.co/datasets/tahoebio/Tahoe-100M)
- 🚀 [Tahoe-x1 Interactive Demo](https://huggingface.co/spaces/tahoebio/Tahoe-x1)
- 📧 Contact: admin@tahoebio.ai
## Acknowledgments
This dataset builds upon the foundational work of Tahoe Therapeutics and Vevo Therapeutics in creating large-scale single-cell perturbation atlases and state-of-the-art foundation models for cellular biology.
Provider: tahoebio



