slaf-project/Tahoe-100M
Hugging Face · Updated 2026-01-30 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/slaf-project/Tahoe-100M
Resource description:
---
viewer: true
license: cc0-1.0
configs:
- config_name: train-cells
  data_dir: "data/train/cells.lance"
- config_name: train-expression
  data_dir: "data/train/expression.lance"
- config_name: train-genes
  data_dir: "data/train/genes.lance"
- config_name: test-cells
  data_dir: "data/test/cells.lance"
- config_name: test-expression
  data_dir: "data/test/expression.lance"
- config_name: test-genes
  data_dir: "data/test/genes.lance"
language:
- en
tags:
- biology
- chemistry
- RNA
- single-cell
- lance
- slaf
pretty_name: Tahoe-100M
---
# Tahoe-100M Dataset (SLAF Format)
## Attribution
**This is a re-release of data originally generated by [Tahoe Therapeutics](https://huggingface.co/tahoebio).**
* **Original Dataset**: [tahoebio/Tahoe-100M](https://huggingface.co/datasets/tahoebio/Tahoe-100M)
* **Original Format**: Parquet files
* **This Release**: Same data in SLAF (Sparse Lazy Array Format)
* **License**: CC0-1.0 (Creative Commons CC0 1.0 Universal - Public Domain)
* **Original Citation**:
```
@article{zhang2025tahoe,
title={Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling},
author={Zhang, Jesse and Ubas, Airol A and de Borja, Richard and Svensson, Valentine and Thomas, Nicole and Thakar, Neha and Lai, Ian and Winters, Aidan and Khan, Umair and Jones, Matthew G and others},
journal={bioRxiv},
pages={2025--02},
year={2025},
publisher={Cold Spring Harbor Laboratory}
}
```
For detailed information about the dataset, methodology, and original publication, please refer to the [original dataset repository](https://huggingface.co/datasets/tahoebio/Tahoe-100M).
## Dataset Description
Tahoe-100M is a giga-scale single-cell perturbation atlas consisting of over 100 million transcriptomic profiles from 50 cancer cell lines exposed to 1,100 small-molecule perturbations. Generated using Vevo Therapeutics' Mosaic high-throughput platform, Tahoe-100M enables deep, context-aware exploration of gene function, cellular states, and drug responses at unprecedented scale and resolution. This release provides the same data in SLAF format for compatibility with SLAF tools.
## Usage
This dataset is stored in [SLAF (Sparse Lazy Array Format)](https://slaf-project.github.io/slaf/), which uses the [Lance](https://lance.org/) table format for storage. You can read it with the `slafdb` library (imported as `slaf`) for SLAF-level access, or with the `pylance` library (imported as `lance`) for direct access to the Lance tables.
### Using SLAF (Recommended for SLAF Format)
```bash
pip install slafdb
```
```python
from slaf import SLAFArray

hf_path = 'hf://datasets/slaf-project/Tahoe-100M'

# Load train dataset and preview the first 10 rows of the cells table
train_slaf = SLAFArray(f"{hf_path}/data/train")
train_slaf.query("SELECT * FROM cells LIMIT 10")

# Load test dataset and preview the first 10 rows of the cells table
test_slaf = SLAFArray(f"{hf_path}/data/test")
test_slaf.query("SELECT * FROM cells LIMIT 10")
```
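
The same `query` call can presumably be pointed at the other tables. The sketch below builds a preview statement for any of the three table names that appear in the config list (`cells`, `expression`, `genes`). It assumes, without confirmation from this card, that `expression` and `genes` are queryable under those names in the same way as `cells`. `preview_sql` and `preview` are hypothetical helper names used for illustration, not part of `slafdb`.

```python
HF_PATH = "hf://datasets/slaf-project/Tahoe-100M"
TABLES = ("cells", "expression", "genes")  # names taken from the config list


def preview_sql(table: str, limit: int = 10) -> str:
    """Build a small preview query, rejecting unknown table names."""
    if table not in TABLES:
        raise ValueError(f"unknown table {table!r}; expected one of {TABLES}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return f"SELECT * FROM {table} LIMIT {int(limit)}"


def preview(split: str, table: str, limit: int = 10):
    """Run the preview query against one split (hypothetical helper).

    Assumption: every table in TABLES is exposed to SLAFArray.query
    under the same name as its directory in the config list.
    """
    # Imported lazily so that preview_sql works without slafdb installed.
    from slaf import SLAFArray

    return SLAFArray(f"{HF_PATH}/data/{split}").query(preview_sql(table, limit))
```

Restricting `table` to a fixed tuple keeps the interpolated SQL safe, since a table name cannot be passed as a bound parameter.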
### Using Lance Directly
```bash
pip install pylance
```
```python
import lance

hf_path = 'hf://datasets/slaf-project/Tahoe-100M'

# Load the train cells table and draw 10 random rows
train_lance = lance.dataset(f"{hf_path}/data/train/cells.lance")
train_lance.sample(10)

# Load the test cells table and draw 10 random rows
test_lance = lance.dataset(f"{hf_path}/data/test/cells.lance")
test_lance.sample(10)
```
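
The config list in the card metadata pairs each split (`train`, `test`) with three Lance tables (`cells`, `expression`, `genes`). As a minimal sketch, the paths for all six can be derived from that layout instead of being typed out by hand. `table_uri`, `all_table_uris`, and `open_table` are hypothetical helper names introduced here; only `lance.dataset` comes from the example above.

```python
from __future__ import annotations

HF_PATH = "hf://datasets/slaf-project/Tahoe-100M"
SPLITS = ("train", "test")
TABLES = ("cells", "expression", "genes")


def table_uri(split: str, table: str) -> str:
    """Return the Lance URI for one split/table pair from the config list."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    if table not in TABLES:
        raise ValueError(f"unknown table {table!r}; expected one of {TABLES}")
    return f"{HF_PATH}/data/{split}/{table}.lance"


def all_table_uris() -> dict[str, str]:
    """Map each config name (e.g. 'train-cells') to its Lance URI."""
    return {f"{s}-{t}": table_uri(s, t) for s in SPLITS for t in TABLES}


def open_table(split: str, table: str):
    """Open one table with pylance (hypothetical convenience wrapper)."""
    # Imported lazily so the URI helpers work without pylance installed.
    import lance

    return lance.dataset(table_uri(split, table))
```

For example, `open_table("test", "genes").count_rows()` should report the number of rows in the test gene table, assuming network access to the Hugging Face repository.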
Provider: slaf-project



