
slaf-project/Tahoe-100M

Source: Hugging Face. Updated 2026-01-30; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/slaf-project/Tahoe-100M
Description:
---
viewer: true
license: cc0-1.0
configs:
- config_name: train-cells
  data_dir: "data/train/cells.lance"
- config_name: train-expression
  data_dir: "data/train/expression.lance"
- config_name: train-genes
  data_dir: "data/train/genes.lance"
- config_name: test-cells
  data_dir: "data/test/cells.lance"
- config_name: test-expression
  data_dir: "data/test/expression.lance"
- config_name: test-genes
  data_dir: "data/test/genes.lance"
language:
- en
tags:
- biology
- chemistry
- RNA
- single-cell
- lance
- slaf
pretty_name: Tahoe-100M
---

# Tahoe-100M Dataset (SLAF Format)

## Attribution

**This is a re-release of data originally generated by [Tahoe Therapeutics](https://huggingface.co/tahoebio).**

* **Original Dataset**: [tahoebio/Tahoe-100M](https://huggingface.co/datasets/tahoebio/Tahoe-100M)
* **Original Format**: Parquet files
* **This Release**: Same data in SLAF (Sparse Lazy Array Format)
* **License**: CC0-1.0 (Creative Commons CC0 1.0 Universal - Public Domain)
* **Original Citation**:

```
@article{zhang2025tahoe,
  title={Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling},
  author={Zhang, Jesse and Ubas, Airol A and de Borja, Richard and Svensson, Valentine and Thomas, Nicole and Thakar, Neha and Lai, Ian and Winters, Aidan and Khan, Umair and Jones, Matthew G and others},
  journal={bioRxiv},
  pages={2025--02},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}
```

For detailed information about the dataset, methodology, and original publication, please refer to the [original dataset repository](https://huggingface.co/datasets/tahoebio/Tahoe-100M).

## Dataset Description

Tahoe-100M is a giga-scale single-cell perturbation atlas consisting of over 100 million transcriptomic profiles from 50 cancer cell lines exposed to 1,100 small-molecule perturbations. Generated using Vevo Therapeutics' Mosaic high-throughput platform, Tahoe-100M enables deep, context-aware exploration of gene function, cellular states, and drug responses at unprecedented scale and resolution.

This release provides the same data in SLAF format for compatibility with SLAF tools.

## Usage

This dataset is in [SLAF (Sparse Lazy Array Format)](https://slaf-project.github.io/slaf/) format, which uses the [Lance](https://lance.org/) table format for storage. You can use it with the `slafdb` library (for SLAF format), or `pylance` library (for direct Lance access).

### Using SLAF (Recommended for SLAF Format)

```bash
pip install slafdb
```

```python
from slaf import SLAFArray

hf_path = 'hf://datasets/slaf-project/Tahoe-100M'

# Load train dataset
train_slaf = SLAFArray(f"{hf_path}/data/train")
train_slaf.query("SELECT * FROM cells LIMIT 10")

# Load test dataset
test_slaf = SLAFArray(f"{hf_path}/data/test")
test_slaf.query("SELECT * FROM cells LIMIT 10")
```

### Using Lance Directly

```bash
pip install pylance
```

```python
import lance

hf_path = 'hf://datasets/slaf-project/Tahoe-100M'

# Load train dataset
train_lance = lance.dataset(f"{hf_path}/data/train/cells.lance")
train_lance.sample(10)

# Load test dataset
test_lance = lance.dataset(f"{hf_path}/data/test/cells.lance")
test_lance.sample(10)
```
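The six configs listed in the metadata follow one naming pattern (`<split>-<table>` mapping to `data/<split>/<table>.lance`), so the table paths can be built instead of typed by hand. The following is a minimal sketch under that assumption; `lance_uri` and `preview` are hypothetical helper names introduced here for illustration, not part of `slafdb` or `pylance`. The `preview` helper needs `pylance` installed and network access, so it is defined but not called.

```python
HF_ROOT = "hf://datasets/slaf-project/Tahoe-100M"
SPLITS = ("train", "test")
TABLES = ("cells", "expression", "genes")


def lance_uri(split: str, table: str) -> str:
    """Return the Lance table URI for one config, e.g. train-cells.

    Hypothetical helper: it only mirrors the data_dir entries
    listed in the dataset card's configs.
    """
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    if table not in TABLES:
        raise ValueError(f"unknown table {table!r}; expected one of {TABLES}")
    return f"{HF_ROOT}/data/{split}/{table}.lance"


def preview(split: str, table: str, n: int = 10):
    """Open one table with Lance and return its first n rows.

    Hypothetical helper: requires pylance and network access.
    """
    import lance  # imported lazily so lance_uri works without pylance

    dataset = lance.dataset(lance_uri(split, table))
    return dataset.to_table(limit=n)


if __name__ == "__main__":
    # Print every config name next to the URI it resolves to.
    for split in SPLITS:
        for table in TABLES:
            print(f"{split}-{table}: {lance_uri(split, table)}")
```

Calling `preview("train", "genes")` would then read a handful of rows from the genes table without spelling out the full path, and an unknown split or table name fails early with a `ValueError` rather than a remote lookup error.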

Provider:
slaf-project