perturbai/wholebrain_crispr_atlas
Source: Hugging Face · Updated 2026-03-20 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/perturbai/wholebrain_crispr_atlas
Resource description:
---
language: en
license: cc-by-4.0
extra_gated_prompt: "To access this landmark in vivo atlas, you must provide an official corporate or academic email address. Requests from personal domains (@gmail.com, @hotmail.com, @qq.com, etc.) will be automatically denied."
extra_gated_fields:
  First and Last Name: text
  Company or Academic Affiliation: text
  Official Institutional Email: text
  I confirm I have provided an official institutional email and understand personal email requests will be rejected: checkbox
tags:
- biology
- genomics
- CRISPR
- Perturb-seq
- single-cell
- neuroscience
- causal-ai
- perturbai
pretty_name: PerturbAI Brain-Wide Functional Genomics Atlas (v1.0)
size_categories:
- 1M<n<10M
configs:
- config_name: default
  data_files:
  - split: train
    path: data/*.parquet
  default: true
- config_name: gene_metadata
  data_files:
  - split: train
    path: metadata/gene_metadata.parquet
---
# PerturbAI Brain-Wide In Vivo CRISPR Atlas
This dataset is a landmark in functional genomics: spanning nearly 8 million single cells profiled in living tissue across hundreds of distinct neuronal cell types, it is the most expansive in vivo functional genomics resource created to date. By mapping gene function at this scale, our platform provides a foundation for the next generation of AI-driven therapeutic discovery.
**Manuscript:** [“Genome-scale functional mapping of the mammalian whole brain with in vivo Perturb-seq”](https://www.biorxiv.org/content/10.64898/2026.03.16.711480v1) on bioRxiv
**Summary:** Check out our blog - [www.perturb.ai/news](https://www.perturb.ai/news)
**Data:** Download the full dataset on Hugging Face
**Analysis:** Explore the dataset with the [NVIDIA AI Blueprint for Single-Cell Analysis](https://build.nvidia.com/nvidia/single-cell-analysis). The blueprint uses scverse’s RAPIDS-singlecell on an RTX PRO 6000 Blackwell Workstation Edition GPU and helped PerturbAI cut analysis time from days to near real-time.
<p align="center">
<img src="assets/neighborhood_umap_square.png" alt="8M brain cells with 2000 gene knockouts" width="500">
</p>
---
## **Dataset Description**
Using large-scale CRISPR screening and single-nucleus RNA sequencing, we’ve built a functional map of the mouse brain’s genome. By measuring the effects of nearly 2,000 disease-linked genes in their native environment, we’ve revealed the molecular logic of the neuronal circuits underlying neurodegenerative, psychiatric, and metabolic diseases.
### **Key Highlights:**
- **Scale:** 7.7 million cells, with single-nucleus profiling data across 19,070 mRNAs and 8,588 sgRNAs.
- **Resolution:** Brain-wide coverage, capturing gene function across hundreds of cell types in vivo.
- **Causality:** Moving beyond correlation to causal inference through large-scale, parallelized perturbations.
---
## **Data Structure & Formats**
To support diverse workflows, this repository includes:
| Format | File/Folder | Primary Use Case |
| :--------------------- | :----------------------------------------------------------- | :------------------------------------------------------------------------------------------------------- |
| **Parquet (cells)** | `data/*.parquet` | Distributed per-cell expression and metadata for scalable analytics and ML pipelines. |
| **Parquet (metadata)** | `metadata/all_obs.parquet`, `metadata/gene_metadata.parquet` | Curated cell-level and gene-level metadata tables. |
| **AnnData shards** | `h5ads/*.h5ad` | Per-channel AnnData files for Scanpy/scvi-tools/Seurat/SingleCellExperiment workflows. |
| **Zarr archive (LFS)** | `analysis/preprocessed_gex.zarr.tar.gz` | For [NVIDIA AI Blueprint for Single-Cell Analysis](https://build.nvidia.com/nvidia/single-cell-analysis) |
| **Misc** | `analysis/2603_shi_manuscript/*` | Data related to reproducing figures in our manuscript. See [github.com/jinlabneurogenomics/wholebrainperturbseq](https://github.com/jinlabneurogenomics/wholebrainperturbseq) |
---
## **Metadata Columns**
The following columns describe per-cell metadata fields used across the atlas:
| Column | Description |
| :-------------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `batch` | Represents a single Flex-pool of samples. |
| `scp_name` | Identifier for the 10x channel where a batch was processed; each batch was processed on multiple 10x channels. |
| `source` | Biological source (mouse) for this cell. |
| `sex` | Mouse sex (`M` or `F`). |
| `sample_label` | Distinguishes samples from the same source (commonly left `L` and right `R` hemisphere samples). |
| `num_rna_umi` | Number of detected RNA UMIs in this cell. |
| `num_genes` | Number of unique genes detected in this cell. |
| `pct_mt` | Percent of UMIs coming from mitochondrial genes. |
| `scDblFinder.class` | Doublet call from scDblFinder (`singlet` or `doublet`). |
| `scDblFinder.score` | Doublet score from scDblFinder (0-1; values near 1 indicate higher doublet likelihood). |
| `log_ambient_mse` | Log MSE of each cell relative to channel-average expression across genes (see methods in publication). |
| `log_ambient_mse_norm` | `log_ambient_mse` normalized by expected log MSE under a binomial sampling assumption (see methods in publication). |
| `gene_target` | Gene(s) knocked out in this cell: `gene`, `gene1\|gene2\|...`, `Non_target` (non-targeting guide), or `Negative` (no sufficiently detected guide). |
| `num_guides` | Number of guides detected at or above a 3 UMI threshold in this cell. |
| `guide_call` | List of detected guides, separated by `\|` when multiple; reports `Negative` if no guide is detected. |
| `guide_umis` | Total number of guide UMIs detected in this cell. |
| `guide_umi_top` | Guide UMI count for the most highly detected guide in this cell. |
| `guide_umi_second` | Guide UMI count for the second-most highly detected guide in this cell. |
| `predicted_group` | Custom group definition for this study, created by aggregating predicted subclasses (see publication). |
| `predicted_class` | Predicted class from MapMyCells using Allen Institute Whole Mouse Brain Taxonomy. |
| `predicted_class_probability` | Predicted class probability from MapMyCells using Allen Institute Whole Mouse Brain Taxonomy. |
| `predicted_subclass` | Predicted subclass from MapMyCells using Allen Institute Whole Mouse Brain Taxonomy. |
| `predicted_subclass_probability` | Predicted subclass probability from MapMyCells using Allen Institute Whole Mouse Brain Taxonomy. |
| `predicted_supertype` | Predicted supertype from MapMyCells using Allen Institute Whole Mouse Brain Taxonomy. |
| `predicted_supertype_probability` | Predicted supertype probability from MapMyCells using Allen Institute Whole Mouse Brain Taxonomy. |
| `predicted_cluster` | Predicted cluster from MapMyCells using Allen Institute Whole Mouse Brain Taxonomy. |
| `predicted_cluster_probability` | Predicted cluster probability from MapMyCells using Allen Institute Whole Mouse Brain Taxonomy. |
| `neuron_type` | From Allen Institute Whole Mouse Brain Taxonomy; derived from predicted subclass (`nt_type`). |
| `neighborhood` | From Allen Institute Whole Mouse Brain Taxonomy; derived from predicted subclass. |
| `region_level1` | From Allen Institute Whole Mouse Brain Taxonomy; coarse grouping of the `region_level2` assignment. |
| `region_level2` | From Allen Institute Whole Mouse Brain Taxonomy; derived from predicted cluster (highest region in `CCF_broad.freq`). |
| `cluster` | Cluster ID from unsupervised clustering; primarily used for QC and to identify additional doublet clusters missed by scDblFinder. |
| `passes_qc` | Boolean QC flag: `num_genes >= 2000`, `scDblFinder.class == "singlet"`, `log_ambient_mse_norm > 0.09`, and `cluster` not in `{"1", "17", "2", "3", "57", "6", "83", "NA"}`. |
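The `passes_qc` rule in the last row can be recomputed from its component columns, for example to check the shipped flag or to experiment with different thresholds. The snippet below is a minimal pandas sketch, not the authors' pipeline: the helper name `recompute_passes_qc` is hypothetical, the demo table is synthetic, and it assumes `cluster` is stored as string IDs with missing values either null or the literal `"NA"`.

```python
import pandas as pd

# Cluster IDs excluded by the QC rule described in the table above.
EXCLUDED_CLUSTERS = {"1", "17", "2", "3", "57", "6", "83", "NA"}


def recompute_passes_qc(obs: pd.DataFrame) -> pd.Series:
    """Rebuild the QC flag from its component columns (hypothetical helper)."""
    # Assumption: cluster IDs are strings; treat null values as "NA".
    cluster = obs["cluster"].astype("object").fillna("NA").astype(str)
    return (
        (obs["num_genes"] >= 2000)
        & (obs["scDblFinder.class"] == "singlet")
        & (obs["log_ambient_mse_norm"] > 0.09)
        & ~cluster.isin(EXCLUDED_CLUSTERS)
    )


# Synthetic rows for illustration only; these are not real atlas values.
obs = pd.DataFrame(
    {
        "num_genes": [2500, 1500, 3000, 2600, 2000, 2700],
        "scDblFinder.class": ["singlet", "singlet", "doublet", "singlet", "singlet", "singlet"],
        "log_ambient_mse_norm": [0.20, 0.30, 0.25, 0.15, 0.05, 0.20],
        "cluster": ["4", "4", "4", "17", "4", None],
    }
)
flags = recompute_passes_qc(obs)
print(flags.tolist())
```

On the real data, the same function could be applied to the table read from `metadata/all_obs.parquet` and compared against the stored `passes_qc` column, provided the column types match the assumptions above.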
---
## **How to Use**
### **Hugging Face Datasets**
```python
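from datasets import load_dataset

# The dataset is gated: request access on Hugging Face and authenticate first.
# Load the default config defined in the dataset card (data/*.parquet).
# streaming=True reads rows lazily instead of downloading every shard up front.
ds = load_dataset("perturbai/wholebrain_crispr_atlas", split="train", streaming=True)

# Inspect the available fields on the first cell
first_row = next(iter(ds))
print(first_row.keys())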
```
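Because a streaming dataset yields one dictionary per cell, rows can be filtered lazily without materializing the full atlas. The following is a minimal sketch in plain Python: the helper name `iter_target_cells` is hypothetical, the demo rows are synthetic, and it assumes each streamed row exposes the `gene_target` and `passes_qc` metadata columns (worth confirming against `first_row.keys()`).

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, Optional


def iter_target_cells(
    rows: Iterable[Dict[str, Any]], gene: str, limit: Optional[int] = None
) -> Iterator[Dict[str, Any]]:
    """Lazily yield QC-passing rows whose gene_target is exactly `gene`."""
    matches = (
        row
        for row in rows
        if row.get("passes_qc") and row.get("gene_target") == gene
    )
    # islice stops pulling rows once `limit` matches were found (None = no limit).
    return islice(matches, limit)


# Synthetic rows standing in for streamed cells; not real atlas values.
demo_rows = [
    {"gene_target": "Grin2a", "passes_qc": True, "num_genes": 3100},
    {"gene_target": "Grin2a", "passes_qc": False, "num_genes": 900},
    {"gene_target": "Non_target", "passes_qc": True, "num_genes": 2800},
    {"gene_target": "Grin2a", "passes_qc": True, "num_genes": 2400},
]
selected = list(iter_target_cells(demo_rows, "Grin2a", limit=1))
print(selected)
```

Passing the streaming dataset in place of `demo_rows`, for example `iter_target_cells(ds, "Grin2a", limit=100)`, would stop reading once 100 matching cells have been found.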
### **AnnData**
```python
import glob

import anndata
from anndata.experimental import AnnCollection

# Open all h5ad shards in backed mode and wrap them in one collection
paths = sorted(glob.glob("h5ads/*.h5ad"))
adatas = [anndata.read_h5ad(path, backed="r") for path in paths]
collection = AnnCollection(adatas)
print("# Cells:", collection.n_obs)

# Load a subset of cells from disk into an AnnData object
# (exact match: QC-passing cells whose gene_target is exactly "Grin2a")
ad_grin2a = collection[
    (collection.obs["gene_target"] == "Grin2a")
    & (collection.obs["passes_qc"])
].to_adata()
```
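The equality filter above selects only cells whose `gene_target` is exactly `Grin2a`. Since the metadata table notes that multi-target cells are encoded as `gene1|gene2|...`, a small parser can help when those cells should be included as well. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the stored separator is assumed to be a bare `|` (the backslash in the table is a Markdown escape), and `Grin2a|Grin2b` is an illustrative value rather than one confirmed to occur in the data.

```python
from typing import List

# Labels that mark control or unassigned cells rather than a knocked-out gene.
CONTROL_LABELS = {"Non_target", "Negative"}


def parse_gene_target(value: str) -> List[str]:
    """Split a gene_target string into gene symbols; controls give an empty list."""
    if value in CONTROL_LABELS:
        return []
    return [gene for gene in value.split("|") if gene]


def targets_gene(value: str, gene: str) -> bool:
    """Return True if `gene` is among the targets encoded in `value`."""
    return gene in parse_gene_target(value)


# Illustrative values only
for example in ["Grin2a", "Grin2a|Grin2b", "Non_target", "Negative"]:
    print(example, parse_gene_target(example), targets_gene(example, "Grin2a"))
```

With the collection above, a mask such as `collection.obs["gene_target"].map(lambda v: targets_gene(v, "Grin2a"))` could replace the equality test, at the cost of mixing single and combinatorial knockouts in one selection.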
Provider:
perturbai



