AlphaGenome embedding probing: data products for cardiac chromatin accessibility analysis

Source: DataCite Commons; updated 2026-05-04; indexed 2026-05-07
Download link:
https://zenodo.org/doi/10.5281/zenodo.20022208
Description:
This record archives the data products generated for the manuscript "Linear probing of AlphaGenome embeddings recovers tissue-specific regulatory state without retraining" (Astudillo, 2026). The archive (~5 GB) contains everything needed to reproduce all figures and tables in the manuscript without re-running AlphaGenome's forward predictions (~25 hours of compute) or embedding extraction (~10 hours).

CONTENTS

Per chromosome (chr1, chr11, chr22):
  - predict/          AlphaGenome DNase head predictions at 128 bp
                      resolution (parquet) and the corresponding
                      track metadata (JSON)
  - labels/           Per-bin binary peak-overlap labels for the
                      nine training biosamples and three held-out
                      biosamples (parquet)
  - peaks/            ENCODE narrowPeak BED files for the nine
                      training biosamples, filtered to chromosome
  - peaks_holdout/    ENCODE peak files for the three held-out
                      biosamples (cardiac myoblast CL:0010021,
                      substantia nigra UBERON:0002038, NCI-H510A
                      EFO:0006693), filtered to chromosome and with
                      raw downloads preserved for provenance
  - results/          All result JSON outputs from the linear-probe
                      pipeline (within-chromosome AUROC, per-tissue
                      specificity, embedding-vs-AlphaGenome-heads
                      comparison, held-out biosample generalization,
                      distributed-encoding analysis)

H3K27ac modality replication (chr1_h3k27ac, chr11_h3k27ac, chr22_h3k27ac):
  - labels/           Per-bin H3K27ac peak-overlap labels
  - peaks/            ENCODE H3K27ac narrowPeak BED files
  - results/          Cardiac-vs-non-cardiac probe results on the
                      H3K27ac task

Top-level:
  - track_metadata/   AlphaGenome's official track metadata catalog
                      (DNase, ATAC, RNA-seq, ChIP-seq, splicing) and
                      the held-out-candidate selection summary

NOT INCLUDED

Raw AlphaGenome 3,072-dimensional per-bin embedding parquets (~40 GB) are not included because they are deterministic from AlphaGenome's published weights and the genomic coordinates. They can be regenerated using 02_extract_embeddings.py from the accompanying code repository (see ASSOCIATED RECORDS below) given access to AlphaGenome.

DATA PROVENANCE

ENCODE narrowPeak BED files were downloaded from the ENCODE portal (https://www.encodeproject.org). Specific accessions for each biosample and chromosome are listed in the manuscript Methods and embedded in the peaks/ summary.json files. The AlphaGenome model used to produce the predict/ outputs is the all-fold ensemble published by Avsec et al. 2025 and obtainable from https://github.com/google-deepmind/alphagenome.

REPRODUCING THE MANUSCRIPT FIGURES AND TABLES

1. Download and extract this archive.
2. Clone the code repository (see ASSOCIATED RECORDS below).
3. Update the configs/ YAML files in the code repo to point at the
   extracted data directory, or symlink the data tree to the expected
   default location.
4. Run the figure_*.py and make_tables.py scripts.

Full step-by-step instructions are in the README.md of the code repository.

ASSOCIATED RECORDS

- Code: [10.5281/zenodo.20022389]
- AlphaGenome: Avsec et al. (2025), Nature, doi: [https://doi.org/10.1038/s41586-025-10014-0]
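
Before pointing the configs/ YAML files at an extracted copy, it may be worth checking that the directory tree matches the CONTENTS listing above. The following is a minimal sketch, not part of the archive or the code repository. It assumes the extracted archive places the chromosome directories (chr1, chr11, chr22 and their _h3k27ac counterparts) and track_metadata/ directly under one root directory; the actual nesting may differ, and the helper names expected_dirs and missing_dirs are hypothetical.

```python
from pathlib import Path

# Directory names are taken from the record's CONTENTS section.
# ASSUMPTION: they sit directly under the extraction root; the real
# archive may nest them differently.
CHROMOSOMES = ("chr1", "chr11", "chr22")
DNASE_SUBDIRS = ("predict", "labels", "peaks", "peaks_holdout", "results")
H3K27AC_SUBDIRS = ("labels", "peaks", "results")
TOP_LEVEL = ("track_metadata",)


def expected_dirs(root: Path) -> list[Path]:
    """List every directory the record description says the archive contains."""
    dirs: list[Path] = []
    for chrom in CHROMOSOMES:
        # DNase task: five subdirectories per chromosome.
        dirs += [root / chrom / sub for sub in DNASE_SUBDIRS]
        # H3K27ac replication: three subdirectories per chromosome.
        dirs += [root / f"{chrom}_h3k27ac" / sub for sub in H3K27AC_SUBDIRS]
    dirs += [root / name for name in TOP_LEVEL]
    return dirs


def missing_dirs(root: Path) -> list[Path]:
    """Return the expected directories that are absent under `root`."""
    return [d for d in expected_dirs(root) if not d.is_dir()]
```

Calling missing_dirs(Path("path/to/extracted/archive")) returns an empty list when every listed directory is present; any paths it does return indicate either an incomplete extraction or a layout that differs from the assumption above.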
Provider:
Zenodo
Created:
2026-05-04