AlphaGenome embedding probing: data products for cardiac chromatin accessibility analysis
Source: DataCite Commons. Updated 2026-05-04; indexed 2026-05-07.
Download link:
https://zenodo.org/doi/10.5281/zenodo.20022208
Resource description:
This record archives the data products generated for the manuscript "Linear probing of AlphaGenome embeddings recovers tissue-specific regulatory state without retraining" (Astudillo, 2026).
The archive (~5 GB) contains everything needed to reproduce all figures and tables in the manuscript without re-running AlphaGenome's forward predictions (~25 hours of compute) or embedding extraction (~10 hours).
CONTENTS
Per chromosome (chr1, chr11, chr22):
- predict/: AlphaGenome DNase head predictions at 128 bp resolution (parquet) and the corresponding track metadata (JSON)
- labels/: Per-bin binary peak-overlap labels for the nine training biosamples and three held-out biosamples (parquet)
- peaks/: ENCODE narrowPeak BED files for the nine training biosamples, filtered to that chromosome
- peaks_holdout/: ENCODE peak files for the three held-out biosamples (cardiac myoblast CL:0010021, substantia nigra UBERON:0002038, NCI-H510A EFO:0006693), filtered to that chromosome, with raw downloads preserved for provenance
- results/: All result JSON outputs from the linear-probe pipeline (within-chromosome AUROC, per-tissue specificity, embedding-vs-AlphaGenome-heads comparison, held-out biosample generalization, distributed-encoding analysis)
H3K27ac modality replication (chr1_h3k27ac, chr11_h3k27ac, chr22_h3k27ac):
- labels/: Per-bin H3K27ac peak-overlap labels
- peaks/: ENCODE H3K27ac narrow-peak BED files
- results/: Cardiac-vs-non-cardiac probe results on the H3K27ac task
Top-level:
- track_metadata/: AlphaGenome's official track metadata catalog (DNase, ATAC, RNA-seq, ChIP-seq, splicing) and the held-out-candidate selection summary
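The peaks/ and labels/ entries describe narrowPeak files filtered to one chromosome and per-bin binary peak-overlap labels. A minimal sketch of that relationship follows, assuming the label bins match the 128 bp resolution of predict/ and that a bin is positive when any peak overlaps it by at least one base. The function names, the overlap rule, and the example records are illustrative assumptions, not the pipeline's confirmed definitions; only the narrowPeak column order (chrom, start, end as the first three tab-separated fields, half-open coordinates) comes from the BED format itself.

```python
BIN_SIZE = 128  # assumed to match the 128 bp resolution of predict/


def read_peaks(lines, chrom):
    """Return sorted (start, end) intervals for one chromosome.

    `lines` is any iterable of narrowPeak text lines (tab-separated,
    BED6+4); only the first three columns are used here.
    """
    peaks = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and optional header lines
        fields = line.split("\t")
        if fields[0] == chrom:
            peaks.append((int(fields[1]), int(fields[2])))
    return sorted(peaks)


def bin_labels(peaks, region_start, n_bins, bin_size=BIN_SIZE):
    """Return one 0/1 label per bin: 1 if any peak overlaps the bin.

    Assumption: "peak overlap" means at least one shared base; the
    actual pipeline may use a different threshold.
    """
    labels = [0] * n_bins
    region_end = region_start + n_bins * bin_size
    for start, end in peaks:
        # Clip the peak to the labelled region; skip if nothing remains.
        s = max(start, region_start)
        e = min(end, region_end)
        if s >= e:
            continue
        first = (s - region_start) // bin_size
        last = (e - 1 - region_start) // bin_size  # end is exclusive
        for i in range(first, last + 1):
            labels[i] = 1
    return labels


if __name__ == "__main__":
    # Made-up records for illustration only (not taken from the archive).
    example = [
        "chr22\t100\t130\tpeak1\t500\t.\t8.1\t5.2\t3.3\t12",
        "chr1\t0\t128\tpeak2\t400\t.\t6.0\t4.1\t2.9\t40",
        "chr22\t500\t640\tpeak3\t900\t.\t9.7\t7.0\t5.5\t60",
    ]
    peaks = read_peaks(example, "chr22")
    print(bin_labels(peaks, region_start=0, n_bins=6))  # prints [1, 1, 0, 1, 1, 0]
```

ENCODE downloads are typically gzip-compressed; passing a handle from `gzip.open(path, "rt")` as `lines` would work with the same function.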
NOT INCLUDED
Raw AlphaGenome 3,072-dimensional per-bin embedding parquets (~40 GB) are not included because they are deterministic from AlphaGenome's published weights and the genomic coordinates. They can be regenerated using 02_extract_embeddings.py from the accompanying code repository (see ASSOCIATED RECORDS below) given access to AlphaGenome.
DATA PROVENANCE
ENCODE narrowPeak BED files were downloaded from the ENCODE portal (https://www.encodeproject.org). Specific accessions for each biosample and chromosome are listed in the manuscript Methods and embedded in the peaks/ summary.json files. The AlphaGenome model used to produce the predict/ outputs is the all-fold ensemble published by Avsec et al. 2025 and obtainable from https://github.com/google-deepmind/alphagenome.
REPRODUCING THE MANUSCRIPT FIGURES AND TABLES
1. Download and extract this archive.
2. Clone the code repository (see ASSOCIATED RECORDS below).
3. Update the configs/ YAML files in the code repo to point at the extracted data directory, or symlink the data tree to the expected default location.
4. Run the figure_*.py and make_tables.py scripts.
Full step-by-step instructions are in the README.md of the code repository.
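As a rough illustration of steps 1 and 4, the following Python sketch checks that the extracted data tree has the per-chromosome directories listed under CONTENTS and then lists (or runs) the figure and table scripts. It is a hedged outline, not the repository's own runner: it assumes the chromosome directories sit directly under the extracted data directory, and that figure_*.py and make_tables.py live in the repository root and need no command-line arguments. The helper names missing_dirs, plan_commands, and run_all are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Directory names taken from the CONTENTS section; their exact nesting
# inside the archive is an assumption.
CHROMS = ("chr1", "chr11", "chr22")
SUBDIRS = ("predict", "labels", "peaks", "peaks_holdout", "results")


def missing_dirs(data_dir):
    """Return the expected per-chromosome directories that are absent."""
    root = Path(data_dir)
    return [
        str(Path(chrom) / sub)
        for chrom in CHROMS
        for sub in SUBDIRS
        if not (root / chrom / sub).is_dir()
    ]


def plan_commands(repo_dir):
    """List one command per figure_*.py script, then make_tables.py."""
    repo = Path(repo_dir).resolve()  # absolute paths survive a cwd change
    scripts = sorted(repo.glob("figure_*.py"))
    tables = repo / "make_tables.py"
    if tables.exists():
        scripts.append(tables)
    return [[sys.executable, str(script)] for script in scripts]


def run_all(repo_dir, dry_run=True):
    """Print each planned command; execute it unless dry_run is set."""
    for cmd in plan_commands(repo_dir):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing script.
            subprocess.run(cmd, cwd=repo_dir, check=True)


if __name__ == "__main__":
    # Dry run against the current directory: nothing is executed.
    run_all(".", dry_run=True)
```

Running with `dry_run=True` first shows which scripts would be executed before committing to the full run.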
ASSOCIATED RECORDS
- Code: [10.5281/zenodo.20022389]
- AlphaGenome: Avsec et al. (2025), Nature, doi: [https://doi.org/10.1038/s41586-025-10014-0]
Provider:
Zenodo
Created:
2026-05-04



