tobyclark/tf-scores
Source: Hugging Face. Updated 2026-03-31; indexed 2026-04-12.
Download link: https://hf-mirror.com/datasets/tobyclark/tf-scores
Resource description:
---
pretty_name: Functional Genomics ChIP SNP Scores
task_categories:
- other
language:
- en
license: apache-2.0
size_categories:
- 10M<n<100M
dataset_info:
  features:
  - name: snp
    dtype: string
  - name: chr
    dtype: string
  - name: pos
    dtype: uint32
  - name: ref_allele
    dtype: string
  - name: alt_allele
    dtype: string
  - name: variant_pair_ptr
    dtype: uint64
  - name: variant_output_start_bp
    dtype: int64
  - name: pair_track_idx
    dtype: uint16_or_uint32
  - name: pair_peak_ptr
    dtype: uint64
  - name: peak_score
    dtype: float16
  - name: peak_start_bin
    dtype: uint16
  - name: peak_end_bin
    dtype: uint16
  - name: peak_summit_bin
    dtype: uint16
  - name: center_mask_lfc
    dtype: float16
  - name: target_ids
    dtype: string
  - name: target_labels
    dtype: string
---
# Dataset Card for Borzoi/AlphaGenome ChIP Scores
## Dataset Description
This dataset format stores per-variant Borzoi/AlphaGenome ChIP predictions scored from VCF variants.
The HDF5 file is created by `{borzoi_chip or alphagenome_chip}/scripts/score_snps.py`, which scores a VCF slice `[start_idx, stop_idx]` into a shard; the shards are then combined into one file.
The schema combines:
- fixed-size variant metadata,
- sparse pointer-based peak score tables,
- dense per-variant/per-target centre-window log fold-change values,
- run configuration in HDF5 attributes.
Peak locations are stored as bin indices because the model output resolution is 32 bp for Borzoi and 128 bp for AlphaGenome, following their papers.
## Dataset Structure
### Data Files
- Currently contains `borzoi_scores.h5`, which holds the concatenated scores.
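A quick way to check a downloaded file against the schema described in this card is to list each dataset's shape and dtype. The sketch below assumes the `h5py` package is installed and that all datasets sit at the root of the file; neither is stated in this card. The helper names `open_scores` and `describe_datasets` are illustrative only.

```python
def open_scores(path="borzoi_scores.h5"):
    """Open the score file read-only (assumes h5py is installed)."""
    import h5py  # imported lazily so describe_datasets has no hard dependency
    return h5py.File(path, "r")


def describe_datasets(h5_file):
    """Map each root-level dataset name to (shape, dtype string).

    Works on an open h5py.File or on any mapping whose values expose
    .shape and .dtype; entries without a shape (e.g. groups) are skipped.
    """
    return {
        name: (tuple(obj.shape), str(obj.dtype))
        for name, obj in h5_file.items()
        if hasattr(obj, "shape")
    }


# Possible usage, assuming the file is in the working directory:
#
#     with open_scores() as f:
#         for name, (shape, dtype) in describe_datasets(f).items():
#             print(name, shape, dtype)
```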
## Data Fields
Let:
- `n_variants = 19,534,182` - total number of SNPs/indels.
- `n_targets = 901` - total number of TF target tracks across HepG2 and K562.
### Variant metadata (fixed size)
- `snp`: `S50`, shape `(n_variants,)`, variant ID.
- `chr`: `S10`, shape `(n_variants,)`, chromosome.
- `pos`: `uint32`, shape `(n_variants,)`, 1-based genomic coordinate for the variant.
- `ref_allele`: `S100`, shape `(n_variants,)`, reference allele.
- `alt_allele`: `S100`, shape `(n_variants,)`, alternate allele.
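Because these fields are fixed-width byte strings, they usually need decoding before they can be compared with text from a VCF. The following is a minimal sketch, assuming the values are ASCII and may carry NUL padding; `decode_field`, `variant_key`, and the `chr:pos:ref:alt` key layout are illustrative choices, not part of the dataset.

```python
def decode_field(raw):
    """Decode one fixed-width byte string, dropping any NUL padding."""
    return bytes(raw).rstrip(b"\x00").decode("ascii")


def variant_key(chrom, pos, ref, alt):
    """Build a 'chr:pos:ref:alt' key from one row of variant metadata."""
    return f"{decode_field(chrom)}:{int(pos)}:{decode_field(ref)}:{decode_field(alt)}"


# Toy row with made-up values (not taken from the dataset).
example_key = variant_key(b"chr1\x00\x00", 12345, b"A", b"G")
print(example_key)
```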
### Variant-level pointers (fixed size)
A variant row corresponds to all `(variant, target track)` pairs for a given variant.
- `variant_pair_ptr`: `uint64`, shape `(n_variants + 1,)`.
  - Variant `v` maps to pair rows in `[variant_pair_ptr[v], variant_pair_ptr[v + 1])`.
- `variant_output_start_bp`: `int64`, shape `(n_variants,)`.
  - Genomic coordinate of output bin 0 for variant `v`.
### Pair-level sparse table (variable size)
A pair row corresponds to a `(variant, target-track)` with at least one retained peak.
- `pair_track_idx`: `uint16` or `uint32`, shape `(n_pairs,)`.
  - Target track index for each pair row, identifying the TF/cell type pair for the peak or set of peaks.
- `pair_peak_ptr`: `uint64`, shape `(n_pairs + 1,)`.
  - Pair `p` maps to peak rows in `[pair_peak_ptr[p], pair_peak_ptr[p + 1])`.
  - These rows are the indices of the retained peaks in the peak-level table.
### Peak-level sparse table (variable size)
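Contains scores and location information for the peaks. The bin indices can be combined with `variant_output_start_bp` to recover genomic coordinates (see Coordinate Reconstruction).
- `peak_score`: `float16`, shape `(n_peaks,)`.
  - Score is `log2(sum_alt + 1) - log2(sum_ref + 1)` over the peak interval, where `sum_alt` and `sum_ref` are the summed signal for the alternate and reference alleles.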
- `peak_start_bin`: `uint16`, shape `(n_peaks,)`, inclusive start bin for the peak.
- `peak_end_bin`: `uint16`, shape `(n_peaks,)`, exclusive end bin for the peak.
- `peak_summit_bin`: `uint16`, shape `(n_peaks,)`, bin location of the peak summit.
### Dense centre-window scores (fixed size)
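- `center_mask_lfc`: `float16`, shape `(n_variants, n_targets)`.
  - Per-variant/per-target centre-window log fold-change: `log2(sum_alt_center + 1) - log2(sum_ref_center + 1)`, where `sum_alt_center` and `sum_ref_center` are the summed centre-window signal for the alternate and reference alleles.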
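Both `peak_score` and `center_mask_lfc` use the same pseudocount log fold-change form, differing only in the interval that is summed. A small worked example may help with reading the sign and scale. The function name and the input sums below are made up for illustration, and stored values are additionally rounded to `float16`.

```python
import math


def log_fold_change(sum_alt, sum_ref):
    """log2(sum_alt + 1) - log2(sum_ref + 1), i.e. an LFC with a +1 pseudocount."""
    return math.log2(sum_alt + 1.0) - math.log2(sum_ref + 1.0)


# Toy sums (made up): the alternate allele doubles (signal + 1).
gain = log_fold_change(7.0, 3.0)   # log2(8) - log2(4)
loss = log_fold_change(3.0, 7.0)   # same magnitude, opposite sign
flat = log_fold_change(0.0, 0.0)   # the pseudocount keeps zero signal finite
print(gain, loss, flat)
```

A positive score means the alternate allele is predicted to increase signal relative to the reference, and a negative score means a decrease.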
### Target metadata (fixed size)
- `target_ids`: byte strings, shape `(n_targets,)`, numeric track IDs.
- `target_labels`: byte strings, shape `(n_targets,)`, track labels in the format "TF (cell type)".
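Since `target_labels` follows the "TF (cell type)" pattern, a label can be split into its two parts for filtering by factor or cell line. This is a sketch under the assumption that labels are UTF-8 and that the cell type is the final parenthesised group; `parse_target_label` is a hypothetical helper and the example label is illustrative rather than read from the file.

```python
def parse_target_label(raw):
    """Split a b'TF (cell type)' label into (tf, cell_type)."""
    label = bytes(raw).rstrip(b"\x00").decode("utf-8")
    # Split on the last " (" so TF names containing parentheses still parse.
    tf, sep, rest = label.rpartition(" (")
    if not sep or not rest.endswith(")"):
        raise ValueError(f"unexpected label format: {label!r}")
    return tf, rest[:-1]


# Illustrative label, not necessarily present in the dataset.
example_tf, example_cell = parse_target_label(b"CTCF (HepG2)")
print(example_tf, example_cell)
```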
### Run configuration metadata (h5 attributes)
- `score_type` - semantic score type
- `score_definitiion` - method/equation for scoring
- `peak_threshold` - fold change threshold for peak calling
- `threshold_space` - "linear_unscaled" means that transformations in preprocessing were reversed before peak scoring.
- `peak_union` - indicates whether peaks were called on the reference allele, the alternate allele, or both combined
- `split_prominence` - proportion of max peak height required to split overlapping peaks
- `min_peak_bins` - minimum number of bins to form a peak
- `min_sub_peak_bins` - minimum number of bins to form a peak after splitting
- `min_abs_peak_lfc` - minimum score change for a peak to be included in the output file.
- `max_peak_bins` - maximum peak size in bins; if a peak is larger, the script uses the middle of the peak.
- `center_mask_bins` - number of bins to use for centre mask LFC calculation
- `model_resolution` - resolution of model output predictions (32bp)
- `model_crop` - the crop at each end of input sequence length to produce the target length.
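The run configuration can be copied into a plain dictionary so that later steps (for example coordinate reconstruction) do not need the open file. The sketch assumes the attributes are attached to the root of the HDF5 file, which this card does not state, and that string values may come back as bytes; `read_run_config` is a hypothetical helper.

```python
def read_run_config(attrs):
    """Copy HDF5 attributes into a plain dict, decoding byte strings."""
    config = {}
    for key, value in attrs.items():
        if isinstance(value, bytes):
            value = value.decode("utf-8")
        config[key] = value
    return config


# Stand-in for `f.attrs` of an open h5py.File, with two of the keys listed above.
toy_config = read_run_config(
    {"model_resolution": 32, "threshold_space": b"linear_unscaled"}
)
print(toy_config)
```

With a real file opened through `h5py`, the call would be `read_run_config(f.attrs)`.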
## Pointer Semantics
For a variant index `v`:
1. Pair range is `p in [variant_pair_ptr[v], variant_pair_ptr[v + 1])`.
2. Pair `p` uses track `pair_track_idx[p]`.
3. Peak range for `p` is `k in [pair_peak_ptr[p], pair_peak_ptr[p + 1])`.
4. Peak properties are read from `peak_score[k]`, `peak_start_bin[k]`, `peak_end_bin[k]`, and `peak_summit_bin[k]`.
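The two pointer arrays behave like nested CSR-style offsets, so the steps above can be sketched as a pair of loops. The function name and the toy tables are made up for illustration; the inputs can be lists, NumPy arrays, or anything else indexable. When reading from HDF5, loading the pointer arrays into memory first is likely to be much faster than indexing the file element by element.

```python
def iter_variant_peaks(v, variant_pair_ptr, pair_track_idx, pair_peak_ptr):
    """Yield (track_index, peak_index) for every retained peak of variant v."""
    for p in range(int(variant_pair_ptr[v]), int(variant_pair_ptr[v + 1])):
        track = int(pair_track_idx[p])
        for k in range(int(pair_peak_ptr[p]), int(pair_peak_ptr[p + 1])):
            yield track, k


# Toy tables (made up): 3 variants, 3 pairs, 4 peaks.
toy_variant_pair_ptr = [0, 2, 2, 3]  # variant 0: pairs 0-1; variant 1: none; variant 2: pair 2
toy_pair_track_idx = [5, 17, 5]      # track index of each pair row
toy_pair_peak_ptr = [0, 1, 3, 4]     # pair 0: peak 0; pair 1: peaks 1-2; pair 2: peak 3

hits_v0 = list(iter_variant_peaks(0, toy_variant_pair_ptr, toy_pair_track_idx, toy_pair_peak_ptr))
hits_v1 = list(iter_variant_peaks(1, toy_variant_pair_ptr, toy_pair_track_idx, toy_pair_peak_ptr))
print(hits_v0, hits_v1)
```

A variant with no retained peaks simply has an empty pair range, as variant 1 does in the toy tables.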
## Coordinate Reconstruction
Using attributes `model_resolution` and `model_crop`, for variant `v` and peak `k`:
- `start_bp = variant_output_start_bp[v] + peak_start_bin[k] * model_resolution`
- `end_bp = variant_output_start_bp[v] + peak_end_bin[k] * model_resolution`
- `summit_bp = variant_output_start_bp[v] + peak_summit_bin[k] * model_resolution`
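The formulas above translate directly into code. In the sketch below, `peak_coords_bp` is a hypothetical helper and the input numbers are made up. The explicit `int(...)` casts are a precaution: if the bin indices are read as NumPy `uint16` values, multiplying them by the resolution can overflow before the addition happens.

```python
def peak_coords_bp(output_start_bp, start_bin, end_bin, summit_bin, model_resolution):
    """Convert peak bin indices to genomic coordinates in base pairs.

    Returns (start_bp, end_bp, summit_bp). Because peak_end_bin is
    exclusive, end_bp is the exclusive end of the peak interval.
    """
    origin = int(output_start_bp)
    res = int(model_resolution)
    return (
        origin + int(start_bin) * res,
        origin + int(end_bin) * res,
        origin + int(summit_bin) * res,
    )


# Toy peak (made up): output bin 0 at 1000 bp, bins [2, 5), summit in bin 3, 32 bp bins.
start_bp, end_bp, summit_bp = peak_coords_bp(1000, 2, 5, 3, 32)
print(start_bp, end_bp, summit_bp)
```

Under these formulas `summit_bp` is the coordinate at which the summit bin begins, not the centre of that bin.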
## Dataset Creation
Generated by `{borzoi_chip or alphagenome_chip}/scripts/score_snps.py` from:
- a trained ChIP model,
- reference genome sequence,
- a VCF of variants,
- target metadata table.
Provider: tobyclark



