tobyclark/tf-scores

Name: tobyclark/tf-scores
Creator: tobyclark
Published: 2026-03-31 12:40:31
License: not listed

Source: Hugging Face. Updated 2026-03-31; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/tobyclark/tf-scores

Resource description:

---
pretty_name: Functional Genomics ChIP SNP Scores
task_categories:
- other
language:
- en
license: apache-2.0
size_categories:
- 10M<n<100M
dataset_info:
  features:
  - name: snp
    dtype: string
  - name: chr
    dtype: string
  - name: pos
    dtype: uint32
  - name: ref_allele
    dtype: string
  - name: alt_allele
    dtype: string
  - name: variant_pair_ptr
    dtype: uint64
  - name: variant_output_start_bp
    dtype: int64
  - name: pair_track_idx
    dtype: uint16_or_uint32
  - name: pair_peak_ptr
    dtype: uint64
  - name: peak_score
    dtype: float16
  - name: peak_start_bin
    dtype: uint16
  - name: peak_end_bin
    dtype: uint16
  - name: peak_summit_bin
    dtype: uint16
  - name: center_mask_lfc
    dtype: float16
  - name: target_ids
    dtype: string
  - name: target_labels
    dtype: string
---

# Dataset Card for Borzoi/AlphaGenome ChIP Scores

## Dataset Description

This dataset format stores per-variant Borzoi/AlphaGenome ChIP predictions scored from VCF variants. The HDF5 file is created by `{borzoi_chip or alphagenome_chip}/scripts/score_snps.py`, creating shards for a VCF slice `[start_idx, stop_idx]`, then combining.

The schema combines:

- fixed-size variant metadata,
- sparse pointer-based peak score tables,
- dense per-variant/per-target centre-window log fold-change values,
- run configuration in HDF5 attributes.

Note that the peak information is bin-based, as the output format is 32 bp resolution for Borzoi and 128 bp for AlphaGenome, following their papers.

## Dataset Structure

### Data Files

- Currently contains the `borzoi_scores.h5` file, containing concatenated scores.

## Data Fields

Let:

- `n_variants = 19,534,182` - total number of SNPs/indels.
- `n_targets = 901` - total number of TFs across HepG2 and K562.

### Variant metadata (fixed size)

- `snp`: `S50`, shape `(n_variants,)`, variant ID.
- `chr`: `S10`, shape `(n_variants,)`, chromosome.
- `pos`: `uint32`, shape `(n_variants,)`, 1-based genomic coordinate for the variant.
- `ref_allele`: `S100`, shape `(n_variants,)`, reference allele.
- `alt_allele`: `S100`, shape `(n_variants,)`, alternate allele.

### Variant-level pointers (fixed size)

A variant row corresponds to all `(variant, target track)` pairs for a given variant.

- `variant_pair_ptr`: `uint64`, shape `(n_variants + 1,)`.
  - Variant `v` maps to pair rows in `[variant_pair_ptr[v], variant_pair_ptr[v + 1])`.
- `variant_output_start_bp`: `int64`, shape `(n_variants,)`.
  - Genomic coordinate of output bin 0 for variant `v`.

### Pair-level sparse table (variable size)

A pair row corresponds to a `(variant, target-track)` pair with at least one retained peak.

- `pair_track_idx`: `uint16` or `uint32`, shape `(n_pairs,)`.
  - Target track index for each pair row. It identifies the TF/cell type pair for the peak or set of peaks.
- `pair_peak_ptr`: `uint64`, shape `(n_pairs + 1,)`.
  - Pair `p` maps to peak rows in `[pair_peak_ptr[p], pair_peak_ptr[p + 1])`.
  - These are the row indices of the retained peaks for that pair.

### Peak-level sparse table (variable size)

Contains scores and location information for the peaks. The bin fields can be used with `variant_output_start_bp` to recover genomic coordinates (see Coordinate Reconstruction).

- `peak_score`: `float16`, shape `(n_peaks,)`.
  - Score is `log2(sum_alt + 1) - log2(sum_ref + 1)` over the peak interval, where `sum_alt` and `sum_ref` are the summed signals for the alternate and reference alleles.
- `peak_start_bin`: `uint16`, shape `(n_peaks,)`, inclusive start bin for the peak.
- `peak_end_bin`: `uint16`, shape `(n_peaks,)`, exclusive end bin for the peak.
- `peak_summit_bin`: `uint16`, shape `(n_peaks,)`, bin location of the peak summit.

### Dense centre-window scores (fixed size)

- `center_mask_lfc`: `float16`, shape `(n_variants, n_targets)`.
  - Per-variant/per-target centre-window log fold-change: `log2(sum_alt_center + 1) - log2(sum_ref_center + 1)`, where `sum_alt_center` and `sum_ref_center` are the summed centre-window signals for the alternate and reference alleles.
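
The two score definitions above share one formula, applied either to a peak interval or to the centre window. A minimal sketch of that formula follows, assuming per-bin predictions are available as plain sequences of non-negative numbers. The function name `log_fold_change` and the toy track values are hypothetical, and whether `score_snps.py` computes the sums in exactly this way is not confirmed by the card.

```python
import math

def log_fold_change(ref_bins, alt_bins, start_bin, end_bin):
    """Return log2(sum_alt + 1) - log2(sum_ref + 1) over bins [start_bin, end_bin).

    ref_bins / alt_bins: per-bin predictions for the reference and alternate
    allele (hypothetical inputs; the dataset stores only the resulting score).
    """
    sum_ref = sum(ref_bins[start_bin:end_bin])
    sum_alt = sum(alt_bins[start_bin:end_bin])
    return math.log2(sum_alt + 1.0) - math.log2(sum_ref + 1.0)

# Made-up example: the alternate allele triples the signal inside the peak.
ref_bins = [0.0, 1.0, 2.0, 1.0, 0.0]
alt_bins = [0.0, 3.0, 6.0, 3.0, 0.0]

# Peak occupies bins 1..3 (end bin 4 is exclusive):
# sum_ref = 4 and sum_alt = 12, so the score is log2(13) - log2(5).
score = log_fold_change(ref_bins, alt_bins, 1, 4)
print(score)
```

The `+ 1` pseudocount keeps the score finite when either sum is zero. Because the stored values are `float16`, they carry roughly three significant decimal digits.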

### Target metadata (fixed size)

- `target_ids`: byte strings, shape `(n_targets,)` - numeric track IDs.
- `target_labels`: byte strings, shape `(n_targets,)` - track labels in the format "TF (cell type)".

### Run configuration metadata (h5 attributes)

- `score_type` - semantic score type.
- `score_definitiion` - method/equation for scoring.
- `peak_threshold` - fold change threshold for peak calling.
- `threshold_space` - "linear_unscaled" means that transformations in preprocessing were reversed before peak scoring.
- `peak_union` - indicates whether peaks were called on reference, alternative or combined.
- `split_prominence` - proportion of max peak height required to split overlapping peaks.
- `min_peak_bins` - minimum number of bins to form a peak.
- `min_sub_peak_bins` - minimum number of bins to form a peak after splitting.
- `min_abs_peak_lfc` - minimum score change for a peak to be included in the output file.
- `max_peak_bins` - maximum size of peak; the script uses the middle of the peak if it is too large.
- `center_mask_bins` - number of bins to use for centre mask LFC calculation.
- `model_resolution` - resolution of model output predictions (32 bp).
- `model_crop` - the crop at each end of input sequence length to produce the target length.

## Pointer Semantics

For a variant index `v`:

1. Pair range is `p in [variant_pair_ptr[v], variant_pair_ptr[v + 1])`.
2. Pair `p` uses track `pair_track_idx[p]`.
3. Peak range for `p` is `k in [pair_peak_ptr[p], pair_peak_ptr[p + 1])`.
4. Peak properties are read from `peak_score[k]`, `peak_start_bin[k]`, `peak_end_bin[k]`, and `peak_summit_bin[k]`.
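
The four steps above amount to a two-level offset-array lookup, similar to a CSR layout. The sketch below walks it with small in-memory lists whose names mirror the dataset fields. All values are made up for illustration, and the helper `peaks_for_variant` is hypothetical. The same index arithmetic should apply to arrays read from the HDF5 file.

```python
# Toy stand-ins for the HDF5 datasets (values are invented for illustration).
variant_pair_ptr = [0, 2, 2, 3]      # 3 variants; variant 1 has no retained pairs
pair_track_idx = [5, 17, 5]          # target track index for each pair row
pair_peak_ptr = [0, 1, 3, 4]         # 3 pair rows pointing into 4 peak rows
peak_score = [0.25, -0.5, 0.125, 1.0]
peak_start_bin = [10, 40, 52, 7]
peak_end_bin = [14, 45, 60, 9]
peak_summit_bin = [12, 42, 55, 8]

def peaks_for_variant(v):
    """Yield (track_idx, peak_row) for every retained peak of variant v."""
    # Step 1: pair rows belonging to variant v (half-open range).
    for p in range(variant_pair_ptr[v], variant_pair_ptr[v + 1]):
        # Step 2: the target track for this pair row.
        track = pair_track_idx[p]
        # Step 3: peak rows belonging to pair p (half-open range).
        for k in range(pair_peak_ptr[p], pair_peak_ptr[p + 1]):
            yield track, k

n_variants = len(variant_pair_ptr) - 1
rows = {v: list(peaks_for_variant(v)) for v in range(n_variants)}

# Step 4: read peak properties by peak row index k.
for v, pairs in rows.items():
    for track, k in pairs:
        print(v, track, peak_score[k], peak_start_bin[k], peak_end_bin[k], peak_summit_bin[k])
```

A variant with no retained peaks has `variant_pair_ptr[v] == variant_pair_ptr[v + 1]`, so the outer loop body never runs for it. Variant 1 in the toy data shows this case.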

## Coordinate Reconstruction

Using attributes `model_resolution` and `model_crop`, for variant `v` and peak `k`:

- `start_bp = variant_output_start_bp[v] + peak_start_bin[k] * model_resolution`
- `end_bp = variant_output_start_bp[v] + peak_end_bin[k] * model_resolution`
- `summit_bp = variant_output_start_bp[v] + peak_summit_bin[k] * model_resolution`

## Dataset Creation

Generated by `{borzoi_chip or alphagenome_chip}/scripts/score_snps.py` from:

- a trained ChIP model,
- reference genome sequence,
- a VCF of variants,
- target metadata table.
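
The coordinate reconstruction formulas given earlier can be sketched as a small helper. This is an illustration only: the function name `peak_interval_bp` and the example values are hypothetical, and `model_resolution = 32` follows the Borzoi value stated in the card. The card does not say whether `summit_bp` denotes the start or the centre of the summit bin, so the sketch applies the formula exactly as written.

```python
def peak_interval_bp(output_start_bp, start_bin, end_bin, summit_bin, model_resolution):
    """Convert bin indices to genomic coordinates using the card's formulas.

    Inputs are cast to Python int so narrow integer types read from the file
    cannot overflow during the multiplication.
    """
    base = int(output_start_bp)
    res = int(model_resolution)
    start_bp = base + int(start_bin) * res
    end_bp = base + int(end_bin) * res       # end bin is exclusive
    summit_bp = base + int(summit_bin) * res
    return start_bp, end_bp, summit_bp

# Made-up example: output bin 0 starts at 1,000,000 and the peak spans bins 10..13.
model_resolution = 32
start_bp, end_bp, summit_bp = peak_interval_bp(1_000_000, 10, 14, 12, model_resolution)
print(start_bp, end_bp, summit_bp)  # 1000320 1000448 1000384
```

If the bin fields are read as NumPy `uint16` arrays, multiplying them by the resolution without widening first can wrap around for large bin indices. Casting to a wider integer type before the multiplication avoids this.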

