
tobyclark/tf-scores

Source: Hugging Face. Updated 2026-03-31. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/tobyclark/tf-scores
Resource description:
---
pretty_name: Functional Genomics ChIP SNP Scores
task_categories:
- other
language:
- en
license: apache-2.0
size_categories:
- 10M<n<100M
dataset_info:
  features:
  - name: snp
    dtype: string
  - name: chr
    dtype: string
  - name: pos
    dtype: uint32
  - name: ref_allele
    dtype: string
  - name: alt_allele
    dtype: string
  - name: variant_pair_ptr
    dtype: uint64
  - name: variant_output_start_bp
    dtype: int64
  - name: pair_track_idx
    dtype: uint16_or_uint32
  - name: pair_peak_ptr
    dtype: uint64
  - name: peak_score
    dtype: float16
  - name: peak_start_bin
    dtype: uint16
  - name: peak_end_bin
    dtype: uint16
  - name: peak_summit_bin
    dtype: uint16
  - name: center_mask_lfc
    dtype: float16
  - name: target_ids
    dtype: string
  - name: target_labels
    dtype: string
---

# Dataset Card for Borzoi/AlphaGenome ChIP Scores

## Dataset Description

This dataset format stores per-variant Borzoi/AlphaGenome ChIP predictions scored from VCF variants. The HDF5 file is created by `{borzoi_chip or alphagenome_chip}/scripts/score_snps.py`, which creates shards for each VCF slice `[start_idx, stop_idx]` and then combines them.

The schema combines:

- fixed-size variant metadata,
- sparse pointer-based peak score tables,
- dense per-variant/per-target centre-window log fold-change values,
- run configuration in HDF5 attributes.

Peak information is stored in bins because the model output resolution is 32 bp for Borzoi and 128 bp for AlphaGenome, following their respective papers.

## Dataset Structure

### Data Files

- Currently contains `borzoi_scores.h5`, which holds the concatenated scores.

## Data Fields

Let:

- `n_variants = 19,534,182` - total number of SNPs/indels.
- `n_targets = 901` - total number of TFs across HepG2 and K562.

### Variant metadata (fixed size)

- `snp`: `S50`, shape `(n_variants,)`, variant ID.
- `chr`: `S10`, shape `(n_variants,)`, chromosome.
- `pos`: `uint32`, shape `(n_variants,)`, 1-based genomic coordinate for the variant.
- `ref_allele`: `S100`, shape `(n_variants,)`, reference allele.
- `alt_allele`: `S100`, shape `(n_variants,)`, alternate allele.

### Variant-level pointers (fixed size)

A variant row corresponds to all `(variant, target track)` pairs for a given variant.

- `variant_pair_ptr`: `uint64`, shape `(n_variants + 1,)`.
  - Variant `v` maps to pair rows in `[variant_pair_ptr[v], variant_pair_ptr[v + 1])`.
- `variant_output_start_bp`: `int64`, shape `(n_variants,)`.
  - Genomic coordinate of output bin 0 for variant `v`.

### Pair-level sparse table (variable size)

A pair row corresponds to a `(variant, target track)` pair with at least one retained peak.

- `pair_track_idx`: `uint16` or `uint32`, shape `(n_pairs,)`.
  - Target track index for each pair row. It identifies the TF/cell type pair for the peak or set of peaks.
- `pair_peak_ptr`: `uint64`, shape `(n_pairs + 1,)`.
  - Pair `p` maps to peak rows in `[pair_peak_ptr[p], pair_peak_ptr[p + 1])`.
  - These are the indices of the selected peaks.

### Peak-level sparse table (variable size)

Contains scores and location information for the peaks. It can be used with `variant_output_start_bp` to reconstruct genomic coordinates (see Coordinate Reconstruction).

- `peak_score`: `float16`, shape `(n_peaks,)`.
  - Score is `log2(sum_alt + 1) - log2(sum_ref + 1)` over the peak interval, where `sum_alt` and `sum_ref` are the summed signals for the alternate and reference alleles.
- `peak_start_bin`: `uint16`, shape `(n_peaks,)`, inclusive start bin for the peak.
- `peak_end_bin`: `uint16`, shape `(n_peaks,)`, exclusive end bin for the peak.
- `peak_summit_bin`: `uint16`, shape `(n_peaks,)`, bin location of the peak summit.

### Dense centre-window scores (fixed size)

- `center_mask_lfc`: `float16`, shape `(n_variants, n_targets)`.
  - Per-variant/per-target centre-window log fold-change: `log2(sum_alt_center + 1) - log2(sum_ref_center + 1)`, where `sum_alt_center` and `sum_ref_center` are the summed centre-window signals for the alternate and reference alleles.
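
Both `peak_score` and `center_mask_lfc` use the same log fold-change formula, differing only in the window that is summed. The sketch below evaluates that formula on made-up summed signals; the function name `log_fold_change` and the input values are illustrative assumptions, not part of `score_snps.py`.

```python
import math


def log_fold_change(sum_ref: float, sum_alt: float) -> float:
    """Evaluate log2(sum_alt + 1) - log2(sum_ref + 1) as given in the card."""
    return math.log2(sum_alt + 1.0) - math.log2(sum_ref + 1.0)


# Hypothetical summed signals, chosen so the arithmetic is easy to check by hand.
print(log_fold_change(sum_ref=3.0, sum_alt=7.0))  # log2(8) - log2(4) = 1.0
print(log_fold_change(sum_ref=7.0, sum_alt=3.0))  # log2(4) - log2(8) = -1.0
print(log_fold_change(sum_ref=0.0, sum_alt=0.0))  # pseudocount keeps this finite: 0.0
```

A positive value means the alternate allele has the larger summed signal and a negative value means the reference allele does. The `+ 1` pseudocount keeps the score defined when either sum is zero.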
### Target metadata (fixed size)

- `target_ids`: byte strings, shape `(n_targets,)` - numeric track IDs
- `target_labels`: byte strings, shape `(n_targets,)` - track labels in the format "TF (cell type)"

### Run configuration metadata (h5 attributes)

- `score_type` - semantic score type
- `score_definitiion` - method/equation for scoring
- `peak_threshold` - fold change threshold for peak calling
- `threshold_space` - "linear_unscaled" means that transformations in preprocessing were reversed before peak scoring.
- `peak_union` - indicates whether peaks were called on reference, alternative or combined
- `split_prominence` - proportion of max peak height required to split overlapping peaks
- `min_peak_bins` - minimum number of bins to form a peak
- `min_sub_peak_bins` - minimum number of bins to form a peak after splitting
- `min_abs_peak_lfc` - minimum score change for a peak to be included in the output file.
- `max_peak_bins` - maximum size of peak - script uses the middle of the peak if it is too large.
- `center_mask_bins` - number of bins to use for centre mask LFC calculation
- `model_resolution` - resolution of model output predictions (32bp)
- `model_crop` - the crop at each end of input sequence length to produce the target length.

## Pointer Semantics

For a variant index `v`:

1. Pair range is `p in [variant_pair_ptr[v], variant_pair_ptr[v + 1])`.
2. Pair `p` uses track `pair_track_idx[p]`.
3. Peak range for `p` is `k in [pair_peak_ptr[p], pair_peak_ptr[p + 1])`.
4. Peak properties are read from `peak_score[k]`, `peak_start_bin[k]`, `peak_end_bin[k]`, and `peak_summit_bin[k]`.
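
The four steps above amount to a two-level pointer walk: variant to pairs, then pair to peaks. The following is a minimal sketch of that walk on a toy table. The helper `peaks_for_variant`, the `tables` mapping, and all values in `toy` are hypothetical; only the field names come from the card.

```python
def peaks_for_variant(tables, v):
    """Return (track_idx, score, start_bin, end_bin, summit_bin) for each peak of variant v.

    `tables` is any mapping from field name to an indexable array.
    """
    rows = []
    pair_lo = int(tables["variant_pair_ptr"][v])
    pair_hi = int(tables["variant_pair_ptr"][v + 1])
    for p in range(pair_lo, pair_hi):                # step 1: pair range for v
        track = int(tables["pair_track_idx"][p])     # step 2: track used by pair p
        peak_lo = int(tables["pair_peak_ptr"][p])
        peak_hi = int(tables["pair_peak_ptr"][p + 1])
        for k in range(peak_lo, peak_hi):            # step 3: peak range for p
            rows.append((                            # step 4: peak properties
                track,
                float(tables["peak_score"][k]),
                int(tables["peak_start_bin"][k]),
                int(tables["peak_end_bin"][k]),
                int(tables["peak_summit_bin"][k]),
            ))
    return rows


# Toy data (not from the dataset): 3 variants, 3 pair rows, 4 peak rows.
toy = {
    "variant_pair_ptr": [0, 2, 2, 3],   # variant 1 has no retained pairs
    "pair_track_idx": [4, 17, 4],
    "pair_peak_ptr": [0, 1, 3, 4],      # pair 1 holds two peaks
    "peak_score": [0.5, -0.25, 0.75, 1.5],
    "peak_start_bin": [10, 20, 40, 5],
    "peak_end_bin": [14, 26, 44, 9],
    "peak_summit_bin": [12, 22, 41, 7],
}

print(peaks_for_variant(toy, 0))
# [(4, 0.5, 10, 14, 12), (17, -0.25, 20, 26, 22), (17, 0.75, 40, 44, 41)]
print(peaks_for_variant(toy, 1))  # []
print(peaks_for_variant(toy, 2))  # [(4, 1.5, 5, 9, 7)]
```

A variant with no retained peaks has equal consecutive entries in `variant_pair_ptr`, so its pair range is empty and the walk returns nothing. The explicit `int(...)` and `float(...)` conversions are a design choice so the same function works whether the arrays are plain lists or fixed-width numeric arrays.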
## Coordinate Reconstruction

Using attributes `model_resolution` and `model_crop`, for variant `v` and peak `k`:

- `start_bp = variant_output_start_bp[v] + peak_start_bin[k] * model_resolution`
- `end_bp = variant_output_start_bp[v] + peak_end_bin[k] * model_resolution`
- `summit_bp = variant_output_start_bp[v] + peak_summit_bin[k] * model_resolution`

## Dataset Creation

Generated by `{borzoi_chip or alphagenome_chip}/scripts/score_snps.py` from:

- a trained ChIP model,
- reference genome sequence,
- a VCF of variants,
- target metadata table.
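
As a worked instance of the Coordinate Reconstruction formulas above, the sketch below converts one peak's bin indices to base-pair coordinates. The helper name `peak_coordinates` and the example numbers are assumptions for illustration; the resolution of 32 is the Borzoi value stated in the card.

```python
def peak_coordinates(output_start_bp, start_bin, end_bin, summit_bin, model_resolution):
    """Apply the three reconstruction formulas and return (start_bp, end_bp, summit_bp).

    Inputs are cast to Python int first so that narrow integer types
    (the bins are stored as uint16) cannot wrap around when multiplied.
    """
    origin = int(output_start_bp)
    res = int(model_resolution)
    start_bp = origin + int(start_bin) * res
    end_bp = origin + int(end_bin) * res
    summit_bp = origin + int(summit_bin) * res
    return start_bp, end_bp, summit_bp


# Hypothetical peak: output bin 0 sits at 1,000,000 bp and the peak spans bins [10, 14).
start_bp, end_bp, summit_bp = peak_coordinates(
    output_start_bp=1_000_000,
    start_bin=10,
    end_bin=14,
    summit_bin=12,
    model_resolution=32,
)
print(start_bp, end_bp, summit_bp)  # 1000320 1000448 1000384
print(end_bp - start_bp)            # 4 bins * 32 bp = 128
```

Because `peak_start_bin` is inclusive and `peak_end_bin` is exclusive, the reconstructed interval is half-open, and its width is the number of bins times `model_resolution`.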

Provider:
tobyclark