pgs-catalog
Source: Hugging Face | Updated: 2026-03-13 | Indexed: 2026-04-23
Download link:
https://huggingface.co/datasets/just-dna-seq/pgs-catalog
Description:
The PGS Catalog dataset is a complete mirror of the PGS Catalog scoring files, converted to Apache Parquet and accompanied by cleaned, normalized metadata tables. It is built automatically by the `just-prs` pipeline and is tagged for tabular classification and regression tasks in biology, genomics, and health research. The dataset contains 5,268 scoring-file Parquet files and metadata for 5,251 unique PGS IDs, totals 51.6 GB, and uses the GRCh38 genome build. There are three metadata tables: `scores.parquet` (score metadata), `performance.parquet` (performance evaluations), and `best_performance.parquet` (the best performance evaluation selected for each score). The scoring files contain variant-level effect weights and harmonized genomic positions. Cleaning steps include column renaming, genome build normalization, metric string parsing, performance flattening, and best-performance selection. The dataset's usage examples show how to load the metadata tables and scoring files with Polars. The data is sourced from the PGS Catalog (EBI / NHGRI), is licensed under CC BY 4.0, and cites the relevant literature.
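The description mentions loading the tables with Polars but does not reproduce the example. The following is a minimal sketch, not the dataset's own example: it builds a direct-download URL using the Hugging Face Hub's standard `resolve/<revision>/` URL pattern and passes it to `polars.read_parquet`. The helper names `hf_dataset_url` and `load_table` are hypothetical, and the location of each file inside the repository is not stated here, so any path passed in is an assumption to check against the repository's file listing.

```python
from urllib.parse import quote

# Repository ID taken from the download link on this page.
REPO_ID = "just-dna-seq/pgs-catalog"

# The three metadata tables named in the description.
METADATA_TABLES = ("scores.parquet", "performance.parquet", "best_performance.parquet")


def hf_dataset_url(path_in_repo: str, repo_id: str = REPO_ID, revision: str = "main") -> str:
    """Build a direct-download URL for a file in a Hugging Face dataset repo.

    Hypothetical helper: path_in_repo is whatever path the file has in the
    repository, which this page does not specify.
    """
    clean_path = quote(path_in_repo.lstrip("/"))  # keeps "/" separators intact
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{clean_path}"


def load_table(path_in_repo: str):
    """Read one Parquet file from the repo into a Polars DataFrame.

    Requires network access and the polars package. Polars is imported lazily
    so that the URL helper above works without it installed.
    """
    import polars as pl

    return pl.read_parquet(hf_dataset_url(path_in_repo))
```

For example, `load_table("<directory>/best_performance.parquet")` would fetch the best-performance table once `<directory>` is replaced with the actual folder in the repository. Given the 51.6 GB total size, loading individual scoring files on demand is likely more practical than downloading the whole dataset.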
Created:
2026-03-03



