Rescued Phased VCF for GIAB HG001

Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/7819302
Description:
The GIAB VCF file for HG001 contained anomalies that caused run-time errors in both LRphase and WhatsHap. The file is noncompliant with the VCF 4.3 specification (https://samtools.github.io/hts-specs/VCFv4.3.pdf) in at least two ways: 1) Phase Set (PS) tags within genotype fields contain strings instead of 32-bit integers. 2) The sample column in the VCF column header is labeled “INTEGRATION” rather than with the sample name. We also found ~25,000 records with malformed genotype fields, which contained extra values not defined in the format string. Finally, we encountered errors from WhatsHap suggesting that at least a subset of indel records are also malformed.

Neither program would run successfully until these errors were corrected, but we were able to rescue the VCF using a custom Python script. Briefly, we converted each PS tag string to an integer by concatenating a unique integer with the chromosome number of the record (using 24, 25, and 26 for chromosomes X, Y, and M, respectively). This ensures that every phase set has an integer label and contains only variants from the same chromosome. We defined an additional format tag, OPS, under which we stored the original PS tag values. Likewise, the “INTEGRATION” label in the column header was replaced with the sample name, “HG001”. Variants with malformed genotype fields were rescued by removing the values not defined in the format string. Because we were unable to identify the direct cause of the WhatsHap errors related to indel record parsing, we filtered out all indel and structural variant records, leaving only SNV records in the VCF.
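The rescue steps described above can be sketched in plain Python. This is an illustration only, not the authors' script: the names `chrom_number`, `PhaseSetMapper`, `rescue_record`, and `rescue_header` are hypothetical, and the exact concatenation scheme (a running counter followed by a two-digit chromosome code) is an assumption, since the description does not specify it. The sketch works on raw text lines because a strict VCF parser would likely reject the malformed records.

```python
# Hypothetical sketch of the rescue procedure; not the authors' script.
CHROM_CODES = {"X": 24, "Y": 25, "M": 26, "MT": 26}

# Header line declaring the extra OPS tag (wording of the description is assumed).
OPS_HEADER = ('##FORMAT=<ID=OPS,Number=1,Type=String,'
              'Description="Original PS value before integer conversion">')


def chrom_number(chrom):
    """Map a chromosome name to an integer (X=24, Y=25, M=26)."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    # Raises ValueError for other contigs; this sketch does not handle them.
    return CHROM_CODES[name] if name in CHROM_CODES else int(name)


class PhaseSetMapper:
    """Assign one integer label per (chromosome, original PS string) pair."""

    def __init__(self):
        self._ids = {}

    def to_int(self, chrom, original_ps):
        key = (chrom, original_ps)
        if key not in self._ids:
            # Assumed scheme: unique counter followed by a two-digit
            # chromosome code, e.g. counter 1 on chr1 becomes 101.
            counter = len(self._ids) + 1
            self._ids[key] = counter * 100 + chrom_number(chrom)
        return self._ids[key]


def is_snv(ref, alts):
    """True when REF and every ALT allele is a single base."""
    return all(len(a) == 1 and a in "ACGTN" for a in [ref, *alts])


def rescue_record(line, mapper):
    """Return the repaired record line, or None if it is not an SNV."""
    fields = line.rstrip("\n").split("\t")
    chrom, ref, alts = fields[0], fields[3], fields[4].split(",")
    if not is_snv(ref, alts):
        return None  # indels and structural variants are filtered out
    if len(fields) < 10:
        return "\t".join(fields)  # no genotype columns to repair
    keys = fields[8].split(":")
    has_ps = "PS" in keys
    for i in range(9, len(fields)):
        # Drop values beyond those defined in FORMAT; pad omitted trailing ones.
        values = fields[i].split(":")[:len(keys)]
        values += ["."] * (len(keys) - len(values))
        if has_ps:
            idx = keys.index("PS")
            original = values[idx]
            if original != ".":
                values[idx] = str(mapper.to_int(chrom, original))
            values.append(original)  # keep the original label under OPS
        fields[i] = ":".join(values)
    if has_ps:
        fields[8] = ":".join(keys + ["OPS"])
    return "\t".join(fields)


def rescue_header(line, sample="HG001"):
    """Return the header lines that replace one input header line."""
    text = line.rstrip("\n")
    if text.startswith("#CHROM"):
        cols = [sample if c == "INTEGRATION" else c for c in text.split("\t")]
        return [OPS_HEADER, "\t".join(cols)]
    return [text]
```

Under this assumed scheme, the same PS string on two different chromosomes receives two different integer labels, so no phase set spans chromosomes. With a two-digit chromosome code, the labels stay within a signed 32-bit integer for roughly the first 21 million phase sets.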
Created:
2023-12-04