Example datasets for the SNP2GPS software
Source: DataCite Commons. Updated: 2026-05-05. Indexed: 2026-05-07.
Download link:
https://zenodo.org/doi/10.5281/zenodo.20034272
Description:
This dataset accompanies the SNP2GPS software and provides both real-world and synthetic example datasets for genotype-to-geolocation prediction using SNP data. It includes genotypic and geographic metadata for 22,627 barley (Hordeum vulgare) genebank accessions, as well as a synthetic benchmark dataset designed for pipeline testing, validation, and reproducibility.
barley_dataset/
This folder contains real-world genotypic and geographic data from barley genebank accessions.
Genotypic Data
220208_BRIDGE_maf001_geno01.zarr: Unfiltered SNP dataset stored in Zarr format, enabling scalable and efficient access to large genomic datasets. Contains the full set of SNP markers prior to downstream filtering.
genotype_data_barley.npz: Filtered SNP dataset stored in NumPy compressed format (.npz). Includes quality-controlled SNPs (e.g., filtered by minor allele frequency) and is optimized for direct use in SNP2GPS workflows.
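Before wiring an .npz archive into a pipeline, it can help to list the arrays it holds. The following is a minimal sketch and not part of SNP2GPS. The array names `genotypes` and `sample_ids` used in the demo are hypothetical, because the actual keys inside genotype_data_barley.npz are not documented in this description.

```python
import os
import tempfile

import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype_string)} for every array in an .npz file."""
    summary = {}
    # allow_pickle=False refuses to unpickle object arrays from an untrusted file.
    with np.load(path, allow_pickle=False) as archive:
        for name in archive.files:
            array = archive[name]
            summary[name] = (array.shape, str(array.dtype))
    return summary


# Demo on a tiny stand-in archive. The key names are hypothetical, not the
# documented layout of genotype_data_barley.npz.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_genotypes.npz")
    np.savez_compressed(
        demo_path,
        genotypes=np.zeros((4, 10), dtype=np.int8),
        sample_ids=np.array(["s1", "s2", "s3", "s4"]),
    )
    demo_summary = summarize_npz(demo_path)

for name, (shape, dtype) in demo_summary.items():
    print(name, shape, dtype)
```

Pointing `summarize_npz` at the downloaded file would show the real array names and shapes. If the archive stores object arrays, loading with `allow_pickle=False` raises a `ValueError`, which is a useful signal to inspect the file more carefully before enabling pickling.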
Geographic Metadata
bridge_gbs_long_lat_outliers_corrected_22627.txt: Passport-derived latitude and longitude coordinates for all available accessions. Geographic outliers have been removed to improve data consistency.
bridge_gbs_long_lat_outliers_corrected_22627_centroids.txt: Augmented geographic dataset including cleaned passport coordinates (outliers removed) and coordinates imputed from country centroids for missing or unreliable entries. This dataset ensures complete spatial coverage for modeling tasks.
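The centroid fallback described for the second file can be illustrated with a short sketch. This shows the general idea only and is not the procedure the dataset authors used. The record fields, country codes, and centroid values below are hypothetical placeholders.

```python
def impute_with_centroids(records, centroids):
    """Fill missing coordinates with a country centroid.

    records:   list of dicts with keys 'id', 'country', 'lat', 'lon'
               (hypothetical layout); 'lat'/'lon' are None when missing.
    centroids: dict mapping a country code to a (lat, lon) tuple.
    Returns new records with an added 'source' field set to
    'passport', 'centroid', or 'missing'.
    """
    filled = []
    for rec in records:
        out = dict(rec)  # copy so the input list is left untouched
        if rec["lat"] is not None and rec["lon"] is not None:
            out["source"] = "passport"
        elif rec["country"] in centroids:
            out["lat"], out["lon"] = centroids[rec["country"]]
            out["source"] = "centroid"
        else:
            out["source"] = "missing"
        filled.append(out)
    return filled


# Placeholder inputs: "AA" and "BB" are made-up country codes and the
# centroid coordinates are illustrative, not real values.
demo_centroids = {"AA": (10.0, 20.0)}
demo_records = [
    {"id": "acc1", "country": "AA", "lat": 11.5, "lon": 21.5},
    {"id": "acc2", "country": "AA", "lat": None, "lon": None},
    {"id": "acc3", "country": "BB", "lat": None, "lon": None},
]
demo_filled = impute_with_centroids(demo_records, demo_centroids)
for rec in demo_filled:
    print(rec["id"], rec["lat"], rec["lon"], rec["source"])
```

Keeping a `source` field makes it possible to separate passport coordinates from centroid-imputed ones later, for example when checking whether imputed accessions dominate the prediction error.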
synthetic/
This folder contains a simulated dataset for testing, benchmarking, and validating SNP2GPS pipelines under controlled conditions.
Synthetic Genotypic Data
clean_latlon_strong_1000samples_20000snps_FOR_PIPELINE.npz: Synthetic SNP dataset containing 1,000 samples and 20,000 SNPs. Generated to exhibit a strong genotype–geography signal, making it suitable for validating model performance and pipeline correctness.
Synthetic Metadata
clean_latlon_strong_1000samples_metadata.tsv: Corresponding metadata file containing simulated geographic coordinates (latitude and longitude) for each synthetic sample. Designed to align perfectly with the synthetic genotype dataset for reproducible analyses.
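Because the metadata is meant to line up with the genotype samples, a small check that reorders coordinates to the genotype sample order, and fails loudly on gaps, can catch mismatches early. This is a sketch under assumptions: the column names `sample_id`, `latitude`, and `longitude` are hypothetical and may differ from the headers in the actual TSV file.

```python
import csv
import io


def align_metadata(sample_ids, tsv_file, id_col="sample_id",
                   lat_col="latitude", lon_col="longitude"):
    """Return (lat, lon) tuples ordered like sample_ids.

    tsv_file is any open text file object with a tab-separated header row.
    The default column names are assumptions, not the documented schema.
    """
    coords = {}
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        coords[row[id_col]] = (float(row[lat_col]), float(row[lon_col]))
    missing = [s for s in sample_ids if s not in coords]
    if missing:
        raise KeyError(f"no coordinates for samples: {missing}")
    return [coords[s] for s in sample_ids]


# Demo with an in-memory TSV whose rows are deliberately out of order.
demo_tsv = (
    "sample_id\tlatitude\tlongitude\n"
    "S2\t-5.5\t30.0\n"
    "S1\t12.25\t-3.0\n"
    "S3\t0.0\t100.5\n"
)
demo_order = ["S1", "S2", "S3"]
demo_coords = align_metadata(demo_order, io.StringIO(demo_tsv))
print(demo_coords)
```

With the real file, `tsv_file` would be the result of `open(path, newline="")`, and `sample_ids` would come from whatever array in the genotype archive stores the sample order.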
Validation / Ground Truth
known_swapped_cases_truth_table.csv: Ground truth table defining known swapped or intentionally misassigned samples within the synthetic dataset. This file enables benchmarking of error detection methods, validation of SNP2GPS predictions, and evaluation of model robustness to sample mislabeling.
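One way to use such a truth table is to score a set of flagged samples with precision and recall. The sketch below is an illustration rather than the SNP2GPS evaluation procedure. It assumes the truth table has a `sample_id` column listing the swapped samples, which is a hypothetical schema, and the flagged IDs are made up.

```python
import csv
import io


def read_truth_ids(csv_file, id_col="sample_id"):
    """Collect the IDs of known swapped samples (column name is an assumption)."""
    return {row[id_col] for row in csv.DictReader(csv_file)}


def score_swap_detection(flagged_ids, truth_ids):
    """Compare flagged samples with the known swapped samples."""
    flagged, truth = set(flagged_ids), set(truth_ids)
    true_pos = len(flagged & truth)
    return {
        "true_positives": true_pos,
        "false_positives": len(flagged - truth),
        "false_negatives": len(truth - flagged),
        # Guard against division by zero when nothing is flagged or known.
        "precision": true_pos / len(flagged) if flagged else 0.0,
        "recall": true_pos / len(truth) if truth else 0.0,
    }


# Demo with an in-memory truth table and made-up detector output.
demo_truth_csv = "sample_id\nS003\nS007\nS010\n"
demo_truth = read_truth_ids(io.StringIO(demo_truth_csv))
demo_flagged = ["S003", "S010", "S042"]
demo_scores = score_swap_detection(demo_flagged, demo_truth)
print(demo_scores)
```

Reporting false positives and false negatives alongside the two ratios keeps the result interpretable when the number of swapped samples is small.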
Provider:
Zenodo
Created:
2026-05-05



