
Example datasets for the SNP2GPS software

Source: DataCite Commons. Updated 2026-05-05; indexed 2026-05-07.
Download link:
https://zenodo.org/doi/10.5281/zenodo.20034272
Description:
This dataset accompanies the SNP2GPS software and provides both real-world and synthetic example datasets for genotype-to-geolocation prediction using SNP data. It includes genotypic and geographic metadata for 22,627 barley (Hordeum vulgare) genebank accessions, as well as a synthetic benchmark dataset designed for pipeline testing, validation, and reproducibility.

barley_dataset/
This folder contains real-world genotypic and geographic data from barley genebank accessions.

Genotypic Data

220208_BRIDGE_maf001_geno01.zarr
Unfiltered SNP dataset stored in Zarr format, enabling scalable and efficient access to large genomic datasets. Contains the full set of SNP markers prior to downstream filtering.

genotype_data_barley.npz
Filtered SNP dataset stored in NumPy compressed format (.npz). Includes quality-controlled SNPs (e.g., filtered by minor allele frequency) and is optimized for direct use in SNP2GPS workflows.

Geographic Metadata

bridge_gbs_long_lat_outliers_corrected_22627.txt
Passport-derived latitude and longitude coordinates for all available accessions. Geographic outliers have been removed to improve data consistency.

bridge_gbs_long_lat_outliers_corrected_22627_centroids.txt
Augmented geographic dataset including cleaned passport coordinates (outliers removed) and imputed coordinates using country centroids for missing or unreliable entries. This dataset ensures complete spatial coverage for modeling tasks.

synthetic/
This folder contains a simulated dataset for testing, benchmarking, and validating SNP2GPS pipelines under controlled conditions.

Synthetic Genotypic Data

clean_latlon_strong_1000samples_20000snps_FOR_PIPELINE.npz
Synthetic SNP dataset containing 1,000 samples and 20,000 SNPs. Generated to exhibit a strong genotype–geography signal, making it suitable for validating model performance and pipeline correctness.

Synthetic Metadata

clean_latlon_strong_1000samples_metadata.tsv
Corresponding metadata file containing simulated geographic coordinates (latitude and longitude) for each synthetic sample. Designed to align perfectly with the synthetic genotype dataset for reproducible analyses.

Validation / Ground Truth

known_swapped_cases_truth_table.csv
Ground truth table defining known swapped or intentionally misassigned samples within the synthetic dataset. This file enables benchmarking of error detection methods, validation of SNP2GPS predictions, and evaluation of model robustness to sample mislabeling.
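
As an illustration of how a ground truth table of swapped samples could be used for benchmarking, the sketch below scores a list of flagged sample IDs against a truth table and reports precision and recall. It is not part of SNP2GPS. The column name `sample_id`, the helper names `load_swapped_ids` and `score_flags`, and the assumption that every row of the table describes one swapped sample are all hypothetical; the actual layout of known_swapped_cases_truth_table.csv should be checked before adapting it. The toy data is invented for the example.

```python
import csv
import io


def load_swapped_ids(handle, id_column="sample_id"):
    """Read the set of sample IDs listed in a truth-table CSV.

    ASSUMPTION: the table has a header row, one row per swapped sample,
    and an ID column named `sample_id` (hypothetical column name).
    """
    reader = csv.DictReader(handle)
    return {row[id_column].strip() for row in reader if row.get(id_column)}


def score_flags(flagged_ids, true_swapped_ids):
    """Compare flagged sample IDs with the known swapped IDs."""
    flagged = set(flagged_ids)
    truth = set(true_swapped_ids)
    true_positives = len(flagged & truth)
    # Guard against division by zero when either set is empty.
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return {
        "true_positives": true_positives,
        "precision": precision,
        "recall": recall,
    }


# Toy stand-in for the truth table (invented IDs, not the real file).
toy_truth = io.StringIO("sample_id\nS001\nS002\nS003\nS004\n")
truth_ids = load_swapped_ids(toy_truth)

# Toy detector output: two correct flags and one false alarm.
result = score_flags(["S001", "S002", "S999"], truth_ids)
print(result)
```

To run this against the real file, the `io.StringIO` object would be replaced with an opened file handle, for example `open("known_swapped_cases_truth_table.csv", newline="")`, and `id_column` set to whatever the file's header actually uses.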
Provider:
Zenodo
Created:
2026-05-05