Estimating Sampling Selection Bias in Human Genetics: A Phenomenological Approach

NIAID Data Ecosystem, indexed 2026-03-09

Download link:

https://figshare.com/articles/dataset/_Estimating_Sampling_Selection_Bias_in_Human_Genetics_A_Phenomenological_Approach_/1572088

Description:

This research is the first empirical attempt to calculate the various components of the hidden bias associated with the sampling strategies routinely used in human genetics, with special reference to surname-based strategies. We reconstructed surname distributions of 26 Italian communities with different demographic features across the last six centuries (years 1447–2001). The degree of overlap between "reference founding core" distributions and the distributions obtained from sampling the present-day communities by probabilistic and selective methods was quantified under different conditions and models. When taking into account only one individual per surname (low kinship model), the average discrepancy was 59.5%, with a peak of 84% by random sampling. When multiple individuals per surname were considered (high kinship model), the discrepancy decreased by 8–30% at the cost of a larger variance. Criteria aimed at maximizing locally spread patrilineages and long-term residency appeared to be affected by recent gene flows much more than expected. Selection of the more frequent family names following low kinship criteria proved to be a suitable approach only for historically stable communities. In all other cases, true random sampling, despite its high variance, did not return more biased estimates than other selective methods. Our results indicate that the sampling of individuals bearing historically documented surnames (founders' method) should be applied, especially when studying the male-specific genome, to prevent an over-stratification of ancient and recent genetic components that heavily biases inferences and statistics.
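
The description does not say which overlap statistic the authors used, so the following is only an illustrative sketch of how a discrepancy between a founding-core surname distribution and a present-day sample could be computed. It assumes discrepancy is defined as one minus the shared relative frequency (the sum of per-surname minima); the function names and the toy surname lists are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def relative_frequencies(surnames):
    """Map each surname to its share of the list (shares sum to 1)."""
    counts = Counter(surnames)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def discrepancy(reference, sample):
    """Return 1 minus the overlap of two surname distributions.

    Overlap is ASSUMED here to be the sum, over all surnames, of the
    smaller of the two relative frequencies. The paper may use another
    measure; this is only one common choice.
    """
    ref = relative_frequencies(reference)
    smp = relative_frequencies(sample)
    names = set(ref) | set(smp)
    overlap = sum(min(ref.get(n, 0.0), smp.get(n, 0.0)) for n in names)
    return 1.0 - overlap


def one_per_surname(surnames):
    """Low kinship model: keep a single individual per surname."""
    return sorted(set(surnames))


# Hypothetical toy data, not values from the dataset.
founding_core = ["Rossi", "Rossi", "Bianchi", "Conti"]
present_day = ["Rossi", "Rossi", "Rossi", "Ferrari", "Ferrari",
               "Bianchi", "Russo", "Russo"]

# High kinship model: every sampled individual counts.
high_kinship = discrepancy(founding_core, present_day)

# Low kinship model: one individual per surname on both sides.
low_kinship = discrepancy(one_per_surname(founding_core),
                          one_per_surname(present_day))

print(f"high kinship discrepancy: {high_kinship:.3f}")
print(f"low kinship discrepancy:  {low_kinship:.3f}")
```

Under this definition, identical distributions give a discrepancy of 0 and distributions with no surname in common give 1, which matches the percentage scale on which the description reports its figures.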

Created:

2015-10-09
