
Marginal distributions of true DGP as CSV-file.

Figshare, updated 2025-06-02, indexed 2026-04-28
Download link:
https://figshare.com/articles/dataset/Marginal_distributions_of_true_DGP_as_CSV-file_/29217105
Description:
Simulation studies, especially neutral comparison studies, are crucial for evaluating and comparing statistical methods: they investigate whether methods work as intended and can guide an appropriate choice of method. Typically, the term simulation refers to parametric simulation, i.e. computer experiments using pseudo-random numbers. In such experiments, the full data-generating process (DGP) and outcome-generating model (OGM) are known within the simulation. However, specifying a realistic DGP can be difficult in practice, which leads to oversimplified assumptions. The problem is more severe for higher-dimensional data, because the number of parameters to specify typically increases with the number of variables in the data. Plasmode simulation, which combines resampling covariates from a real-life dataset drawn from the DGP of interest with a specified OGM, is often claimed to solve this problem, since no explicit specification of the DGP is necessary. However, this claim is not well supported by empirical results. Here, parametric and Plasmode simulations are compared in the context of a method comparison study for binary classification methods. We focus on studies conducted with a specific data type or application in mind, whose true, unknown data-generating mechanism is mimicked. Plasmode and parametric comparison studies are compared in terms of how well they estimate classifier performance and how well they reproduce the true method ranking. We investigate the influence of misspecifications of the DGP on the results of parametric simulation, and of misspecifications of the OGM on the results of both parametric and Plasmode simulation. Moreover, different resampling strategies are compared for Plasmode comparison studies. The study finds that misspecifications of the DGP and OGM negatively influence the ability of the comparison studies to estimate classification performance and method rankings. The best choice of resampling strategy in Plasmode simulation depends on the concrete scenario.
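The distinction between the two simulation types described above can be illustrated with a minimal sketch. It is not the authors' code: the multivariate normal DGP, the logistic OGM, and all function and parameter names (`parametric_covariates`, `plasmode_covariates`, `simulate_outcome`, `beta`, `intercept`) are assumptions chosen for illustration only.

```python
import numpy as np


def parametric_covariates(n, mean, cov, rng):
    """Parametric simulation: draw covariates from an explicitly specified DGP.

    Assumption: the DGP is multivariate normal; a real study would have to
    justify this choice, which is where misspecification can enter.
    """
    return rng.multivariate_normal(mean, cov, size=n)


def plasmode_covariates(real_X, n, rng, replace=True):
    """Plasmode simulation: resample covariate rows from a real dataset.

    No DGP is specified; `replace` switches between resampling with
    replacement (bootstrap-style) and without replacement (subsampling).
    """
    idx = rng.choice(real_X.shape[0], size=n, replace=replace)
    return real_X[idx]


def simulate_outcome(X, beta, intercept, rng):
    """Generate a binary outcome from a specified OGM.

    Assumption: the OGM is a logistic model with hypothetical coefficients.
    """
    p = 1.0 / (1.0 + np.exp(-(intercept + X @ beta)))
    return rng.binomial(1, p)


rng = np.random.default_rng(0)
beta = np.array([1.0, -0.5, 0.25])  # hypothetical OGM coefficients

# Stand-in for a real-life dataset (here itself simulated, for illustration).
real_X = rng.normal(size=(200, 3))

# Parametric: both the DGP and the OGM are specified by the analyst.
X_par = parametric_covariates(100, np.zeros(3), np.eye(3), rng)
y_par = simulate_outcome(X_par, beta, 0.0, rng)

# Plasmode: covariates are resampled, only the OGM is specified.
X_pla = plasmode_covariates(real_X, 100, rng, replace=True)
y_pla = simulate_outcome(X_pla, beta, 0.0, rng)
```

In both branches the outcome is generated from the same specified OGM; the branches differ only in where the covariates come from, which is why a misspecified OGM can affect either approach while a misspecified DGP affects only the parametric one.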

Created:
2025-06-02