The Laws of Anomaly: A Framework for Regression Model Selection Based on a Large Scale Empirical Study of Structural Data Challenges

NIAID Data Ecosystem2026-05-10 收录

下载链接：

https://doi.org/10.7910/DVN/VH9JJA

下载链接

链接失效反馈

官方服务：

资源简介：

The "ensemble-first" strategy, while a popular heuristic for tabular regression, lacks a formal framework and fails on specific data challenges. This thesis introduces the Efficiency-Based Model Selection Framework (EMSF), a new methodology that aligns model architecture with a dataset's primary structural challenge. We benchmarked over 20 models across 100 real-world datasets, categorized into four novel cohorts: high row-to-size (computational efficiency), wide data (parameter efficiency), and messy data (data efficiency). This large-scale empirical study establishes three fundamental laws of applied regression. The Law of Ensemble Dominance confirms that ensembles are the most efficient choice in over 70% of standard cases. The Law of Anomaly Supremacy proves the critical exceptions: we provide the first large-scale evidence that K-Nearest Neighbors (KNN) excels on high-dimensional data, and that robust models like the Huber Regressor are "silver bullet" solutions for datasets with hidden outliers, winning with performance margins exceeding 1500%. Finally, the Law of Predictive Futility reframes benchmarking as a diagnostic tool for identifying datasets that lack predictive signal. The EMSF provides a practical, evidence-based playbook for practitioners to move beyond a one-size-fits-all approach to model selection.

创建时间：

2025-10-22

5,000+

优质数据集

54 个

任务类型

进入经典数据集