The Laws of Anomaly: A Framework for Regression Model Selection Based on a Large Scale Empirical Study of Structural Data Challenges

DataONE2025-10-22 更新2025-11-01 收录

下载链接：

https://search.dataone.org/view/sha256:78ff4eb0e24880d652ffd17dc51507882caef47799715a53bc601ab8b59ec6b7

下载链接

链接失效反馈

官方服务：

资源简介：

The \"ensemble-first\" strategy, while a popular heuristic for tabular regression, lacks a formal framework and fails on specific data challenges. This thesis introduces the Efficiency-Based Model Selection Framework (EMSF), a new methodology that aligns model architecture with a dataset's primary structural challenge. We benchmarked over 20 models across 100 real-world datasets, categorized into four novel cohorts: high row-to-size (computational efficiency), wide data (parameter efficiency), and messy data (data efficiency). This large-scale empirical study establishes three fundamental laws of applied regression. The Law of Ensemble Dominance confirms that ensembles are the most efficient choice in over 70% of standard cases. The Law of Anomaly Supremacy proves the critical exceptions: we provide the first large-scale evidence that K-Nearest Neighbors (KNN) excels on high-dimensional data, and that robust models like the Huber Regressor are \"silver bullet\" solutions for datasets with hidden outliers, winning with performance margins exceeding 1500%. Finally, the Law of Predictive Futility reframes benchmarking as a diagnostic tool for identifying datasets that lack predictive signal. The EMSF provides a practical, evidence-based playbook for practitioners to move beyond a one-size-fits-all approach to model selection.

创建时间：

2025-10-28

5,000+

优质数据集

54 个

任务类型

进入经典数据集