The Laws of Anomaly: A Framework for Regression Model Selection Based on a Large Scale Empirical Study of Structural Data Challenges

Source: DataONE. Updated: 2025-10-22. Indexed: 2025-11-01.
Download link:
https://search.dataone.org/view/sha256:78ff4eb0e24880d652ffd17dc51507882caef47799715a53bc601ab8b59ec6b7
Description:
The "ensemble-first" strategy, while a popular heuristic for tabular regression, lacks a formal framework and fails on specific data challenges. This thesis introduces the Efficiency-Based Model Selection Framework (EMSF), a new methodology that aligns model architecture with a dataset's primary structural challenge. We benchmarked over 20 models across 100 real-world datasets, categorized into four novel cohorts: high row-to-size (computational efficiency), wide data (parameter efficiency), and messy data (data efficiency). This large-scale empirical study establishes three fundamental laws of applied regression. The Law of Ensemble Dominance confirms that ensembles are the most efficient choice in over 70% of standard cases. The Law of Anomaly Supremacy proves the critical exceptions: we provide the first large-scale evidence that K-Nearest Neighbors (KNN) excels on high-dimensional data, and that robust models like the Huber Regressor are "silver bullet" solutions for datasets with hidden outliers, winning with performance margins exceeding 1500%. Finally, the Law of Predictive Futility reframes benchmarking as a diagnostic tool for identifying datasets that lack predictive signal. The EMSF provides a practical, evidence-based playbook for practitioners to move beyond a one-size-fits-all approach to model selection.
Created:
2025-10-28
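
The description's point about robust models on data with hidden outliers can be illustrated with a small sketch. The example below is not taken from the thesis or its datasets: the data are synthetic, the function names (`weighted_line_fit`, `huber_line_fit`) are hypothetical, and iteratively reweighted least squares (IRLS) is just one common way to minimize the Huber loss, assumed here for simplicity. It fits a line to points that follow y = 2x + 1, with two corrupted observations, once by ordinary least squares and once with Huber weighting.

```python
def weighted_line_fit(xs, ys, ws):
    """Weighted least-squares fit of y = a*x + b; returns (a, b)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def huber_line_fit(xs, ys, delta=1.0, iters=50):
    """Approximate Huber-loss line fit via IRLS (illustrative sketch)."""
    ws = [1.0] * len(xs)
    for _ in range(iters):
        a, b = weighted_line_fit(xs, ys, ws)
        ws = []
        for x, y in zip(xs, ys):
            r = abs(y - (a * x + b))
            # Huber weights: full weight inside the delta band,
            # down-weighted in proportion to 1/|residual| outside it.
            ws.append(1.0 if r <= delta else delta / r)
    return weighted_line_fit(xs, ys, ws)


# Synthetic data (assumption): true slope 2, intercept 1, two large outliers.
xs = [float(x) for x in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
ys[18] += 100.0
ys[19] += 100.0

ols_slope, ols_intercept = weighted_line_fit(xs, ys, [1.0] * len(xs))
huber_slope, huber_intercept = huber_line_fit(xs, ys)

print("OLS slope:  ", round(ols_slope, 3))
print("Huber slope:", round(huber_slope, 3))
```

In this toy setting the two high-leverage outliers pull the least-squares slope well away from 2, while the Huber fit stays close to it. The size of that gap depends entirely on the synthetic setup and says nothing about the margins reported in the study.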