SMART-IMPUTE: A Time-Efficient, ANN-Based Algorithm for Practical Imputation with Empirical and Theoretical Validation

NIAID Data Ecosystem, indexed 2026-05-10

Download link:

https://doi.org/10.7910/DVN/TJ7WRT

Resource description:

This paper addresses a persistent bottleneck in data science: the slow and expensive process of handling missing data. Data teams are routinely forced to choose between simple methods that are fast but statistically naive and robust methods (such as MICE or KNN) that are accurate but computationally prohibitive, which stalls iterative workflows. This research introduces and validates Smart-Impute, a novel "High-Performance Robust Imputer" designed to resolve this trade-off for practitioners who need both state-of-the-art accuracy and high speed.

The key findings presented are:

1. Massive Performance Gains: Benchmarks on real-world datasets provide empirical evidence that Smart-Impute is up to 12.2x faster than the standard KNN imputer, turning hour-long processes into minutes.
2. Superior Scalability: A formal mathematical proof establishes Smart-Impute's O(N log N) time complexity, so that its running time scales gracefully as datasets grow.
3. State-of-the-Art Robustness: The algorithm's architecture natively handles the mixed data types and high-cardinality features that are common in real-world enterprise data.

The result is a practical, workhorse algorithm that saves time, increases team productivity, and enables more agile data analysis without sacrificing statistical integrity. This repository contains the full research paper detailing the algorithm, its theoretical proofs, and its empirical validation.
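For orientation, the baseline that the description compares against can be sketched in a few lines. The following is a minimal, pure-Python illustration of brute-force KNN imputation, not Smart-Impute itself, whose internals are not described on this page. The function name `knn_impute`, the distance definition, and the toy data are assumptions made only for this example.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows.

    Illustrative baseline only (hypothetical helper, not the paper's method).
    Each incomplete row is compared against every other row, so the number of
    distance computations grows quadratically with the number of rows.
    """
    def distance(a, b):
        # Compare only the columns that both rows have observed.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        # Averaging keeps rows with different numbers of shared columns comparable.
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]  # copy so the input is left untouched
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: every other row that observes column j.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

# Toy data (made up for this sketch): the second row is missing its second value.
data = [
    [1.0, 2.0],
    [1.1, None],
    [0.9, 1.8],
    [8.0, 9.0],
]
result = knn_impute(data, k=2)
```

The full scan over all rows for each missing entry is what makes this baseline expensive on large tables. A neighbor index that answers each query in roughly logarithmic time would remove that scan; whether Smart-Impute works this way is an assumption here, and the paper itself is the reference for its actual design.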

Created:

2025-10-22
