Big Data Model Building Using Dimension Reduction and Sample Selection
Source: DataCite Commons | Updated: 2023-11-15 | Indexed: 2024-08-18
Download link:
https://tandf.figshare.com/articles/dataset/Big_Data_Model_Building_using_Dimension_Reduction_and_Sample_Selection/24233113/2
Description:
It is difficult to handle the extraordinary data volume generated in many fields with current computational resources and techniques. This is especially challenging when applying conventional statistical methods to big data. A common approach is to partition the full data into smaller subdata for purposes such as training, testing, and validation. The primary purpose of the training data is to represent the full data. To achieve this goal, the selection of training subdata becomes pivotal in retaining the essential characteristics of the full data. Recently, several procedures have been proposed to select “optimal design points” as training subdata under pre-specified models, such as linear regression and logistic regression. However, these subdata will not be “optimal” if the assumed model is not appropriate. Furthermore, such subdata are not useful for building alternative models because they are not an appropriately representative sample of the full data. In this article, we propose a novel algorithm for better model building and prediction via a process of selecting a “good” training sample. The proposed subdata can retain most characteristics of the original big data. They are also more robust, in that one can fit various response models and select the optimal one. Supplementary materials for this article are available online.
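The "common approach" mentioned in the description, partitioning the full data into training, validation, and test subdata, can be sketched as a simple random split. This is only a baseline illustration and is not the algorithm proposed in the article. The function name `partition_indices`, the 60/20/20 fractions, and the seed are assumptions chosen for the example.

```python
import numpy as np

def partition_indices(n_rows, fractions=(0.6, 0.2, 0.2), seed=0):
    """Randomly split row indices 0..n_rows-1 into train/validation/test parts.

    Hypothetical helper for illustration: a plain random partition, not the
    article's subdata-selection procedure.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)          # random order of all row indices
    n_train = int(round(fractions[0] * n_rows))
    n_valid = int(round(fractions[1] * n_rows))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]         # remainder goes to the test part
    return train, valid, test

# Example: split 1000 rows; the index arrays can then be used to slice the data.
train_idx, valid_idx, test_idx = partition_indices(1000)
```

A random split like this makes no use of a pre-specified model, which is the contrast the description draws with "optimal design point" selection under linear or logistic regression.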
Provider:
Taylor & Francis
Created:
2023-11-15



