Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets.
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/6cm9wyd5g5
Description:
The purpose of data mining analysis is to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before any modeling, the data has to be pre-processed, and this normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, using clustering prior to classification did not improve performance much. One possible reason is that the features we selected for clustering are not well suited to it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.
From the dimensionality reduction perspective: Clustering is different from Principal Component Analysis (PCA), which finds the linear transformation that reduces the number of dimensions with minimum loss of information, measured as retained variance. Using clusters to reduce the data dimension loses a lot of information, since clustering techniques are based on a 'distance' metric. In high dimensions, Euclidean distance loses much of its meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always a good idea, since you may lose almost all the information.
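The claim about Euclidean distance in high dimensions can be illustrated with a minimal sketch. It assumes uniform random points in a unit hypercube, not the North Carolina school data, and the helper name `relative_contrast` is hypothetical. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance; when that ratio shrinks toward zero, "near" and "far" become hard to tell apart.

```python
import math
import random


def relative_contrast(n_points, n_dims, rng):
    """Return (max - min) / min over the Euclidean distances from one
    random query point to n_points uniform random points in [0, 1]^n_dims."""
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


# Fixed seed only so that this illustration is repeatable.
rng = random.Random(0)

# Synthetic data (an assumption of this sketch): 500 points per setting.
contrast_by_dim = {d: relative_contrast(500, d, rng) for d in (2, 10, 100, 1000)}

for d, c in contrast_by_dim.items():
    print(f"dims={d:5d}  relative contrast={c:.3f}")
```

On data like this the contrast typically drops sharply as the number of dimensions grows, which is one reason distance-based cluster assignments can carry little information when many features are used at once.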
From the creating new features perspective: Clustering analysis creates labels based on patterns in the data, which brings uncertainty into the data. When clustering is used prior to classification, the choice of the number of clusters strongly affects the performance of the clustering, and in turn the performance of the classification. If the subset of features we cluster on is well suited to it, clustering might increase the overall classification performance. For example, if the features we use k-means on are numerical and low-dimensional, the overall classification performance may be better.
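A rough sketch of this "cluster, then classify" setup follows. It is not the project's actual pipeline. It assumes scikit-learn, a synthetic dataset from `make_classification`, logistic regression as a stand-in classifier, and a hypothetical helper `accuracy_with_cluster_feature`. The k-means cluster id is appended as one-hot columns, and the number of clusters is varied to show how that choice feeds into the classification score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, numerical, low-dimensional data (an assumption of this sketch).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Scale using training statistics only, since k-means is distance based.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)


def accuracy_with_cluster_feature(k):
    """Fit k-means on the training set, append the one-hot cluster id
    to the features, and return test accuracy of a logistic regression."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train_s)
    one_hot = np.eye(k)
    train_aug = np.hstack([X_train_s, one_hot[km.labels_]])
    test_aug = np.hstack([X_test_s, one_hot[km.predict(X_test_s)]])
    clf = LogisticRegression(max_iter=1000).fit(train_aug, y_train)
    return clf.score(test_aug, y_test)


# Baseline: the same classifier without any cluster feature.
baseline = LogisticRegression(max_iter=1000).fit(X_train_s, y_train).score(
    X_test_s, y_test
)
scores = {k: accuracy_with_cluster_feature(k) for k in (2, 4, 8)}

print(f"baseline accuracy: {baseline:.3f}")
for k, acc in scores.items():
    print(f"k={k}: accuracy with cluster feature = {acc:.3f}")
```

Fitting k-means on the training split only, and then calling `predict` on the test split, keeps the cluster feature from leaking test information into the classifier.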
We did not lock in the clustering outputs using a random_state, in order to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the methods selected. In effect, our results were not much better than random when clustering was applied in the data preprocessing.
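One way to quantify this kind of run-to-run stability is to compare the labelings from several runs with the adjusted Rand index (ARI), which ignores how the cluster numbers happen to be permuted. The sketch below is an illustration under assumptions, not the project's procedure. It uses scikit-learn with synthetic data, a hypothetical helper `mean_pairwise_ari`, and explicit per-run seeds in place of leaving `random_state` unset, only so that the example is repeatable.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score


def mean_pairwise_ari(X, n_clusters, seeds):
    """Run k-means once per seed (a single initialization each) and return
    the mean adjusted Rand index over all pairs of resulting labelings."""
    labelings = [
        KMeans(n_clusters=n_clusters, n_init=1, random_state=s).fit_predict(X)
        for s in seeds
    ]
    pair_scores = [adjusted_rand_score(a, b) for a, b in combinations(labelings, 2)]
    return float(np.mean(pair_scores))


# Synthetic data with clear cluster structure (an assumption of this sketch).
X_blobs, _ = make_blobs(
    n_samples=300, centers=3, n_features=10, cluster_std=1.0, random_state=0
)

# Synthetic data with no cluster structure: uniform noise of the same shape.
X_noise = np.random.default_rng(0).uniform(size=(300, 10))

seeds = range(5)
stability = {
    "blobs": mean_pairwise_ari(X_blobs, 3, seeds),
    "noise": mean_pairwise_ari(X_noise, 3, seeds),
}

for name, ari in stability.items():
    print(f"{name}: mean pairwise ARI = {ari:.3f}")
```

A mean ARI close to 1 means different runs agree on the grouping; values well below 1 indicate that the partition depends heavily on initialization, which is the kind of instability described above.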
Finally, it is important to ensure a feedback loop is in place to continuously collect the same data, in the same format, as the data from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise the models from time to time as things change.
Created:
2018-11-14



