Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets.

Mendeley Data. Updated 2024-03-27. Indexed 2024-06-26.

Download link:

https://data.mendeley.com/datasets/6cm9wyd5g5

Description:

The purpose of data mining analysis is to find patterns in data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. The data first has to be pre-processed, which normally involves feature selection and dimensionality reduction. We tried clustering as a way to reduce the dimension of the data and to create new features. In our project, applying clustering prior to classification did not improve performance much. One possible reason is that the features we selected for clustering are not well suited to it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

From the dimensionality reduction perspective: clustering differs from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension loses a lot of information, since clustering techniques are based on a 'distance' metric, and in high dimensions Euclidean distance loses most of its meaning. Therefore "reducing" dimensionality by mapping data points to cluster numbers is not always a good idea, since almost all the information may be lost.

From the creating new features perspective: clustering analysis creates labels based on patterns in the data, which brings uncertainty into the data. When clustering is used prior to classification, the choice of the number of clusters strongly affects the performance of the clustering, and in turn the performance of the classification. If the subset of features we cluster on is well suited to clustering, it might increase the overall classification performance. For example, if the features we use k-means on are numerical and low-dimensional, the overall classification performance may be better.

We did not lock in the clustering outputs using a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the methods selected. In short, when clustering was applied as a preprocessing step, our results were not much better than random.

Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format as the data from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise the models from time to time as things change.
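The comparison described above, classifying on raw features, on PCA components, and on cluster numbers alone, could be set up roughly as follows. This is a minimal sketch and not the project's actual pipeline: it assumes scikit-learn, uses synthetic data from make_classification in place of the North Carolina datasets, and the ClusterIdFeature helper, the choice of logistic regression, and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class ClusterIdFeature(BaseEstimator, TransformerMixin):
    """Hypothetical helper: replace all features with a one-hot k-means cluster id."""

    def __init__(self, n_clusters=5, random_state=0):
        self.n_clusters = n_clusters
        self.random_state = random_state

    def fit(self, X, y=None):
        self.kmeans_ = KMeans(
            n_clusters=self.n_clusters, n_init=10, random_state=self.random_state
        ).fit(X)
        return self

    def transform(self, X):
        labels = self.kmeans_.predict(X)
        # One row per sample, one column per cluster.
        return np.eye(self.n_clusters)[labels]


# Synthetic stand-in data (assumption), not the 2014-2017 school datasets.
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=10, random_state=0
)

pipelines = {
    "raw features": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "PCA (5 components)": make_pipeline(
        StandardScaler(), PCA(n_components=5), LogisticRegression(max_iter=1000)
    ),
    "k-means cluster id (5 clusters)": make_pipeline(
        StandardScaler(), ClusterIdFeature(n_clusters=5), LogisticRegression(max_iter=1000)
    ),
}

# Mean 5-fold cross-validated accuracy for each preprocessing choice.
scores = {
    name: float(cross_val_score(pipe, X, y, cv=5).mean())
    for name, pipe in pipelines.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Because the clustering is fitted inside each cross-validation fold, the cluster numbers never see the held-out rows, which keeps the comparison with PCA fair. The printed accuracies depend on the synthetic data and say nothing about the results reported for the school data.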
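The remark that Euclidean distance loses meaning in high dimensions can be illustrated by measuring how the gap between the nearest and farthest neighbour shrinks as dimensions are added. The sketch below assumes NumPy and uniformly random points; the relative_contrast function and the dimensions tried are illustrative choices, not part of the original analysis.

```python
import numpy as np


def relative_contrast(n_points, n_dims, seed=0):
    """(max - min) / min of the Euclidean distances from one query point to the rest."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_points, n_dims))
    # Distances from the first point to every other point.
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    return float((dists.max() - dists.min()) / dists.min())


contrast_by_dim = {d: relative_contrast(1000, d) for d in (2, 10, 100, 1000)}
for d, c in contrast_by_dim.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

A large contrast means the nearest point is much closer than the farthest one, so "near" and "far" are informative. As the contrast falls toward zero, all points look roughly equidistant, and a distance-based cluster assignment carries little information.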
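The run-to-run stability check mentioned above can be made quantitative by comparing the labelings from several k-means runs with the adjusted Rand index, which is 1.0 for identical partitions and near 0 for unrelated ones. This is a hedged sketch assuming scikit-learn. Unlike the project, which left random_state unset, it uses a different explicit seed per run so that the instability itself is reproducible. The kmeans_stability function and both synthetic datasets are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score


def kmeans_stability(X, n_clusters, seeds=(0, 1, 2, 3, 4)):
    """Mean pairwise adjusted Rand index between k-means runs with different seeds."""
    labelings = [
        KMeans(n_clusters=n_clusters, n_init=1, random_state=s).fit_predict(X)
        for s in seeds
    ]
    pair_scores = [
        adjusted_rand_score(labelings[i], labelings[j])
        for i in range(len(labelings))
        for j in range(i + 1, len(labelings))
    ]
    return float(np.mean(pair_scores))


# Synthetic stand-ins (assumption): three well-separated blobs vs. structureless noise.
X_blobs, _ = make_blobs(
    n_samples=600,
    centers=[[-10, -10], [0, 10], [10, -10]],
    cluster_std=0.5,
    random_state=0,
)
X_noise = np.random.default_rng(0).normal(size=(600, 50))

stable = kmeans_stability(X_blobs, n_clusters=3)
unstable = kmeans_stability(X_noise, n_clusters=3)
print(f"well-separated blobs: {stable:.2f}")
print(f"high-dimensional noise: {unstable:.2f}")
```

A score close to 1 suggests the clusters reflect real structure. A low score is consistent with the interpretation above that the data does not cluster well with the chosen method, in which case cluster numbers are unlikely to help a downstream classifier.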

Created:

2024-01-23