Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets.

Name: Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets.
Creator: Mendeley Data
License: No description available

Indexed from doi.org on 2025-03-22

Download link:

http://doi.org/10.17632/6cm9wyd5g5.1

Description:

Data mining analysis aims to find patterns in data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset: the data must first be pre-processed, which normally involves feature selection and dimensionality reduction. We tried clustering as a way to reduce the dimensionality of the data and to create new features. In our project, clustering prior to classification did not improve performance much. One possible reason is that the features we selected for clustering were not well suited to it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

From the dimensionality reduction perspective: clustering differs from Principal Component Analysis (PCA), which is guaranteed to find the best linear transformation that reduces the number of dimensions with minimum loss of information. Using clusters to reduce the dimensionality of the data can lose a lot of information, since clustering techniques are based on a 'distance' metric. In high dimensions, Euclidean distance loses most of its meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always a good choice, since almost all of the information may be lost.

From the feature creation perspective: clustering analysis creates labels based on patterns in the data, which introduces uncertainty. When clustering precedes classification, the choice of the number of clusters strongly affects the quality of the clustering, and in turn the performance of the classification. If the subset of features being clustered is well suited to clustering, it might increase overall classification performance.
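The claim above that Euclidean distance loses meaning in high dimensions can be illustrated with a small sketch. It is not taken from the project: the function name `relative_contrast`, the uniform random points, and the sample sizes are all assumptions chosen for illustration. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance; as the number of dimensions grows, that gap tends to shrink.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (max - min) / min of the Euclidean distances from one random
    query point to n_points uniform random points in the unit hypercube.

    Hypothetical helper for illustration; not part of the original project.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(point, query))
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # Print the contrast for increasing dimensionality.
    for n_dims in (2, 10, 100, 1000):
        print(n_dims, relative_contrast(500, n_dims))
```

A small contrast means the nearest and farthest neighbors are almost equally far away, which is one reason distance-based cluster assignments can carry little information for wide feature sets.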
Clustering may help classification more when, for example, the features passed to k-means are numerical and low-dimensional.

We did not fix the clustering outputs with a random_state, so that we could see whether they were stable. Our assumption was that if the results varied highly from run to run, which they did, the data may simply not cluster well with the selected methods. In effect, applying clustering during pre-processing gave results that were not much better than random.

Finally, it is important to put a feedback loop in place that continuously collects the same data, in the same format, as the data from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise the models from time to time as things change.
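The run-to-run stability check described above could be made quantitative along the following lines. This is a minimal sketch, assuming scikit-learn and NumPy are available; it is not the authors' code. The helper name `clustering_stability`, the choice of the adjusted Rand index as the agreement measure, the seed list, and the synthetic data are all assumptions. Instead of leaving random_state unset, it runs k-means with several explicit seeds and averages the pairwise agreement between the resulting labelings.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score


def clustering_stability(X, n_clusters, seeds=(0, 1, 2, 3, 4)):
    """Mean pairwise adjusted Rand index between k-means runs with different seeds.

    Hypothetical helper for illustration. Values near 1 mean the runs agree;
    values near 0 mean the labelings agree no better than chance.
    """
    labelings = [
        KMeans(n_clusters=n_clusters, n_init=1, random_state=seed).fit_predict(X)
        for seed in seeds
    ]
    scores = [adjusted_rand_score(a, b) for a, b in combinations(labelings, 2)]
    return float(np.mean(scores))


if __name__ == "__main__":
    # Synthetic stand-ins, not the North Carolina school data.
    X_blobs, _ = make_blobs(
        n_samples=300,
        centers=[[0, 0], [10, 10], [-10, 10]],
        cluster_std=0.5,
        random_state=0,
    )
    X_noise = np.random.default_rng(0).random((300, 20))
    print("well-separated blobs:", clustering_stability(X_blobs, 3))
    print("uniform noise:", clustering_stability(X_noise, 3))
```

The adjusted Rand index is used here because it ignores how cluster numbers are permuted between runs, so two runs that find the same groups under different labels still count as agreeing.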

Provider:

Mendeley Data