
SECOM Dataset: for studying the simple pass/fail yield of production line testing

Indexed by Payititi (帕依提提) on 2024-03-04
Download link:
https://www.payititi.com/opendatasets/show-26244.html
Resource description:
Authors: Michael McCann, Adrian Johnston

### Data Set Information

A complex modern semiconductor manufacturing process is normally under consistent surveillance via the monitoring of signals/variables collected from sensors and/or process measurement points. However, not all of these signals are equally valuable in a specific monitoring system. The measured signals contain a combination of useful information, irrelevant information, and noise, and the useful information is often buried in the latter two. Engineers typically have a much larger number of signals than are actually required. If we consider each type of signal as a feature, then feature selection may be applied to identify the most relevant signals. The process engineers may then use these signals to determine key factors contributing to yield excursions downstream in the process. This will enable increased process throughput, decreased time to learning, and reduced per-unit production costs. To enhance current business improvement techniques, the application of feature selection as an intelligent systems technique is being investigated.

The dataset presented in this case represents a selection of such features, where each example represents a single production entity with associated measured features, and the labels represent a simple pass/fail yield for in-house line testing (figure 2) together with an associated date time stamp. Here -1 corresponds to a pass and 1 corresponds to a fail, and the date time stamp is for that specific test point.

Using feature selection techniques, it is desired to rank features according to their impact on the overall yield for the product; causal relationships may also be considered with a view to identifying the key features. Results may be submitted in terms of feature relevance for predictability, using error rates as the evaluation metrics. It is suggested that cross validation be applied to generate these results.
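The ranking task described above can be illustrated with a small sketch. It assumes one particular relevance score, the absolute Pearson correlation between a feature column and the -1/1 label, and it skips NaN entries when computing that score. The function names, the `top_k` parameter, and the toy data are all hypothetical and are not taken from the dataset or its authors.

```python
import math

def pearson_abs(xs, ys):
    """Absolute Pearson correlation of one feature with the labels, ignoring NaN entries."""
    pairs = [(x, y) for x, y in zip(xs, ys) if not math.isnan(x)]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # constant feature (or a single class left): no ranking signal
    return abs(sxy) / math.sqrt(sxx * syy)

def rank_features(rows, labels, top_k=40):
    """Return the indices of the top_k highest-scoring features and all scores."""
    n_features = len(rows[0])
    scores = [pearson_abs([r[j] for r in rows], labels) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order[:top_k], scores

# Toy data (made up): feature 0 tracks the label, feature 1 is constant,
# feature 2 is weakly related and has one missing value.
rows = [
    [1.0, 5.0, 0.3],
    [1.2, 5.0, float("nan")],
    [0.9, 5.0, 0.1],
    [3.1, 5.0, 0.2],
    [2.9, 5.0, 0.35],
]
labels = [-1, -1, -1, 1, 1]  # -1 = pass, 1 = fail
top, scores = rank_features(rows, labels, top_k=2)
print(top)
```

On the toy data the label-tracking feature ranks first and the constant feature scores zero. A correlation score of this kind measures association only, so it does not address the causal relationships mentioned above.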
#### Baseline Results

Some baseline results are shown below for basic feature selection techniques using a simple kernel ridge classifier and 10-fold cross validation. Pre-processing objects were applied to the dataset simply to standardize the data and remove the constant features. A number of different feature selection objects, each selecting the 40 highest-ranked features, were then applied with a simple classifier to achieve some initial results. 10-fold cross validation was used, and the balanced error rate (BER) was generated as the initial performance metric to help investigate this dataset.

SECOM Dataset: 1567 examples, 591 features, 104 fails.

| FS method (40 features) | BER (%) | True + (%) | True - (%) |
|-------------------------|---------|------------|------------|
| S2N (signal to noise) | 34.5 ± 2.6 | 57.8 ± 5.3 | 73.1 ± 2.1 |
| Ttest | 33.7 ± 2.1 | 59.6 ± 4.7 | 73.0 ± 1.8 |
| Relief | 40.1 ± 2.8 | 48.3 ± 5.9 | 71.6 ± 3.2 |
| Pearson | 34.1 ± 2.0 | 57.4 ± 4.3 | 74.4 ± 4.9 |
| Ftest | 33.5 ± 2.2 | 59.1 ± 4.8 | 73.8 ± 1.8 |
| Gram Schmidt | 35.6 ± 2.4 | 51.2 ± 11.8 | 77.5 ± 2.3 |

### Attribute Information

#### Key Facts

Data structure: the data consists of 2 files. The dataset file SECOM consists of 1567 examples, each with 591 features (a 1567 x 591 matrix), and a labels file contains the classifications and date time stamp for each example.

As with any real-life data, this dataset contains null values, whose frequency varies from feature to feature. This needs to be taken into consideration when investigating the data, either through pre-processing or within the technique applied.

The data is represented in a raw text file, with each line representing an individual example and the features separated by spaces. Null values are represented by the 'NaN' value, as in MATLAB.

### Relevant Papers

N/A

### Citation Request

Please refer to the Machine Learning Repository's citation policy.
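The balanced error rate in the baseline results is not defined on this page. The sketch below assumes the usual definition, the mean of the error rates on the two classes, with the fail label 1 treated as the positive class. That assumption is at least consistent with the reported figures: for the S2N row, 0.5 × ((100 - 57.8) + (100 - 73.1)) = 34.55, close to the reported 34.5. The function name and the toy predictions are hypothetical.

```python
def balanced_error_rate(y_true, y_pred, positive=1, negative=-1):
    """Return (BER, true positive rate, true negative rate) for -1/1 labels.

    Assumption: BER is the average of the per-class error rates,
    with `positive` = fail (1) and `negative` = pass (-1).
    """
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        elif t == negative:
            if p == negative:
                tn += 1
            else:
                fp += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    true_pos_rate = tp / (tp + fn)
    true_neg_rate = tn / (tn + fp)
    ber = 0.5 * ((1 - true_pos_rate) + (1 - true_neg_rate))
    return ber, true_pos_rate, true_neg_rate

# Toy predictions (made up): 4 fails and 8 passes.
y_true = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, -1, -1, -1, -1, -1, -1, 1, 1]
ber, tpr, tnr = balanced_error_rate(y_true, y_pred)
print(ber, tpr, tnr)
```

With only 104 fails among 1567 examples, a classifier that always predicts "pass" would have a low plain error rate but a BER of 0.5 under this definition, which is presumably why a balanced metric was chosen.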

Provider: Payititi (帕依提提)
Collected summary
Dataset introduction
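Background and challenges
Background overview
The SECOM dataset is an industrial dataset for semiconductor manufacturing line testing. It contains 1567 examples and 591 features, and is intended for analyzing the pass/fail yield of the production process through feature selection (the labels are -1 and 1). The dataset is high-dimensional and contains missing values, which makes it suitable for studying feature selection techniques aimed at improving production yield. Baseline experimental results are provided for reference.
The above content was collected and summarized by 遇见数据集.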
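The raw text layout given in the resource description (one example per line, space-separated features, 'NaN' for null values) can be read with a short parser. This is a minimal sketch using only the Python standard library. The function names and the sample values are made up for illustration, and a real run would pass the lines of the downloaded data file instead of the inline sample.

```python
import math

def parse_secom_lines(lines):
    """Parse space-separated feature rows; float() turns the token 'NaN' into nan."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        rows.append([float(tok) for tok in line.split()])
    return rows

def missing_fraction(rows):
    """Fraction of NaN entries in each feature column."""
    n = len(rows)
    return [sum(math.isnan(r[j]) for r in rows) / n for j in range(len(rows[0]))]

def constant_columns(rows):
    """Indices of features whose observed (non-NaN) values never vary."""
    const = []
    for j in range(len(rows[0])):
        observed = {r[j] for r in rows if not math.isnan(r[j])}
        if len(observed) <= 1:
            const.append(j)
    return const

# Made-up sample in the described layout (3 examples, 4 features).
sample = [
    "3030.93 2564.00 NaN 100.0",
    "3095.78 NaN 1.2 100.0",
    "2932.61 2559.94 1.5 100.0",
]
rows = parse_secom_lines(sample)
missing = missing_fraction(rows)
const = constant_columns(rows)
print(len(rows), len(rows[0]), missing, const)
```

The per-feature missing fraction and the list of constant columns correspond to the two issues the description calls out: null values that vary by feature, and constant features that the baseline pre-processing removes.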