Red Wine Quality: a simple and clean practice dataset for regression or classification modelling

Indexed by Payititi (帕依提提) on 2024-03-04
Download link:
https://www.payititi.com/opendatasets/show-13360.html
Resource description:
The two datasets are related to the red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.). These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).

This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality ; I just shared it to kaggle for convenience. (If I am mistaken and the public license type disallows me from doing so, I will take this down on first request. I am not the owner of this dataset.)

Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol

Output variable (based on sensory data):
12 - quality (score between 0 and 10)

An interesting thing to do, aside from regression modelling, is to set an arbitrary cutoff for your dependent variable (wine quality), e.g. wines scoring 7 or higher are classified as 'good/1' and the remainder as 'not good/0'. This allows you to practice hyperparameter tuning on e.g. decision tree algorithms, looking at the ROC curve and the AUC value. Without doing any kind of feature engineering or overfitting you should be able to get an AUC of .88 (without even using a random forest algorithm).

KNIME is a great tool (GUI) that can be used for this.
1 - File Reader (for csv) to linear correlation node and to interactive histogram for basic EDA.
2 - File Reader to 'Rule Engine Node' to turn the 10 point scale into a dichotomous variable (good wine and rest) via a short rule in the rule engine.

Use machine learning to determine which physicochemical properties make a wine 'good'!

Please include this citation if you plan to use this database: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
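
For readers working outside KNIME, the cutoff step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name binarize_quality and the example scores are made up for this sketch and are not taken from the dataset or from the original rule-engine code, which this description does not include.

```python
from collections import Counter


def binarize_quality(scores, cutoff=7):
    """Map each quality score to 1 ('good') if it is >= cutoff, else 0 ('not good')."""
    return [1 if score >= cutoff else 0 for score in scores]


# Made-up quality scores for illustration only; not real rows from the dataset.
example_scores = [5, 6, 7, 5, 8, 6, 4, 6, 5, 7]

labels = binarize_quality(example_scores)

# Counting the labels makes any class imbalance visible before modelling.
class_counts = Counter(labels)

print(labels)
print(class_counts)
```

Counting the two classes right after thresholding is a cheap way to see how unbalanced the resulting binary target is, which matters when choosing an evaluation metric.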

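
The AUC value mentioned in the description can be read as the probability that a randomly chosen 'good' wine receives a higher model score than a randomly chosen 'not good' one. The sketch below computes it by direct pairwise comparison. The labels and scores are hypothetical, the helper name roc_auc is introduced here for illustration, and the quadratic loop is meant for clarity rather than speed; in practice a library routine such as scikit-learn's roc_auc_score would normally be used.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical binary labels (1 = 'good') and model scores, for illustration only.
true_labels = [0, 0, 1, 0, 1, 1]
model_scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.3]

auc = roc_auc(true_labels, model_scores)
print(round(auc, 3))
```

Because AUC depends only on how the scores rank the two classes, it is less sensitive to class imbalance than plain accuracy, which is one reason it suits the thresholded quality target.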
Provided by:
Payititi (帕依提提)
Collected summary
Dataset introduction
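Background and challenges
Background overview
This dataset focuses on quality assessment of Portuguese Vinho Verde red wine. It contains physicochemical attributes (such as acidity and alcohol content) as input variables and a sensory quality score (0 to 10) as the output variable, and it is suitable for practicing regression or classification modelling. The dataset is simple and clean in structure, but its classes are imbalanced. It suits machine learning beginners for model training and evaluation, for example binary classification after applying a threshold to the quality score.
The above content was collected and summarized by 遇见数据集.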