California Housing Data (1990)
Source: www.kaggle.com. Updated 2018-05-10; indexed 2025-03-23.
Download link:
https://www.kaggle.com/harrywang/housing
Description:
# Source
This dataset is used in the book at https://github.com/ageron/handson-ml/tree/master/datasets/housing to illustrate a sample end-to-end ML project workflow (pipeline). It is a great book, and I highly recommend it!
The data is based on the 1990 California census.
### About the Data (from the book):
"This dataset is a modified version of the California Housing dataset available from Luís Torgo's page (University of Porto). Luís Torgo obtained it from the StatLib repository (which is closed now). The dataset may also be downloaded from StatLib mirrors.
The following is the description from the book author:
This dataset appeared in a 1997 paper titled "Sparse Spatial Autoregressions" by R. Kelley Pace and Ronald Barry, published in the journal Statistics and Probability Letters. They built it using the 1990 California census data. It contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).
The dataset in this directory is almost identical to the original, with two differences:
- 207 values were randomly removed from the total_bedrooms column, so we can discuss what to do with missing data.
- An additional categorical attribute called ocean_proximity was added, indicating (very roughly) whether each block group is near the ocean, near the Bay area, inland or on an island. This allows discussing what to do with categorical data.
Note that the block groups are called "districts" in the Jupyter notebooks, simply because in some contexts the name "block group" was confusing."
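The two modifications described above map onto two common preprocessing steps. The sketch below is one possible treatment, not the book's own code: it uses a tiny made-up frame instead of the real file, and only the column names total_bedrooms and ocean_proximity come from the description. The values, the category labels, and the choice of median imputation and one-hot encoding are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample; values and category labels are invented.
df = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"],
})

# One option for missing data: fill the gaps with the column median.
median_bedrooms = df["total_bedrooms"].median()
df["total_bedrooms"] = df["total_bedrooms"].fillna(median_bedrooms)

# One option for categorical data: one-hot encode the text column.
encoded = pd.get_dummies(df, columns=["ocean_proximity"], dtype=int)
print(encoded)
```

Dropping the incomplete rows or the whole column are alternative ways to handle the missing values; the median is used here only because it keeps every row.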
### About the Data (from Luís Torgo's page):
http://www.dcc.fc.up.pt/%7Eltorgo/Regression/cal_housing.html
This is a dataset obtained from the StatLib repository. Here is the included description:
"We collected information on the variables using all the block groups in California from the 1990 Census. In this sample a block group on average includes 1425.5 individuals living in a geographically compact area. Naturally, the geographical area included varies inversely with the population density. We computed distances among the centroids of each block group as measured in latitude and longitude. We excluded all the block groups reporting zero entries for the independent and dependent variables. The final data contained 20,640 observations on 9 variables. The dependent variable is ln(median house value)."
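The StatLib description states that the dependent variable is ln(median house value). A minimal sketch of that transform follows, assuming the file at hand stores the value in dollars (worth checking against the actual data). The numbers are invented for illustration.

```python
import numpy as np

# Hypothetical median house values in dollars (invented for illustration).
median_house_value = np.array([100_000.0, 250_000.0, 500_000.0])

# The dependent variable as described: the natural log of the value.
log_value = np.log(median_house_value)

# A prediction made on the log scale maps back to dollars with exp.
recovered = np.exp(log_value)
print(log_value, recovered)
```

If a model is trained on the log scale, its predictions need the exp step before they can be read as prices.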
### End-to-End ML Project Steps (Chapter 2 of the book)
1. Look at the big picture
2. Get the data
3. Discover and visualize the data to gain insights
4. Prepare the data for Machine Learning algorithms
5. Select a model and train it
6. Fine-tune your model
7. Present your solution
8. Launch, monitor, and maintain your system
# The 10-Step Machine Learning Project Workflow (My Version)
1. Define the business objective
2. Make sense of the data from a high level
- data types (number, text, object, etc.)
- continuous/discrete
- basic stats (min, max, std, median, etc.) using a boxplot
- frequency via histogram
- scales and distributions of different features
3. Create the training and test sets using proper sampling methods, e.g., random vs. stratified
4. Correlation analysis (pair-wise and attribute combinations)
5. Data cleaning (missing data, outliers, data errors)
6. Data transformation via pipelines (categorical text to numbers using one-hot encoding, feature scaling via normalization/standardization, feature combinations)
7. Train and cross validate different models and select the most promising one (Linear Regression, Decision Tree, and Random Forest were tried in this tutorial)
8. Fine-tune the model by trying different combinations of hyperparameters
9. Evaluate the best estimator on the test set
10. Launch, monitor, and refresh the model and system
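Steps 3 and 5 through 9 can be sketched with scikit-learn. This is a rough outline under stated assumptions, not the tutorial's code: the data is a synthetic stand-in with invented values, the income bins, model settings, and hyperparameter grid are arbitrary choices, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing table (all values are invented).
rng = np.random.default_rng(42)
n = 300
data = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, n),
    "total_bedrooms": rng.uniform(50, 1000, n),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], n),
})
data.loc[::25, "total_bedrooms"] = np.nan  # simulate missing values
target = 40_000 * data["median_income"] + rng.normal(0, 20_000, n)

# Step 3: stratify on binned income so both sets share its distribution.
income_bin = pd.cut(data["median_income"], bins=[0, 3, 6, 9, np.inf], labels=False)
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, stratify=income_bin, random_state=42)

# Steps 5-6: impute and scale numeric columns, one-hot encode the text column.
preprocess = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
     ["median_income", "total_bedrooms"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
])

# Step 7: cross-validate the three model families named in the list.
candidates = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
    "forest": RandomForestRegressor(n_estimators=20, random_state=42),
}
cv_rmse = {}
for name, model in candidates.items():
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    scores = cross_val_score(pipe, X_train, y_train,
                             scoring="neg_mean_squared_error", cv=5)
    cv_rmse[name] = float(np.sqrt(-scores).mean())

# Step 8: grid search over a small, arbitrary hyperparameter grid.
search = GridSearchCV(
    Pipeline([("prep", preprocess),
              ("model", RandomForestRegressor(random_state=42))]),
    param_grid={"model__n_estimators": [10, 30],
                "model__max_features": [1.0, 0.5]},
    scoring="neg_mean_squared_error", cv=3)
search.fit(X_train, y_train)

# Step 9: evaluate the best estimator once on the held-out test set.
test_pred = search.best_estimator_.predict(X_test)
test_rmse = float(np.sqrt(mean_squared_error(y_test, test_pred)))
print(cv_rmse, search.best_params_, test_rmse)
```

Keeping the preprocessing inside the pipeline means the imputer and scaler are refit on each training fold, so nothing from the validation fold or the test set leaks into them.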
Provider:
www.kaggle.com



