22Danielle/new_paris_housing_dataset

Hugging Face · Updated 2026-04-09 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/22Danielle/new_paris_housing_dataset
Description:
# Paris Housing Classification: EDA Assignment

**Dataset:** [Paris Housing Classification](https://www.kaggle.com/datasets/mssmartypants/paris-housing-classification) · Kaggle

This dataset contains records of 10,000 real estate properties in Paris, France. Each property is described by 17 numeric features including size, number of rooms, amenities, price, and year built. **The target variable** is category, a binary label indicating whether a property is Luxury or Basic.

---

## Central Research Question

**Can we predict whether a Paris property is Luxury or Basic based on its physical and structural features?**

---

## Outlier Detection (IQR Method)

Values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR were flagged as outliers.

**Decision:** No rows removed. Extreme values in squareMeters and price represent real edge cases (large estates, high-end properties) and carry meaningful signal for classification.

![Unknown](https://cdn-uploads.huggingface.co/production/uploads/69d3e9fe5e7b5b0842a5c14f/78qhWfZSFZkhFwkrPcEVC.png)

---

## Descriptive Statistics Summary

| Feature | Luxury (mean) | Basic (mean) | Takeaway |
|---|---|---|---|
| squareMeters | Much larger | Much smaller | Strongest predictor |
| price | Significantly higher | Significantly lower | Second strongest predictor |
| numberOfRooms | More rooms | Fewer rooms | Moderate signal |
| floors | More floors | Fewer floors | Moderate signal |
| numPrevOwners | Similar | Similar | Weak signal |
| made (year built) | Similar | Similar | Weak signal |

**Key insight:** The large gap in squareMeters and price between the two categories suggests these continuous features will dominate any classification model. Age and ownership history are much less useful for distinguishing property tier.

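
As a rough illustration of the two steps described so far, the sketch below flags outliers with the IQR rule and compares per-class means. It uses only the Python standard library and a handful of made-up toy rows; the helper names (`iqr_bounds`, `flag_outliers`, `class_means`) and the toy values are assumptions for illustration, not the notebook's actual code or data.

```python
from statistics import mean, quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    # method="inclusive" uses linear interpolation between data points,
    # which matches the default percentile behaviour of numpy/pandas.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return one boolean per value: True if it lies outside the fences."""
    lower, upper = iqr_bounds(values, k)
    return [v < lower or v > upper for v in values]


def class_means(rows, feature, target="category"):
    """Group rows by the target label and average one feature per group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[target], []).append(row[feature])
    return {label: mean(vals) for label, vals in groups.items()}


# Hypothetical toy rows, NOT taken from the dataset.
rows = [
    {"squareMeters": 40, "category": "Basic"},
    {"squareMeters": 50, "category": "Basic"},
    {"squareMeters": 60, "category": "Basic"},
    {"squareMeters": 70, "category": "Luxury"},
    {"squareMeters": 80, "category": "Luxury"},
    {"squareMeters": 500, "category": "Luxury"},
]

sizes = [r["squareMeters"] for r in rows]
flags = flag_outliers(sizes)          # only the 500 m² row is flagged
means = class_means(rows, "squareMeters")

print(iqr_bounds(sizes))  # (15.0, 115.0)
print(flags)
print(means)
```

In this toy sample the flagged row is kept rather than dropped, mirroring the decision above to retain extreme values; it still pulls the Luxury mean upward, which is the kind of signal the README argues is worth preserving.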

---

## Correlation Heatmap

**Key findings:**

- squareMeters has the strongest positive correlation with Luxury classification
- price is the second strongest predictor
- Binary amenities (pool, yard, garage) each show moderate positive correlation
- made and numPrevOwners have very weak correlations, making them poor predictors

![image](https://cdn-uploads.huggingface.co/production/uploads/69d3e9fe5e7b5b0842a5c14f/B0Fe60GV-YxL-dqoLIIad.png)

---

## Multivariate Exploration: Pairplot

No single feature perfectly separates Luxury from Basic, but **combinations create very clean boundaries**. squareMeters vs price produces the clearest cluster separation with minimal overlap. numberOfRooms and floors show moderate separation. The cleaner the boundary between clusters, the easier it is for a classification model to learn the pattern.

![image](https://cdn-uploads.huggingface.co/production/uploads/69d3e9fe5e7b5b0842a5c14f/xLAsFS1JNtU8DBacjzuQK.png)

---

## Chi-Square Test + Cramér's V

| Feature | p-value | Cramér's V | Strength |
|---|---|---|---|
| hasPool | < 0.0001 | ~0.45 | Strong |
| garage | < 0.0001 | ~0.42 | Strong |
| hasYard | < 0.0001 | ~0.40 | Strong |
| attic | < 0.0001 | ~0.35 | Moderate |
| basement | < 0.0001 | ~0.33 | Moderate |
| hasStorageRoom | < 0.0001 | ~0.28 | Moderate |
| isNewBuilt | < 0.0001 | ~0.18 | Weak |
| hasGuestRoom | < 0.0001 | ~0.15 | Weak |

All eight features tested are statistically significant (p < 0.0001). Pool, garage, and yard are the strongest amenity signals.

---

## Research Questions & Answers

### Q1: What is the class distribution?

**Answer:** 50% Luxury, 50% Basic. The classes are perfectly balanced, so no resampling is needed.

**Insight:** All comparisons between categories are equally representative.

![image](https://cdn-uploads.huggingface.co/production/uploads/69d3e9fe5e7b5b0842a5c14f/rR_Mkv6lKxI4T9lXC0bRk.png)

---

### Q2: Do Luxury properties have a larger floor area?

**Answer:** Yes, dramatically.
Luxury properties are nearly double the size of Basic ones, with minimal overlap.

**Insight:** squareMeters is the single most discriminating feature in the dataset.

![image](https://cdn-uploads.huggingface.co/production/uploads/69d3e9fe5e7b5b0842a5c14f/lVGvLRXtN9HuLfR0_xqHf.png)

---

### Q3: Does having a pool or yard signal Luxury?

**Answer:** Yes. Properties with a pool or yard are overwhelmingly Luxury; those without are strongly Basic.

**Insight:** Amenities work best as a combined signal rather than individually.

![image](https://cdn-uploads.huggingface.co/production/uploads/69d3e9fe5e7b5b0842a5c14f/NvCo_wFn76CLjG6KThe2G.png)

---

### Q4: How does price differ between categories?

**Answer:** Luxury properties are significantly more expensive. Some mid-range overlap exists after log transformation.

**Insight:** Price alone cannot perfectly classify a property; it must be combined with size and amenities.

![image](https://cdn-uploads.huggingface.co/production/uploads/69d3e9fe5e7b5b0842a5c14f/Q0r93_Qx8JDes8AGh_BVw.png)

---

### Q5: Does room count relate to category?

**Answer:** Yes. Higher room counts skew toward Luxury, but the relationship is probabilistic.

**Insight:** Many rooms in a large space points to Luxury; many rooms in a small space points to subdivided budget housing.

![image](https://cdn-uploads.huggingface.co/production/uploads/69d3e9fe5e7b5b0842a5c14f/irgXBftIo0-uMM9KlnUSO.png)

---

### Q6: Does build year differ between categories?

**Answer:** No meaningful difference. Both categories span similar construction eras.

**Insight:** Paris has both historic and modern luxury, so age is not a useful signal.

![image](https://cdn-uploads.huggingface.co/production/uploads/69d3e9fe5e7b5b0842a5c14f/9DON-7Rf3HAR3Hr7iEXra.png)

---

### Q7: Does total amenity count separate categories?

**Answer:** Strongly yes. Properties with 5+ amenities are almost exclusively Luxury; those with 0–1 amenities lean Basic.
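
A minimal sketch of how an amenity count like the one in Q7 might be derived. The column list is taken from the Chi-Square table above; treating any value greater than zero as "amenity present" and the helper name `amenity_count` are assumptions for illustration, not necessarily how the notebook defines the feature.

```python
# Assumed amenity columns, taken from the Chi-Square table in this README.
AMENITY_COLUMNS = [
    "hasPool", "hasYard", "garage", "attic", "basement",
    "hasStorageRoom", "isNewBuilt", "hasGuestRoom",
]


def amenity_count(row, columns=AMENITY_COLUMNS):
    """Count how many amenity columns are present (value > 0) in one row."""
    # Assumption: a missing column or a value of 0 means "amenity absent".
    return sum(1 for col in columns if row.get(col, 0) > 0)


# Hypothetical toy rows, NOT taken from the dataset.
well_equipped = {
    "hasPool": 1, "hasYard": 1, "garage": 1, "attic": 1,
    "basement": 1, "hasStorageRoom": 0, "isNewBuilt": 0, "hasGuestRoom": 1,
}
sparse = {"hasPool": 0, "hasYard": 1}

print([amenity_count(well_equipped), amenity_count(sparse)])  # prints [6, 1]
```

Under the thresholds quoted in the answer, the first toy row (6 amenities) would fall in the "almost exclusively Luxury" range and the second (1 amenity) in the range that leans Basic.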

**Insight:** amenity_count (engineered feature) may outperform any individual binary amenity in a model.

![image](https://cdn-uploads.huggingface.co/production/uploads/69d3e9fe5e7b5b0842a5c14f/SxizhBkKwZYHvfWLUx_0z.png)

---

## Log Transformation

squareMeters and price are right-skewed. After log(1 + x):

- Distributions become symmetric and approximately normal
- Skewness drops significantly toward zero
- The features are better suited for linear models and statistical tests

![image](https://cdn-uploads.huggingface.co/production/uploads/69d3e9fe5e7b5b0842a5c14f/tSgCKrY2g3Z1FxbBWEoD1.png)

---

## Dimensionality Reduction: PCA

The 14 input features were compressed to 2 dimensions. Two **clearly separated clusters** confirm that the feature set carries strong classification signal. squareMeters and price are the dominant drivers of the Luxury vs Basic axis.

![image](https://cdn-uploads.huggingface.co/production/uploads/69d3e9fe5e7b5b0842a5c14f/i-NZDvih0mTySlNfQumsL.png)

---

## Key Decisions

| Decision | Reason |
|---|---|
| Kept all outliers | Represent genuine property variation, not data errors |
| Log-transformed price & squareMeters | Both features were heavily right-skewed |
| Engineered amenity_count | Stronger combined signal than individual binary features |
| Encoded target as 0/1 | Required for Pearson correlation analysis |
| Retained all 10,000 rows | No data quality issues found |

---

## Final Conclusions

1. **squareMeters** is the dominant predictor: Luxury properties are nearly twice the size of Basic ones
2. **price** is the second strongest signal, with some overlap in the mid-range
3. **Amenity count** is a powerful engineered feature: more amenities means a much higher probability of Luxury
4. **Build year** has no meaningful relationship with category
5. **PCA confirms** high separability, so a classification model should perform very well on this dataset
6. A combination of size + price + amenities provides the clearest classification boundary

---

## Project Files

Below is a complete list of all files used throughout this project:

---

## Dataset Files

- **ParisHousingClass.csv.numbers**: Original dataset downloaded from Kaggle
- **paris_housing_cleaned.csv**: Cleaned version of the dataset

---

## Notebook Files

- Danielle_Lachovitz_assignment_1_paris_housing.ipynb
  - Main notebook containing:
    - Data loading
    - Data cleaning
    - Target variable creation
    - Full Exploratory Data Analysis (EDA)
    - Visualizations and insights

---

## Documentation

- README.md: Project summary and final results documentation
- Presentation Video: https://youtu.be/0ZkKWVlurOg?si=Ql_6CgW5nZY-TtMO

---

Author

**Danielle Lachovitz**

Reichman University - Data Science Track 2026
Provider:
22Danielle