
Yoad22/craigslist-used-cars-eda

Hugging Face · Updated 2026-04-26 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/Yoad22/craigslist-used-cars-eda
Description:
---
license: cc-by-4.0
tags:
- tabular
- eda
- used-cars
- craigslist
- price-prediction
configs:
- config_name: default
  data_files:
  - path: vehicles_clean.csv
    split: train
---

# Craigslist Used Cars and Trucks: EDA

<video src="https://huggingface.co/datasets/Yoad22/craigslist-used-cars-eda/resolve/main/EDA_video.mp4" controls="controls" style="max-width: 720px;"></video>

## Overview

This dataset and notebook contain an **Exploratory Data Analysis (EDA)** of real Craigslist used-car listings scraped across the United States.

**Main Question:** *What factors most influence the price of a used car listed on Craigslist?*

**Target Variable:** `price`, the seller's asking price for each vehicle listing.

### About the Dataset

| Property | Details |
|---|---|
| **Source** | [Kaggle (Austin Reese)](https://www.kaggle.com/datasets/austinreese/craigslist-carstrucks-data) (scraped from Craigslist) |
| **Original Size** | ~426,000 rows × 26 columns |
| **Content** | Used car and truck listings across the United States |
| **Target Variable** | `price`, the asking price of the vehicle |

**Features include:** `year`, `manufacturer`, `condition`, `cylinders`, `fuel`, `odometer`, `transmission`, `drive`, `type`, `paint_color`, `state`, and more.

## Repository Contents

| File | Description |
|---|---|
| `vehicles_clean.csv` | Cleaned dataset after all preprocessing steps |
| `Craigslist_Used_Cars_EDA_Final.ipynb` | Full EDA notebook (Google Colab) |
| `README.md` | This file |
| `presentation.mp4` | 2-3 minute video walkthrough |

## Part 1: Data Cleaning

Raw Craigslist data required substantial cleaning. Rather than a single blanket strategy, we matched each column to the approach that fits it best.
### Columns Dropped

The following columns were removed as they add no value to price analysis:

| Column | Reason |
|---|---|
| `url`, `region_url`, `image_url` | Links, not useful for analysis |
| `description` | Free-text field, too complex to analyze in this EDA |
| `VIN` | Unique per car, no predictive signal |
| `county` | Almost entirely empty |
| `posting_date` | Not in scope for this analysis |
| `region` | Replaced by `state` for geographic analysis |
| `lat`, `long` | Geographic coordinates, redundant with `state` |

Two columns (`id` and `model`) were kept temporarily through the relevant cleaning steps and dropped only once their job was done: `id` protected against false duplicates during deduplication, and `model` enabled the manufacturer recovery described below.

### Missing Values Strategy

| Strategy | Applied To |
|---|---|
| **Drop column** (>50% missing) | Columns with more than half their values missing |
| **Drop row** | Rows missing `price`, `year`, or `odometer` |
| **Smart Manufacturer Recovery** | Use `model` to recover missing manufacturers before falling back to `unknown` |
| **Smart Cylinder Recovery** | Fill missing cylinder values by (manufacturer, model) first, then (manufacturer, year), then manufacturer, then median |
| **Generic median fallback** | Safety net for any remaining numeric NaNs |

### Smart Manufacturer Recovery

Filling every missing manufacturer with `unknown` would create a large fake brand that pollutes later analysis. Instead, we build a lookup from rows where both `model` and `manufacturer` are present, then use it to recover manufacturers when only the model is known. For example, a listing with `model = civic` becomes a Honda; `model = f-150` becomes a Ford. Only rows missing both fields fall back to `unknown`.

The `model` column is kept alive for now because it is also used in the Smart Cylinder Recovery step below.
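
The manufacturer lookup described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the notebook's actual code: the helper names `build_model_lookup` and `recover_manufacturers`, the sample rows, and the choice to resolve ambiguous models by their most frequent manufacturer are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_model_lookup(rows):
    """Map each model to the manufacturer it most often appears with.

    Only rows where both fields are present contribute to the lookup.
    Picking the most frequent manufacturer is an assumption of this sketch.
    """
    counts = defaultdict(Counter)
    for row in rows:
        model, maker = row.get("model"), row.get("manufacturer")
        if model and maker:
            counts[model][maker] += 1
    return {model: c.most_common(1)[0][0] for model, c in counts.items()}


def recover_manufacturers(rows):
    """Return copies of the rows with missing manufacturers filled in."""
    lookup = build_model_lookup(rows)
    recovered = []
    for row in rows:
        fixed = dict(row)
        if not fixed.get("manufacturer"):
            # Fall back to "unknown" when the model is missing or unseen.
            fixed["manufacturer"] = lookup.get(fixed.get("model"), "unknown")
        recovered.append(fixed)
    return recovered


# Hypothetical sample rows, for illustration only.
listings = [
    {"model": "civic", "manufacturer": "honda"},
    {"model": "civic", "manufacturer": "honda"},
    {"model": "f-150", "manufacturer": "ford"},
    {"model": "civic", "manufacturer": None},
    {"model": "f-150", "manufacturer": None},
    {"model": None, "manufacturer": None},
]
recovered = recover_manufacturers(listings)
print([row["manufacturer"] for row in recovered])
```

In this sample, the two rows that only know their model are recovered as `honda` and `ford`, and the row missing both fields falls back to `unknown`.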
### Smart Cylinder Recovery

The cylinders column is filled in four passes, from most specific to least specific:

1. Most common cylinder count for each `(manufacturer, model)` pair. This is the most reliable match, since a given model is very consistent on cylinders regardless of year.
2. Most common cylinder count for each `(manufacturer, year)` pair, for rows still missing.
3. Most common cylinder count for the manufacturer alone, as a further fallback.
4. Overall median, as a final safety net.

This hierarchy produces much more realistic values than a single dataset-wide median. After this step, the `model` column has finished its job (helping recover both manufacturer and cylinders) and is dropped.

### Unrealistic Value Filters

| Column | Filter | Reasoning |
|---|---|---|
| `price` | 500 to 150,000 dollars | Removes free and erroneous listings while preserving legitimate budget and luxury markets |
| `year` | 1990 to 2026 | Realistic range of used cars in active circulation |
| `odometer` | Less than 400,000 miles | Above 400K is almost certainly a data entry error |

### Smart Duplicate Detection

Once `id` and `VIN` are removed, a default duplicate check could incorrectly merge two genuinely different listings. We therefore keep `id` through the dedup step and identify re-posts by matching across nine content fields: `price`, `year`, `manufacturer`, `odometer`, `state`, `condition`, `cylinders`, `fuel`, `paint_color`. Matching all nine by coincidence is implausible, so we treat such rows as the same car posted more than once. `id` is dropped right after this step.

### Outlier Detection

After the unrealistic-value filter, we apply IQR-based outlier removal to `price`:

- Lower bound = Q1 - 1.5 × IQR
- Upper bound = Q3 + 1.5 × IQR

Before/after box plots confirm the distribution becomes much cleaner.
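
The IQR bounds above can be computed as in the following sketch. It uses only the Python standard library; the function name `iqr_bounds`, the sample prices, and the linear-interpolation quartile method are assumptions of this example, and the notebook may compute quartiles slightly differently.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier bounds using the k * IQR rule."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical prices in dollars, for illustration only.
prices = [4500, 6000, 7500, 9000, 12000, 15000, 18000, 21000, 95000]
lower, upper = iqr_bounds(prices)
kept = [p for p in prices if lower <= p <= upper]
print(lower, upper, kept)
```

For this sample, Q1 is 7,500 and Q3 is 18,000, so the bounds are -8,250 and 33,750, and only the 95,000 listing is removed. A negative lower bound is harmless here because the unrealistic-value filter has already enforced a 500-dollar minimum.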
### Feature Engineering

A new column `car_age` was created from the `year` column:

```
car_age = 2026 - year
```

## Part 2: Research Questions and Visualizations

Guiding question: **What factors determine the price of a used car on Craigslist?**

### Question 1: What does the price distribution look like?

![Price Distribution](https://huggingface.co/datasets/Yoad22/craigslist-used-cars-eda/resolve/main/price_distribution.png)

**Insight:** The distribution is **right-skewed**. Most cars are priced at lower values, but a long tail of more expensive vehicles pulls the mean above the median. This is typical of used car markets, where a few luxury or collector cars co-exist with many affordable listings.

### Question 2: Which manufacturers are most commonly listed?

![Most Common Manufacturers](https://huggingface.co/datasets/Yoad22/craigslist-used-cars-eda/resolve/main/most_common_manufacturers.png)

**Insight:** American brands (**Ford, Chevrolet, GMC, Dodge**) dominate Craigslist listings. This reflects their popularity in the US market and the sheer volume of American used cars in circulation.

### Question 3: Which manufacturers have the highest average prices?

![Average Price by Manufacturer](https://huggingface.co/datasets/Yoad22/craigslist-used-cars-eda/resolve/main/avg_price_by_manufacturer.png)

**Insight:** Even among the most-listed brands, there is a meaningful spread in average price. Brands like **GMC and Ram** tend to skew higher, largely due to trucks, while others cluster at lower price points. Brand alone is already a useful signal of expected price range.

### Question 4: How does vehicle condition affect price?

![Price by Vehicle Condition](https://huggingface.co/datasets/Yoad22/craigslist-used-cars-eda/resolve/main/price_dist_by_vehicle_cond.png)

Note: the `unknown` category is excluded from this plot, since it dominates the count as a fallback fill for missing conditions and would obscure the pattern across real condition values.

**Insight:** Condition is one of the **strongest price signals** in the dataset. "New" and "like new" vehicles command the highest prices, while "salvage" cars are the cheapest. Salvage cars have typically been declared a total loss by an insurer, often after an accident, which significantly reduces their market value.

### Question 5: Is there a relationship between odometer reading and price?

![Odometer vs Price](https://huggingface.co/datasets/Yoad22/craigslist-used-cars-eda/resolve/main/odometer_vs_price.png)

**Insight:** There is a clear **negative correlation** between odometer and price. More miles means lower price, exactly as expected from the used-car market. The scatter plot also reveals high variance at low odometer readings, meaning newer low-mileage cars vary much more widely in price than high-mileage ones.

### Question 6: Does fuel type influence price?

![Listings by Fuel Type](https://huggingface.co/datasets/Yoad22/craigslist-used-cars-eda/resolve/main/listings_by_fueltype.png)

![Average Price by Fuel Type](https://huggingface.co/datasets/Yoad22/craigslist-used-cars-eda/resolve/main/avg_price_by_fueltype.png)

**Insight:** Gas cars dominate in quantity, but diesel and electric vehicles carry clear price premiums. Diesel engines are mostly found in trucks and commercial vehicles, which ties in with the drive-type finding (Question 8) that 4WD vehicles are the priciest. Electric vehicles belong to a newer and generally higher-trim market segment, which explains their elevated average price.

### Question 7: How does car age relate to price?

![Average Price by Car Age](https://huggingface.co/datasets/Yoad22/craigslist-used-cars-eda/resolve/main/avg_price_by_age.png)

**Insight:** There is a clear **downward trend**. As cars get older, their average price falls. There are interesting bumps at very high ages (30 to 35 years): these are **classic cars**, which can spike in price, revealing a small but real collector-car market even on Craigslist.

### Question 8: Does drive type affect price?
![Price by Drive Type](https://huggingface.co/datasets/Yoad22/craigslist-used-cars-eda/resolve/main/price_distribution_by_drivetype.png)

**Insight:** **4WD vehicles have the highest median price**, followed by RWD, then FWD. 4WD is common in trucks and SUVs, and RWD is typical in luxury and sports cars. FWD sedans and hatchbacks dominate the lower price range.

### Question 9: What do the numeric correlations look like overall?

![Correlation Heatmap](https://huggingface.co/datasets/Yoad22/craigslist-used-cars-eda/resolve/main/corr_heatmap.png)

**Key Observations:**

| Pair | Correlation | Meaning |
|---|---|---|
| `year` and `price` | Positive | Newer model year means higher price |
| `car_age` and `price` | Negative | Older car means lower price |
| `odometer` and `price` | Negative | More miles means lower price |
| `cylinders` and `price` | Positive | More cylinders means bigger engine and higher price |
| `year` and `car_age` | Strong Negative (near -1) | Expected, since `car_age` is computed directly from `year` (`2026 - year`) |

All correlations match domain intuition, which gives confidence that the cleaned data is solid.

## Summary of Findings

| Factor | Key Finding |
|---|---|
| **Price Distribution** | Right-skewed. Most cars are affordable; a few luxury cars inflate the mean |
| **Top Manufacturers** | Ford and Chevrolet dominate listings; GMC and Ram command the highest average prices |
| **Vehicle Condition** | One of the strongest signals. New and like-new cars cost significantly more |
| **Odometer** | Clear negative correlation. More miles means lower price |
| **Fuel Type** | Diesel and electric vehicles are priced higher on average |
| **Car Age** | Strong negative relationship. Older cars are cheaper, with rare classic car price spikes |
| **Drive Type** | 4WD vehicles are the most expensive on average |
| **Correlations** | `year`, `odometer`, `car_age`, and `cylinders` all correlate meaningfully with price |

**Conclusion:** Used car pricing on Craigslist is driven by a combination of factors. Condition, age, mileage, and drivetrain are the strongest individual signals.

## Used Car Price Calculator

As a practical application of the EDA findings, the notebook includes a simple price calculator. Given five inputs (manufacturer, year, odometer, condition, drive type), the calculator finds similar listings in the cleaned dataset and returns an estimated price, a typical price range, and a qualitative confidence level (High, Medium, or Low) based on how many similar cars were found and how consistent their prices are.
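
A calculator of this kind could look roughly like the sketch below. The card does not specify how "similar" is defined or where the confidence cut-offs lie, so the year and odometer windows, the median estimate, the interquartile range as the "typical range", the thresholds of 30 and 10 matches, and the function name `estimate_price` are all assumptions chosen for illustration.

```python
import statistics


def estimate_price(listings, manufacturer, year, odometer, condition, drive,
                   year_window=2, odometer_window=25_000):
    """Estimate a price from similar listings (all thresholds are assumed)."""
    similar = [
        row["price"] for row in listings
        if row["manufacturer"] == manufacturer
        and row["condition"] == condition
        and row["drive"] == drive
        and abs(row["year"] - year) <= year_window
        and abs(row["odometer"] - odometer) <= odometer_window
    ]
    if not similar:
        return None  # no comparable listings found

    estimate = statistics.median(similar)
    if len(similar) >= 2:
        low, _, high = statistics.quantiles(similar, n=4, method="inclusive")
    else:
        low = high = estimate

    # Relative spread of the middle 50% of prices; smaller means more consistent.
    spread = (high - low) / estimate if estimate else float("inf")
    if len(similar) >= 30 and spread <= 0.25:
        confidence = "High"
    elif len(similar) >= 10 and spread <= 0.50:
        confidence = "Medium"
    else:
        confidence = "Low"
    return {"estimate": estimate, "low": low, "high": high,
            "matches": len(similar), "confidence": confidence}


# Hypothetical cleaned rows, for illustration only.
sample = [
    {"manufacturer": "ford", "year": 2015, "odometer": 80_000, "condition": "good", "drive": "4wd", "price": 18_000},
    {"manufacturer": "ford", "year": 2016, "odometer": 90_000, "condition": "good", "drive": "4wd", "price": 20_000},
    {"manufacturer": "ford", "year": 2014, "odometer": 70_000, "condition": "good", "drive": "4wd", "price": 16_000},
    {"manufacturer": "ford", "year": 2015, "odometer": 85_000, "condition": "good", "drive": "fwd", "price": 9_000},
    {"manufacturer": "honda", "year": 2015, "odometer": 80_000, "condition": "good", "drive": "fwd", "price": 11_000},
]
result = estimate_price(sample, "ford", 2015, 82_000, "good", "4wd")
print(result)
```

With only three matching listings in this tiny sample, the sketch reports a Low confidence level even though the prices are tightly clustered, which reflects the idea that both the number of matches and their consistency should feed the rating.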
Provided by:
Yoad22