
shiragraiver/nigerian-retail-sample-30k-eda

Hugging Face · Updated 2026-04-06 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/shiragraiver/nigerian-retail-sample-30k-eda
Resource description:
---
language:
- en
size_categories:
- 10K<n<100K
---

# Nigerian Retail and E-Commerce Customer Segmentation – EDA Project

## Project Overview

This project presents an Exploratory Data Analysis (EDA) of the **Nigerian Retail and E-Commerce Customer Segmentation Data** dataset. The goal of the project was to explore the structure of the data, identify important patterns, and examine whether customer total spending can be predicted using key numerical variables.

## Video Overview

<video src="https://huggingface.co/datasets/shiragraiver/nigerian-retail-sample-30k-eda/resolve/main/data_science_video.mp4" controls="controls" style="max-width: 720px;"></video>

## Dataset Description

This project uses the **Nigerian Retail and E-Commerce Customer Segmentation Data** dataset from **Hugging Face**. The original dataset contains **150,000 rows and 11 features**, but for this project I used a random sample of **30,000 rows**.

The dataset includes numerical features such as `avg_order_value_ngn`, `total_orders`, `total_spend_ngn`, `last_purchase_days_ago`, and `lifetime_value_ngn`, as well as categorical features such as `segment`, `purchase_frequency`, `churn_risk`, `preferred_category`, and `seasonal_buyer`.

## Research Question

**Can customer total spending be predicted using numerical variables such as average order value, total number of orders, and days since the last purchase?**

## Target Variable

**`total_spend_ngn`**

## EDA Summary

The dataset was checked for duplicates, missing values, outliers, skewness, and relationships between variables. No missing values were found in the sampled dataset. Outliers were found mainly in `avg_order_value_ngn` and `total_spend_ngn`, and were treated using capping. A log transformation was applied to the main spending-related variables in order to reduce skewness and improve interpretability.

## Research Questions and Answers

### 1. Is total spending related to average order value?

Yes.
The analysis showed a clear positive relationship between `avg_order_value_ngn` and `total_spend_ngn`, suggesting that customers with higher average order values also tend to have higher total spending.

### 2. Is total spending related to the total number of orders?

The relationship appears relatively weak. High and low spending values appear across almost the entire range of order counts, suggesting that `total_orders` alone is not a strong predictor of total spending.

### 3. Is total spending related to days since the last purchase?

No clear relationship was found. Spending values remain widely scattered across the full range of `last_purchase_days_ago`, suggesting that purchase recency alone is not a strong predictor of total spending.

### 4. Which numerical variable is most strongly related to total spending?

`avg_order_value_ngn` was by far the strongest numerical variable related to `total_spend_ngn`, while the other numerical variables had very weak correlations.

## Key Insights

- `avg_order_value_ngn` is the most informative numerical variable for understanding customer total spending.
- `total_orders` and `last_purchase_days_ago` are much less informative on their own.
- Spending-related variables contained outliers and skewness, so capping and log transformation were important preprocessing decisions.
- The sampled dataset is complete and contains no missing values.

## Main Decisions

- Used a random sample of **30,000 rows**
- Kept all rows and treated outliers using **capping**
- Applied **log transformation** to skewed spending variables
- Focused the analysis mainly on numerical features relevant to predicting `total_spend_ngn`

## Limitations

- The dataset is **synthetic**, so it may not fully reflect real-world customer behavior.
- The analysis is exploratory and does not include a predictive model yet.
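
The capping and log-transform steps listed under Main Decisions can be sketched in a few lines. This is a minimal illustration, not the notebook's code: the README does not say which capping rule was used, so the 1.5×IQR fences, the `log1p` variant of the log transform, the function names, and the toy values are all assumptions. It uses only the Python standard library so it runs without extra dependencies.

```python
import math
import statistics


def cap_iqr(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR].

    The IQR fence is an assumed capping rule; the README only says "capping".
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


def log1p_all(values):
    """Apply log(1 + x), which is defined at zero, to reduce right skew."""
    return [math.log1p(v) for v in values]


# Toy order values in NGN with one extreme outlier (hypothetical numbers,
# not taken from the dataset).
avg_order_value = [4000, 5200, 4800, 6100, 5500, 4700, 5300, 5900, 5000, 250000]

capped = cap_iqr(avg_order_value)      # all rows kept, extreme value pulled in
transformed = log1p_all(capped)        # same length, compressed scale

print("max before capping:", max(avg_order_value))
print("max after capping: ", max(capped))
```

Capping keeps every row, which matches the decision to retain all 30,000 sampled records, while the log transform is applied afterwards so that the remaining right skew is compressed rather than truncated.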
## Next Steps

A possible next step would be to build a **regression model** to test how accurately customer total spending can be predicted using the main numerical variables.

## Files Included

- `dataset_sample_30k.csv` – sampled dataset used in this project
- `eda_notebook.ipynb` – full EDA notebook
- `README.md` – project summary
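
As a starting point for the regression step proposed under Next Steps, the sketch below fits a one-variable least-squares line on log-transformed values and reports R². It is a baseline under stated assumptions, not the project's model: choosing `avg_order_value_ngn` as the single predictor follows the EDA findings, but the helper names (`load_columns`, `fit_line`, `r_squared`) and the toy numbers are hypothetical, and `load_columns` assumes the CSV header uses the column names quoted in this README.

```python
import csv
import math


def load_columns(path, x_col, y_col):
    """Read two numeric columns from a CSV file (column names are assumed)."""
    xs, ys = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            xs.append(float(row[x_col]))
            ys.append(float(row[y_col]))
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Toy values standing in for the real columns (hypothetical, not from the dataset).
avg_order_value = [4000.0, 5200.0, 6100.0, 8000.0, 12000.0]
total_spend = [21000.0, 30000.0, 33000.0, 52000.0, 70000.0]

log_x = [math.log1p(v) for v in avg_order_value]
log_y = [math.log1p(v) for v in total_spend]
slope, intercept = fit_line(log_x, log_y)
r2 = r_squared(log_x, log_y, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r2:.3f}")
```

To run it on the sampled file, the toy lists could be replaced with `load_columns("dataset_sample_30k.csv", "avg_order_value_ngn", "total_spend_ngn")`. A held-out test split would be needed before reading R² as predictive accuracy.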
Provider:
shiragraiver