Transformed Customer Shopping Dataset with Advanced Feature Engineering and Anonymization
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://data.mendeley.com/datasets/fnhyc6drm8
Description:
This dataset is a transformed and enriched version of a publicly available customer shopping dataset. It has been cleaned, anonymized, and extended with engineered features, making it suitable for advanced analytics, machine learning, and business research applications.
The transformation process focused on creating a high-quality dataset that supports robust customer behavior analysis, segmentation, and anomaly detection, while maintaining strict privacy through anonymization and data validation.
➡ Data Cleaning and Preprocessing :
Duplicates were removed. Missing numerical values (Age, Purchase Amount, Review Rating) were filled with column medians, and missing categorical values were labeled “Unknown.” Text fields were cleaned and standardized, and numeric fields were clipped to valid ranges.
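The cleaning steps above could be sketched roughly as follows with pandas. This is an illustration only, not the authors' pipeline: the `Category` column, the clipping bounds for Review Rating and Purchase Amount, and the title-casing rule are assumptions, since the description does not state them.

```python
import pandas as pd

NUMERIC_COLS = ["Age", "Purchase Amount", "Review Rating"]
CATEGORICAL_COLS = ["Category"]  # assumed; the source does not list these columns


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, impute, standardize text, and clip (illustrative sketch)."""
    out = df.drop_duplicates().copy()
    # Fill numeric gaps with the column median.
    for col in NUMERIC_COLS:
        out[col] = out[col].fillna(out[col].median())
    # Label categorical gaps "Unknown", then trim whitespace and normalize case.
    for col in CATEGORICAL_COLS:
        out[col] = out[col].fillna("Unknown").str.strip().str.title()
    # Clip to valid ranges; the rating and amount bounds are assumptions.
    out["Age"] = out["Age"].clip(0, 100)
    out["Review Rating"] = out["Review Rating"].clip(1, 5)
    out["Purchase Amount"] = out["Purchase Amount"].clip(lower=0)
    return out


raw = pd.DataFrame({
    "Age": [25, 25, None, 130],
    "Purchase Amount": [50.0, 50.0, None, 80.0],
    "Review Rating": [4.0, 4.0, 6.0, None],
    "Category": [" clothing", " clothing", None, "FOOTWEAR "],
})
cleaned = clean(raw)
```

In this sketch the medians are computed after deduplication and before clipping, so an out-of-range value such as an age of 130 still influences the imputed median. Whether the original processing used the same order is not stated.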
➡ Feature Engineering :
New informative variables were engineered to augment the dataset’s analytical power. These include:
• Avg_Amount_Per_Purchase: Average purchase amount calculated by dividing total purchase value by the number of previous purchases, capturing spending behavior per transaction.
• Age_Group: Categorical age segmentation into meaningful bins such as Teen, Young Adult, Adult, Senior, and Elder.
• Purchase_Frequency_Score: Quantitative mapping of purchase frequency to annualized values to facilitate numerical analysis.
• Discount_Impact: Monetary quantification of discount application effects on purchases.
• Processing_Date: Timestamp indicating the dataset transformation date for provenance tracking.
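A minimal sketch of how several of the features listed above might be derived with pandas. The input column names `Previous Purchases` and `Frequency of Purchases`, the frequency labels, and the age bin edges are all assumptions chosen for illustration. Discount_Impact is left out because the description does not give its formula.

```python
import pandas as pd

# Assumed mapping from frequency labels to purchases per year.
FREQ_PER_YEAR = {"Weekly": 52, "Fortnightly": 26, "Monthly": 12,
                 "Quarterly": 4, "Annually": 1}
# Assumed bin edges; the source names the groups but not their boundaries.
AGE_BINS = [0, 19, 29, 49, 64, 100]
AGE_LABELS = ["Teen", "Young Adult", "Adult", "Senior", "Elder"]


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Avoid division by zero: customers with no previous purchases get NaN.
    prev = out["Previous Purchases"].where(out["Previous Purchases"] > 0)
    out["Avg_Amount_Per_Purchase"] = out["Purchase Amount"] / prev
    out["Age_Group"] = pd.cut(out["Age"], bins=AGE_BINS, labels=AGE_LABELS,
                              include_lowest=True)
    out["Purchase_Frequency_Score"] = out["Frequency of Purchases"].map(FREQ_PER_YEAR)
    # Date on which the transformation ran, kept for provenance.
    out["Processing_Date"] = pd.Timestamp.now().normalize()
    return out


sample = pd.DataFrame({
    "Age": [18, 40, 70],
    "Purchase Amount": [60.0, 90.0, 30.0],
    "Previous Purchases": [3, 0, 10],
    "Frequency of Purchases": ["Weekly", "Monthly", "Annually"],
})
featured = add_features(sample)
```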
➡ Data Filtering :
Rows with ages outside 0–100 were removed. Only core categories (Clothing, Footwear, Outerwear, Accessories) and the top 25% of high-value customers by purchase amount were retained for focused analysis.
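The filtering rules can be illustrated as below. The sketch assumes the 75th-percentile threshold is computed per row on `Purchase Amount` after the age and category filters have been applied. The description does not say whether the original ranking was per row or per customer, or in which order the filters ran.

```python
import pandas as pd

CORE_CATEGORIES = ["Clothing", "Footwear", "Outerwear", "Accessories"]


def filter_rows(df: pd.DataFrame) -> pd.DataFrame:
    out = df[df["Age"].between(0, 100)]
    out = out[out["Category"].isin(CORE_CATEGORIES)]
    # Keep rows at or above the 75th percentile of purchase amount (top 25%).
    threshold = out["Purchase Amount"].quantile(0.75)
    return out[out["Purchase Amount"] >= threshold]


sample = pd.DataFrame({
    "Age": [30, 45, 120, 22, 60, 35],
    "Category": ["Clothing", "Footwear", "Clothing",
                 "Electronics", "Outerwear", "Accessories"],
    "Purchase Amount": [20, 40, 500, 300, 60, 80],
})
kept = filter_rows(sample)
```

Because the threshold is taken from the already filtered rows, the retained share depends on filter order. Computing it on the full table first would generally keep a different set of rows.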
➡ Data Transformation :
Key numeric features were standardized, and log transformations were applied to skewed data to improve model performance.
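As a rough illustration of these two transformations, the sketch below applies a z-score standardization and a `log1p` transform. Which columns were treated as "key" or "skewed" is not stated, so the column choices and the `_z` and `_log` suffixes are hypothetical.

```python
import numpy as np
import pandas as pd


def transform(df: pd.DataFrame, standardize_cols, log_cols) -> pd.DataFrame:
    out = df.copy()
    for col in log_cols:
        # log1p handles zeros safely: log(1 + x).
        out[f"{col}_log"] = np.log1p(out[col])
    for col in standardize_cols:
        std = out[col].std(ddof=0)
        if std > 0:
            out[f"{col}_z"] = (out[col] - out[col].mean()) / std
        else:
            # A constant column carries no spread; map it to zeros.
            out[f"{col}_z"] = 0.0
    return out


sample = pd.DataFrame({"Age": [20, 30, 40],
                       "Purchase Amount": [0.0, 9.0, 99.0]})
transformed = transform(sample, standardize_cols=["Age"],
                        log_cols=["Purchase Amount"])
```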
➡ Advanced Features :
Created a category-wise average purchase and a loyalty score combining purchase frequency and volume.
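One way such features might be computed is sketched below. The loyalty formula here (annualized frequency multiplied by number of previous purchases) is purely a hypothetical stand-in, because the description only says the score combines frequency and volume. The output column names and `Previous Purchases` are likewise assumptions.

```python
import pandas as pd


def add_advanced(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Broadcast each category's mean purchase amount back onto its rows.
    out["Category_Avg_Purchase"] = (
        out.groupby("Category")["Purchase Amount"].transform("mean")
    )
    # Hypothetical loyalty score: frequency times volume.
    out["Loyalty_Score"] = out["Purchase_Frequency_Score"] * out["Previous Purchases"]
    return out


sample = pd.DataFrame({
    "Category": ["Clothing", "Clothing", "Footwear"],
    "Purchase Amount": [40.0, 60.0, 90.0],
    "Purchase_Frequency_Score": [12, 52, 4],
    "Previous Purchases": [5, 2, 10],
})
advanced = add_advanced(sample)
```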
➡ Segmentation & Anomaly Detection :
Used KMeans to cluster customers into four groups and Isolation Forest to flag anomalies.
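A minimal sketch of this step using scikit-learn on synthetic data. The four clusters come from the description. The two input features, the `contamination` rate, and the random seeds are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (age, purchase amount): four loose groups of 50 rows.
rng = np.random.default_rng(0)
centers = np.array([[20, 30], [40, 60], [60, 30], [40, 90]])
X = np.vstack([c + rng.normal(scale=2.0, size=(50, 2)) for c in centers])

# Scale first so neither feature dominates the distance computation.
X_scaled = StandardScaler().fit_transform(X)

# Four customer segments, as stated in the description.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X_scaled)

# Isolation Forest returns -1 for anomalies and 1 for normal points.
iso = IsolationForest(contamination=0.05, random_state=0)
flags = iso.fit_predict(X_scaled)
is_anomaly = flags == -1
```

The cluster labels and anomaly flags could then be attached to the table as extra columns. The description does not name the columns used in the published dataset.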
➡ Text Processing :
Cleaned text fields and added a binary indicator for clothing items.
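For illustration, a clothing indicator could be derived like this. The sketch assumes the flag is based on the `Category` column and that cleaning means trimming whitespace and lowercasing. The output name `Is_Clothing` is hypothetical.

```python
import pandas as pd


def add_clothing_flag(df: pd.DataFrame, text_col: str = "Category") -> pd.DataFrame:
    out = df.copy()
    # Normalize the text: fill gaps, trim whitespace, lowercase.
    cleaned = out[text_col].fillna("Unknown").str.strip().str.lower()
    out[text_col] = cleaned
    # 1 if the row is a clothing item, else 0.
    out["Is_Clothing"] = (cleaned == "clothing").astype(int)
    return out


sample = pd.DataFrame({"Category": [" Clothing ", "FOOTWEAR", None, "clothing"]})
flagged = add_clothing_flag(sample)
```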
➡ Privacy :
Hashed Customer ID and removed sensitive columns like Location to ensure privacy.
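The anonymization step might look like the sketch below. The hash algorithm and the use of a salt are assumptions: the description says only that the ID was hashed. The salt value shown is a placeholder.

```python
import hashlib

import pandas as pd


def anonymize(df: pd.DataFrame, salt: str, id_col: str = "Customer ID",
              drop_cols=("Location",)) -> pd.DataFrame:
    # Drop sensitive columns that are present; drop() returns a new frame.
    out = df.drop(columns=[c for c in drop_cols if c in df.columns])
    # Replace each ID with a salted SHA-256 digest (assumed scheme).
    out[id_col] = out[id_col].astype(str).map(
        lambda v: hashlib.sha256((salt + v).encode("utf-8")).hexdigest()
    )
    return out


sample = pd.DataFrame({
    "Customer ID": [101, 102, 101],
    "Location": ["Ohio", "Texas", "Ohio"],
    "Purchase Amount": [20.0, 35.0, 50.0],
})
anon = anonymize(sample, salt="example-salt")  # placeholder salt
```

A secret salt matters when IDs are short or sequential, since unsalted hashes of such values can be recovered by enumerating the ID space. The same customer still maps to the same digest, so per-customer analysis remains possible.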
➡ Validation :
Automated checks verified data integrity, including tests for negative values and out-of-range entries.
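Checks of this kind could be expressed as a small function that returns a list of problems. The specific rules below (Age in 0 to 100, Review Rating in 1 to 5, no negative amounts, no duplicate rows) are assumptions based on the ranges mentioned elsewhere in this description, not the authors' actual test suite.

```python
import pandas as pd


def validate(df: pd.DataFrame) -> list[str]:
    """Return human-readable integrity problems; an empty list means all passed."""
    problems = []
    if (df["Purchase Amount"] < 0).any():
        problems.append("negative Purchase Amount")
    # between() is False for missing values, so gaps are reported too.
    if not df["Age"].between(0, 100).all():
        problems.append("Age outside 0-100")
    if not df["Review Rating"].between(1, 5).all():
        problems.append("Review Rating outside 1-5")
    if df.duplicated().any():
        problems.append("duplicate rows")
    return problems


good = pd.DataFrame({"Age": [30, 45], "Purchase Amount": [20.0, 55.0],
                     "Review Rating": [3.5, 4.8]})
bad = pd.DataFrame({"Age": [30, 140], "Purchase Amount": [-5.0, 55.0],
                    "Review Rating": [3.5, 4.8]})
good_report = validate(good)
bad_report = validate(bad)
```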
This transformed dataset supports a wide range of research and practical applications, including customer segmentation, purchase behavior modeling, marketing strategy development, fraud detection, and machine learning education. It serves as a reliable and privacy-aware resource for academics, data scientists, and business analysts.
Created:
2025-07-21



