
ADAMlam-16/coffee-data_eda-project

Hugging Face · Updated 2026-04-21 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/ADAMlam-16/coffee-data_eda-project
Resource description:
---
license: mit
task_categories:
- tabular-classification
configs:
- config_name: default
  data_files:
  - split: train
    path: "coffee data.csv"
---

# Coffee Intake & General Health: Exploratory Data Analysis (EDA)

## 1. Objective

This project investigates whether a relationship exists between coffee intake levels and key health indicators (sleep, BMI, heart rate, and stress) using a 10,000-person synthetic health dataset sourced from Kaggle. The analysis follows a structured EDA pipeline: cleaning, transformation, segmentation, visualization, and correlation testing.

---

## 2. Dataset Overview

The dataset contains individual-level health and lifestyle measurements used to explore the relationship between coffee consumption and general health outcomes.

- **Raw size:** 10,000 rows × 16 features
- **Clean size:** ~9,400 rows × 15 features (after filtering + column removal)

### Feature Categories

| Feature | Type | Description |
|---|---|---|
| Age | Numeric | Age of the individual |
| Gender | Categorical | Male / Female |
| Country | Categorical | Country of origin |
| Coffee_Intake | Numeric | Daily coffee consumption (cups/day) |
| Caffeine_mg | Numeric | Estimated daily caffeine intake (mg) |
| Sleep_Hours | Numeric | Average nightly sleep duration |
| Sleep_Quality | Categorical | Self-reported sleep quality |
| BMI | Numeric | Body Mass Index |
| Heart_Rate | Numeric | Resting heart rate (BPM) |
| Stress_Level | Categorical | Self-reported stress level (Low / Medium / High) |
| Physical_Activity_Hours | Numeric | Weekly physical activity hours |
| Occupation | Categorical | Occupation category |
| Smoking | Numeric | Smoking indicator |
| Alcohol_Consumption | Numeric | Alcohol consumption level |
| Coffee_Drinker_Type | Categorical | Engineered group: Low / Moderate / High (quantile-based) |

#### Target Variable

**Coffee_Drinker_Type** is an engineered categorical variable created by splitting the continuous `Coffee_Intake` column into three quantile-based
groups of equal size:

- **Low**: bottom third of daily intake
- **Moderate**: middle third
- **High**: top third

This variable is used throughout as the primary grouping factor for health comparison.

---

## 3. Methodology

### 3.1 Data Cleaning

#### Column Removal

The `Health_Issues` column was removed from the dataset. Upon review, it was found to be inconsistently populated and lacked a clear, reliable definition compared to the other health metrics available. Removing it preserved data integrity without meaningful information loss.

#### Invalid Value Filtering

Exploratory checks using `df.describe()` revealed that `Sleep_Hours` contained values as low as 3 hours per night. This was identified as implausible for a population-level health dataset and inconsistent with human physiology.

**Decision:** All rows where `Sleep_Hours < 5` were removed.

```python
# Count rows before and after the filter to report how many were dropped
before = len(df)
df = df[df['Sleep_Hours'] >= 5].reset_index(drop=True)
after = len(df)
print(f"Removed {before - after} rows with Sleep_Hours < 5")
```

This filtering was applied before re-engineering the `Coffee_Drinker_Type` groups to ensure the group boundaries reflected the cleaned population, not the raw one.

#### Duplicate Check

```python
print(df.duplicated().sum())  # → 0
```

No duplicate rows were found.

### 3.2 Feature Engineering

`Coffee_Drinker_Type` was created using quantile-based binning (`pd.qcut`, q=3), which ensures equal sample sizes across groups and makes group comparisons statistically fair. Equal-width binning was evaluated but rejected because coffee intake is right-skewed, so equal-width bins would have produced very uneven groups.

```python
df['Coffee_Drinker_Type'] = pd.qcut(
    df['Coffee_Intake'],
    q=3,
    labels=['Low', 'Moderate', 'High']
)
```

### 3.3 Group Distribution

After cleaning and re-binning, the three groups are near-equal in size, confirming the quantile approach worked as intended.

![Group Distribution](חלוקה_לקבוצות_צריכה_EDA_.png)

---

## 4. Exploratory Data Analysis (EDA)

### Research Question

**Is there a measurable relationship between high coffee intake and general health outcomes?**

---

### Research Question 1

#### Does coffee intake level affect sleep duration?

**Observation:**

- A clear downward trend is visible across groups: Low intake has the highest median sleep hours, High intake has the lowest.
- The groups overlap, indicating that coffee is one of several factors, not the only driver of sleep duration.

**Correlation Analysis:**

To move beyond group comparison and test whether the relationship holds continuously, a Pearson correlation and a regression analysis were run directly on the raw `Coffee_Intake` variable:

![Sleep vs Coffee Correlation](שעות_שינה_EDA.png)

**Statistical Result:**

| Metric | Value |
|---|---|
| Pearson r | −0.17 |
| Direction | Negative |
| Strength | Weak |
| p-value | < 0.001 |

**Insight:** The negative correlation (r = −0.17, p < 0.001) is statistically significant and consistent across all three intake groups. Higher coffee consumption is weakly associated with reduced sleep duration. The relationship is real but modest; coffee intake is one contributor among several factors affecting sleep.

**Conclusion:** Coffee intake has a weak but statistically significant negative association with sleep hours. It is the clearest directional signal in the dataset.

---

### Research Question 2

#### Is BMI associated with coffee intake level?

**Observation:**

- The boxplots show nearly identical medians and IQRs across all three groups; the distributions are visually almost indistinguishable.
- Outliers appear equally across all three groups, consistent with a real-world population.
- Unlike sleep, there is no visible directional shift in BMI from Low → High intake.

![BMI Boxplot](BMI_EDA_.png)

**Insight:** The three groups show virtually no difference in BMI distribution.
This is a meaningful null finding: it suggests that coffee intake, at least in this dataset, does not meaningfully track with body weight. BMI is driven by many lifestyle variables that are not captured here.

**Conclusion:** BMI shows no meaningful association with coffee intake level. This variable does not carry a useful signal for the research question.

---

### Research Question 3

#### Does coffee intake affect resting heart rate?

**Observation:**

- The violin plots reveal that the distribution of resting heart rate shifts rightward as coffee intake increases.
- The Low intake group is more tightly centered around a lower BPM, while the High intake group shows both a higher center and a wider spread.
- The violin shape for the High group suggests greater variability: some individuals are unaffected while others show notable elevation.

![Heart Rate Violin Plot](HART_RATE_EDA_.png)

**Insight:** The High intake group shows a measurable upward shift in resting heart rate, consistent with caffeine's known stimulant effect. The widening of the distribution in the High group is analytically interesting: it may reflect individual differences in caffeine sensitivity or confounding lifestyle variables.

**Conclusion:** Coffee intake is associated with modestly elevated resting heart rate. The effect is visible in distribution shape rather than median alone, which is why a violin plot was chosen over a boxplot for this metric.

---

### Research Question 4

#### Does coffee intake influence stress level distribution?

**Observation:**

- The bar chart shows that the vast majority of respondents in all groups report Low stress (70–82%), with Medium stress accounting for 18–29%.
- High stress is reported by very few respondents overall: 0.5% (Low), 0.7% (Moderate), and 1.1% (High intake).
- Despite the small absolute numbers, the directional pattern is consistent: as coffee intake increases, the Low-stress share shrinks (81.8% → 70.3%) and the High-stress share grows (0.5% → 1.1%).

![Stress Distribution](STRESS_LEVEL_EDA_.png)

**Insight:** The stress composition shift from Low → High intake is consistent and directional. Whether this reflects a causal relationship (caffeine increasing stress) or a selection effect (already-stressed individuals consuming more coffee) cannot be determined from observational data alone. Both are plausible mechanisms.

**Conclusion:** Coffee intake is associated with a higher proportion of high-stress reporters and a lower proportion of low-stress reporters. This is one of the most visually clear findings in the dataset.

---

### Research Question 5

#### Who are the high-stress reporters, and how are they distributed across coffee groups?

**Observation:**

- Absolute counts of high-stress reporters: Low intake n=16 (0.5% of group), Moderate n=20 (0.7%), High n=31 (1.1%).
- Although the absolute numbers are small, the High intake group has nearly double the rate of high-stress reporters compared to the Low group.
- The age distributions across all three groups are broadly similar, with median ages in the low-to-mid 30s and comparable spreads.

![High Stress Deep Dive](HUGE_UNTAKE_GROUP_EDA_.png)

**Insight:** The high-stress signal in the High intake group is not concentrated in a specific age bracket; it is distributed across the full age range. This makes it less likely to be a confound driven purely by age and more consistent with a genuine relationship between intake level and reported stress.

**Conclusion:** High-intake coffee drinkers are overrepresented among high-stress reporters across all age groups, strengthening the stress finding from Research Question 4.

---

### Research Question 6

#### Within the high-stress group, is there a relationship between coffee intake and age?
**Observation:**

- The scatter plot of coffee intake vs. age within the high-stress group shows a very slight positive trend.

### Research Question 7

#### What has a bigger effect on high stress: age or coffee intake?

**Observation:**

- The chart directly compares the absolute correlation strength of Age (r = 0.008) and Coffee Intake (r = 0.041) with high-stress reporting within the high-stress group.
- Coffee Intake is roughly 5× stronger than Age as a predictor of high-stress status.
- Both effects are small in absolute terms, but the comparison clearly identifies coffee intake as the dominant factor between the two.

![Effect Size: Age vs Coffee on High Stress](CORRLATION_LEVEL_RDA_.png)

**Statistical Result:**

| Metric | Value |
|---|---|
| Age effect (absolute r) | 0.008 |
| Coffee Intake effect (absolute r) | 0.041 |
| Dominant factor | Coffee Intake (5× stronger than Age) |

**Insight:** Age contributes virtually no signal (r = 0.008) while coffee intake, though still weak in absolute terms (r = 0.041), is meaningfully larger by comparison. This rules out age as a confounding explanation for the stress pattern observed across coffee groups.

**Decision:** Age is not a meaningful confound. The stress finding is attributed to coffee intake level, not to demographic age differences between groups.

**Conclusion:** Coffee intake is a stronger predictor of high-stress reporting than age within this group. Age can be safely excluded as an alternative explanation for the stress findings.

---

## 5. Correlation Structure

| Variable | Correlation with Coffee_Intake | Direction |
|---|---|---|
| Sleep_Hours | −0.17 | Negative (weak) |
| Heart_Rate | ~+0.18 | Positive (weak) |
| BMI | ~+0.02 | Negligible |
| Stress_Score | ~+0.04 | Negligible |
| Age | ~+0.008 | Negligible |

**Key finding:** Sleep and heart rate are the two variables most associated with coffee intake in this dataset. BMI, stress score, and age show negligible linear correlations.

---

## 6. Final Conclusion

This analysis shows that high coffee intake is associated with a consistent directional pattern across health indicators: less sleep, modestly elevated heart rate, and a higher share of high-stress reporters. The effect is clearest for **sleep** (r = −0.17, p < 0.001) and **heart rate**, while BMI shows no meaningful association.

No single metric shows a dramatic signal, but the directional consistency across independent outcomes (sleep, heart rate, and stress all pointing the same way) is the most meaningful finding in the dataset.

### Key Takeaways

| Health Metric | Finding | Signal Strength |
|---|---|---|
| Sleep Hours | Negative association: High group sleeps less per night | Weak (r = −0.17) |
| Heart Rate | Positive association: High group shows elevated BPM | Weak |
| BMI | No meaningful difference across groups | Negligible |
| Stress Level | High group has highest share of high-stress reporters | Small but consistent |
| Age | No meaningful relationship with intake level | Negligible |

---

## 7. Limitations

- The dataset is **synthetic**; relationships are plausible but may not reflect real-world causal mechanisms.
- The analysis is **observational**; no causal claims can be made. Directionality (does coffee cause poor sleep, or do poor sleepers drink more coffee?) cannot be determined here.
- `Coffee_Drinker_Type` groups are constructed from quantile splits; different binning schemes may shift group boundaries slightly.
- Self-reported variables (Stress_Level, Sleep_Quality) are subject to measurement noise.
- Confounders such as occupation, smoking, and alcohol consumption were not controlled for in the univariate analyses presented here.

---

## 8. Notebook & Plots

Full analysis with code: [Google Colab Notebook](notebook.ipynb)
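
For readers without the notebook at hand, the cleaning, binning, and correlation steps described in sections 3 and 4 can be sketched end to end. This is a minimal illustration, not the notebook's code: the data below is a randomly generated stand-in (the distribution parameters and the built-in negative slope are assumptions chosen only to make the example run), so the printed numbers will not match the results reported above. Only the column names `Coffee_Intake`, `Sleep_Hours`, and `Coffee_Drinker_Type` come from this README.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for "coffee data.csv": synthetic values, NOT the real dataset.
rng = np.random.default_rng(0)
n = 1_000
coffee = rng.gamma(shape=2.0, scale=1.25, size=n)           # right-skewed cups/day (assumed shape)
sleep = 7.0 - 0.2 * coffee + rng.normal(0.0, 1.0, size=n)    # negative link built in by assumption
df = pd.DataFrame({"Coffee_Intake": coffee, "Sleep_Hours": sleep})

# Section 3.1: drop rows with Sleep_Hours < 5, then count duplicate rows.
df = df[df["Sleep_Hours"] >= 5].reset_index(drop=True)
n_duplicates = int(df.duplicated().sum())

# Section 3.2: quantile-based binning into three near-equal groups.
df["Coffee_Drinker_Type"] = pd.qcut(
    df["Coffee_Intake"], q=3, labels=["Low", "Moderate", "High"]
)
group_sizes = df["Coffee_Drinker_Type"].value_counts()

# Section 4: per-group medians, plus Pearson r on the continuous intake variable.
median_sleep = df.groupby("Coffee_Drinker_Type", observed=True)["Sleep_Hours"].median()
pearson_r = df["Coffee_Intake"].corr(df["Sleep_Hours"])  # Series.corr defaults to Pearson

print(group_sizes)
print(median_sleep)
print(f"Pearson r = {pearson_r:.2f}")
```

`Series.corr` returns only the coefficient. To also obtain a p-value like the one in the Research Question 1 table, `scipy.stats.pearsonr` could be called on the same two columns; whether the notebook does this is not stated here.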
Provider:
ADAMlam-16