ADAMlam-16/coffee-data_eda-project
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/ADAMlam-16/coffee-data_eda-project
Resource description:
---
license: mit
task_categories:
- tabular-classification
configs:
- config_name: default
  data_files:
  - split: train
    path: "coffee data.csv"
---
# Coffee Intake & General Health — Exploratory Data Analysis (EDA)
## 1. Objective
This project investigates whether a relationship exists between coffee intake levels and key health indicators — sleep, BMI, heart rate, and stress — using a 10,000-person synthetic health dataset sourced from Kaggle. The analysis follows a structured EDA pipeline: cleaning, transformation, segmentation, visualization, and correlation testing.
---
## 2. Dataset Overview
The dataset contains individual-level health and lifestyle measurements used to explore the relationship between coffee consumption and general health outcomes.
- **Raw size:** 10,000 rows × 16 features
- **Clean size:** ~9,400 rows × 15 features (after filtering + column removal)
### Feature Categories
| Feature | Type | Description |
|---|---|---|
| Age | Numeric | Age of the individual |
| Gender | Categorical | Male / Female |
| Country | Categorical | Country of origin |
| Coffee_Intake | Numeric | Daily coffee consumption (cups/day) |
| Caffeine_mg | Numeric | Estimated daily caffeine intake (mg) |
| Sleep_Hours | Numeric | Average nightly sleep duration |
| Sleep_Quality | Categorical | Self-reported sleep quality |
| BMI | Numeric | Body Mass Index |
| Heart_Rate | Numeric | Resting heart rate (BPM) |
| Stress_Level | Categorical | Self-reported stress level (Low / Medium / High) |
| Physical_Activity_Hours | Numeric | Weekly physical activity hours |
| Occupation | Categorical | Occupation category |
| Smoking | Numeric | Smoking indicator |
| Alcohol_Consumption | Numeric | Alcohol consumption level |
| Coffee_Drinker_Type | Categorical | Engineered group: Low / Moderate / High (quantile-based) |
#### Target Variable
**Coffee_Drinker_Type**: an engineered categorical variable created by splitting the continuous `Coffee_Intake` column into three quantile-based groups of approximately equal size:
- **Low** — bottom third of daily intake
- **Moderate** — middle third
- **High** — top third
This variable is used throughout as the primary grouping factor for health comparison.
---
## 3. Methodology
### 3.1 Data Cleaning
#### Column Removal
The `Health_Issues` column was removed from the dataset. Upon review, it was found to be inconsistently populated and lacked a clear, reliable definition compared to the other health metrics available. Removing it preserved data integrity without meaningful information loss.
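A minimal sketch of the removal step, assuming a pandas DataFrame. The toy frame and its values are invented for illustration; only the column name `Health_Issues` comes from the text above.

```python
import pandas as pd

# Toy frame standing in for the raw data; the values are invented for illustration.
raw = pd.DataFrame({
    "Coffee_Intake": [1.0, 2.5, 4.0],
    "Sleep_Hours": [7.5, 6.8, 5.9],
    "Health_Issues": [None, "Mild", None],
})

# Share of missing entries per column: a quick look at how sparsely a column is filled.
missing_share = raw.isna().mean()

# Drop the column; errors="ignore" keeps the call safe if it was already removed.
df = raw.drop(columns=["Health_Issues"], errors="ignore")
print(missing_share["Health_Issues"], list(df.columns))
```

Checking the missing share before dropping leaves a record of why the column was considered unreliable.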
#### Invalid Value Filtering
Exploratory checks using `df.describe()` revealed that `Sleep_Hours` contained values as low as 3 hours per night. Such values were judged implausible as a sustained nightly average in a population-level health dataset.
**Decision:** All rows where `Sleep_Hours < 5` were removed.
```python
before = len(df)
df = df[df['Sleep_Hours'] >= 5].reset_index(drop=True)
after = len(df)
print(f"Removed {before - after} rows with Sleep_Hours < 5")
```
This filtering was applied before re-engineering the `Coffee_Drinker_Type` groups to ensure the group boundaries reflected the cleaned population, not the raw one.
#### Duplicate Check
```python
print(df.duplicated().sum()) # → 0
```
No duplicate rows were found.
### 3.2 Feature Engineering
`Coffee_Drinker_Type` was created using quantile-based binning (`pd.qcut`, q=3), which yields approximately equal sample sizes across groups, so group comparisons are not distorted by very small or very large groups. Equal-width binning was evaluated but rejected because coffee intake is right-skewed, so equal-width bins would have produced very uneven groups.
```python
df['Coffee_Drinker_Type'] = pd.qcut(
    df['Coffee_Intake'], q=3, labels=['Low', 'Moderate', 'High']
)
```
### 3.3 Group Distribution
After cleaning and re-binning, the three groups are near-equal in size, confirming the quantile approach worked as intended.
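The difference between quantile-based and equal-width binning can be illustrated on synthetic data. This is a sketch only: the gamma-distributed values below are an assumed stand-in for a right-skewed `Coffee_Intake` column, not the project's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed intake values (assumption: gamma-distributed, not the real data).
intake = pd.Series(rng.gamma(shape=2.0, scale=1.2, size=9000), name="Coffee_Intake")

labels = ["Low", "Moderate", "High"]
quantile_groups = pd.qcut(intake, q=3, labels=labels)  # equal-frequency bins
width_groups = pd.cut(intake, bins=3, labels=labels)   # equal-width bins

# Count the members of each group under both schemes.
q_sizes = {lab: int((quantile_groups == lab).sum()) for lab in labels}
w_sizes = {lab: int((width_groups == lab).sum()) for lab in labels}
print("qcut:", q_sizes)
print("cut: ", w_sizes)
```

On skewed data the equal-width scheme puts most rows in the lowest bin, while the quantile scheme keeps the three groups close to one third each. With heavily tied values (for example, whole cups per day) the quantile groups can still differ somewhat in size.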

---
## 4. Exploratory Data Analysis (EDA)
### Research Question
**Is there a measurable relationship between high coffee intake and general health outcomes?**
---
### Research Question 1
#### Does coffee intake level affect sleep duration?
**Observation:**
- A clear downward trend is visible across groups: Low intake has the highest median sleep hours, High intake has the lowest.
- The groups overlap, indicating that coffee is one of several factors, not the only driver of sleep duration.
**Correlation Analysis:**
To move beyond group comparison and test whether the relationship holds continuously, a Pearson correlation and a regression analysis were run directly on the raw `Coffee_Intake` variable.

**Statistical Result:**
| Metric | Value |
|---|---|
| Pearson r | −0.17 |
| Direction | Negative |
| Strength | Weak |
| p-value | < 0.001 |
**Insight:**
The negative correlation (r = −0.17, p < 0.001) is statistically significant and consistent across all three intake groups. Higher coffee consumption is weakly associated with reduced sleep duration. The relationship is real but modest — coffee intake is one contributor among several factors affecting sleep.
**Conclusion:**
Coffee intake has a weak but statistically significant negative association with sleep hours. Together with heart rate, it is one of the clearest directional signals in the dataset.
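A correlation and regression of this kind could be computed as sketched below, assuming SciPy is available. The data is synthetic, with a weak negative slope built in for illustration, so the printed numbers will not match the table above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 9400
# Synthetic stand-ins for Coffee_Intake and Sleep_Hours (parameters are assumptions).
coffee = rng.gamma(shape=2.0, scale=1.2, size=n)
sleep = 7.0 - 0.15 * coffee + rng.normal(0.0, 1.2, size=n)

# Pearson correlation coefficient and its two-sided p-value.
r, p_value = stats.pearsonr(coffee, sleep)

# Simple linear regression of sleep on coffee; the slope is in hours per cup.
fit = stats.linregress(coffee, sleep)
print(f"r = {r:.3f}, p = {p_value:.2e}, slope = {fit.slope:.3f} h per cup")
```

With thousands of rows even a weak correlation produces a very small p-value, so the size of r is the more informative number when judging how strong the relationship is.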
---
### Research Question 2
#### Is BMI associated with coffee intake level?
**Observation:**
- The boxplots show nearly identical medians and IQRs across all three groups — the distributions are visually almost indistinguishable.
- Outliers appear equally across all three groups, consistent with a real-world population.
- Unlike sleep, there is no visible directional shift in BMI from Low → High intake.

**Insight:**
The three groups show virtually no difference in BMI distribution. This is a meaningful null finding — it suggests that coffee intake, at least in this dataset, does not meaningfully track with body weight. BMI is driven by many lifestyle variables that are not captured here.
**Conclusion:**
BMI shows no meaningful association with coffee intake level. This variable does not carry a useful signal for the research question.
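The boxplot comparison can be backed by a numeric summary of medians and interquartile ranges per group. The sketch below uses synthetic BMI values drawn from one shared distribution (an assumption chosen to mimic a null result); only the column names come from this card.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 3000
order = ["Low", "Moderate", "High"]
df = pd.DataFrame({
    "Coffee_Drinker_Type": pd.Categorical(np.repeat(order, n), categories=order, ordered=True),
    # Same distribution for every group: synthetic values, not the real data.
    "BMI": rng.normal(24.0, 4.0, size=3 * n),
})

# Quartiles per group, reshaped so that each quantile becomes a column.
q = (
    df.groupby("Coffee_Drinker_Type", observed=True)["BMI"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
q.columns = ["q25", "median", "q75"]
q["iqr"] = q["q75"] - q["q25"]
print(q.round(2))
```

Near-identical medians and IQRs across the three rows are the numeric counterpart of boxplots that look indistinguishable.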
---
### Research Question 3
#### Does coffee intake affect resting heart rate?
**Observation:**
- The violin plots reveal that the distribution of resting heart rate shifts toward higher BPM as coffee intake increases.
- The Low intake group is more tightly centered around a lower BPM, while the High intake group shows both a higher center and a wider spread.
- The violin shape for the High group suggests greater variability — some individuals are unaffected while others show notable elevation.

**Insight:**
The High intake group shows a measurable upward shift in resting heart rate, consistent with caffeine's known stimulant effect. The widening of the distribution in the High group is analytically interesting — it may reflect individual differences in caffeine sensitivity or confounding lifestyle variables.
**Conclusion:**
Coffee intake is associated with modestly elevated resting heart rate. The effect is visible in distribution shape rather than median alone, which is why a violin plot was chosen over a boxplot for this metric.
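A shift in center combined with a change in spread can be quantified with a per-group median and standard deviation. The following sketch uses invented group parameters purely to show the computation; they are not estimates from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 3000
# Hypothetical (mean, standard deviation) of resting heart rate per group.
params = {"Low": (68, 6), "Moderate": (70, 7), "High": (73, 9)}

frames = [
    pd.DataFrame({"Coffee_Drinker_Type": group, "Heart_Rate": rng.normal(mu, sd, n)})
    for group, (mu, sd) in params.items()
]
df = pd.concat(frames, ignore_index=True)

# Median captures the center, std the spread; .loc restores Low/Moderate/High order.
summary = (
    df.groupby("Coffee_Drinker_Type")["Heart_Rate"]
    .agg(["median", "std"])
    .loc[list(params)]
)
print(summary.round(2))
```

A rising `std` column alongside a rising `median` corresponds to the wider, higher violin described for the High group.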
---
### Research Question 4
#### Does coffee intake influence stress level distribution?
**Observation:**
- The bar chart shows that the vast majority of respondents in all groups report Low stress (70–82%), with Medium stress accounting for 18–29%.
- High stress is reported by very few respondents overall: 0.5% (Low), 0.7% (Moderate), and 1.1% (High intake).
- Despite the small absolute numbers, the directional pattern is consistent — as coffee intake increases, the Low-stress share shrinks (81.8% → 70.3%) and the High-stress share grows (0.5% → 1.1%).

**Insight:**
The stress composition shift from Low → High intake is consistent and directional. Whether this reflects a causal relationship (caffeine increasing stress) or a selection effect (already-stressed individuals consuming more coffee) cannot be determined from observational data alone. Both are plausible mechanisms.
**Conclusion:**
Coffee intake is associated with a higher proportion of high-stress reporters and a lower proportion of low-stress reporters. This is one of the most visually clear findings in the dataset.
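Within-group stress shares like those quoted above can be obtained from a row-normalized cross-tabulation. This is a sketch on a toy frame; the counts are invented and only two intake groups are shown to keep it short.

```python
import pandas as pd

# Toy rows: counts are invented for illustration, not taken from the dataset.
toy = pd.DataFrame({
    "Coffee_Drinker_Type": ["Low"] * 10 + ["High"] * 10,
    "Stress_Level": (
        ["Low"] * 8 + ["Medium"] * 2                   # Low-intake group
        + ["Low"] * 6 + ["Medium"] * 3 + ["High"] * 1  # High-intake group
    ),
})

# normalize="index" turns each row into shares that sum to 1; scale to percent.
shares = pd.crosstab(toy["Coffee_Drinker_Type"], toy["Stress_Level"], normalize="index") * 100
shares = shares.reindex(index=["Low", "High"], columns=["Low", "Medium", "High"])
print(shares.round(1))
```

Normalizing by row rather than by column answers the question asked here: of the people in a given intake group, what share reports each stress level.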
---
### Research Question 5
#### Who are the high-stress reporters, and how are they distributed across coffee groups?
**Observation:**
- Absolute counts of high-stress reporters: Low intake n=16 (0.5% of group), Moderate n=20 (0.7%), High n=31 (1.1%).
- Although the absolute numbers are small, the High intake group has nearly double the rate of high-stress reporters compared to the Low group.
- The age distributions across all three groups are broadly similar, with median ages in the low-to-mid 30s and comparable spreads.

**Insight:**
The high-stress signal in the High intake group is not concentrated in a specific age bracket — it is distributed across the full age range. This makes it less likely to be a confound driven purely by age and more consistent with a genuine relationship between intake level and reported stress.
**Conclusion:**
High-intake coffee drinkers are overrepresented among high-stress reporters across all age groups, strengthening the stress finding from Research Question 4.
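Counts, rates, and age summaries for high-stress reporters can be produced in one grouped aggregation. The sketch below runs on randomly generated rows (an assumption), so its numbers are unrelated to the figures reported above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 6000
# Synthetic rows: group membership, stress level and age are drawn independently.
df = pd.DataFrame({
    "Coffee_Drinker_Type": rng.choice(["Low", "Moderate", "High"], size=n),
    "Stress_Level": rng.choice(["Low", "Medium", "High"], size=n, p=[0.75, 0.24, 0.01]),
    "Age": rng.integers(18, 65, size=n),
})

# Keep only high-stress reporters, then summarize them per intake group.
high_stress = df[df["Stress_Level"] == "High"]
by_group = high_stress.groupby("Coffee_Drinker_Type")["Age"].agg(n="count", median_age="median")

# Rate = high-stress reporters divided by the size of the whole intake group.
group_sizes = df["Coffee_Drinker_Type"].value_counts()
by_group["rate_pct"] = 100 * by_group["n"] / group_sizes
print(by_group.round(2))
```

Reporting the count next to the rate keeps the small absolute numbers visible, which matters when a group contains only a few dozen high-stress reporters.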
---
### Research Question 6
#### Within the high-stress group, is there a relationship between coffee intake and age?
**Observation:**
- The scatter plot of coffee intake vs. age within the high-stress group shows a very slight positive trend.
### Research Question 7
#### What has a bigger effect on high stress — age or coffee intake?
**Observation:**
- The chart directly compares the absolute correlation strength of Age (r = 0.008) and Coffee Intake (r = 0.041) with high-stress reporting within the high-stress group.
- Coffee Intake is roughly 5× stronger than Age as a predictor of high-stress status.
- Both effects are small in absolute terms, but the comparison clearly identifies coffee intake as the dominant factor between the two.

**Statistical Result:**
| Metric | Value |
|---|---|
| Age effect (absolute r) | 0.008 |
| Coffee Intake effect (absolute r) | 0.041 |
| Dominant factor | Coffee Intake (5× stronger than Age) |
**Insight:**
Age contributes virtually no signal (r = 0.008) while coffee intake, though still weak in absolute terms (r = 0.041), is larger by comparison. This makes age an unlikely confounding explanation for the stress pattern observed across coffee groups.
**Decision:** Age is not treated as a meaningful confound. The stress finding is more plausibly associated with coffee intake level than with demographic age differences between groups.
**Conclusion:**
Coffee intake is a stronger predictor of high-stress reporting than age within this group. Age is an unlikely alternative explanation for the stress findings.
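One way such a comparison could be computed is to correlate each candidate predictor with a 0/1 high-stress indicator. The sketch assumes the indicator is defined over all rows, since a correlation with it is only defined when both classes are present. The column `High_Stress` and the generating process are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 9000
age = rng.integers(18, 65, size=n).astype(float)
coffee = rng.gamma(2.0, 1.2, size=n)

# Hypothetical process: the chance of high stress rises with intake, not with age.
p_high = np.clip(0.005 + 0.01 * coffee, 0.0, 1.0)
high_stress = (rng.random(n) < p_high).astype(int)

df = pd.DataFrame({"Age": age, "Coffee_Intake": coffee, "High_Stress": high_stress})

# Pearson correlation of each predictor with the binary indicator (point-biserial).
abs_r = df[["Age", "Coffee_Intake"]].corrwith(df["High_Stress"]).abs()
print(abs_r.round(3))
print("ratio:", round(abs_r["Coffee_Intake"] / abs_r["Age"], 1))
```

When both correlations are close to zero the ratio between them is unstable, so it is best read alongside the absolute values rather than on its own.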
---
## 5. Correlation Structure
| Variable | Correlation with Coffee_Intake | Direction |
|---|---|---|
| Sleep_Hours | −0.17 | Negative (weak) |
| Heart_Rate | ~+0.18 | Positive (weak) |
| BMI | ~+0.02 | Negligible |
| Stress_Score | ~+0.04 | Negligible |
| Age | ~+0.008 | Negligible |
**Key finding:** Sleep and heart rate are the two variables most associated with coffee intake in this dataset. BMI, stress score, and age show negligible linear correlations.
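A table of this shape can be built by encoding the ordinal stress level as a number and correlating every numeric column with `Coffee_Intake`. In this sketch the Low=0 / Medium=1 / High=2 encoding of `Stress_Score` is an assumption, and the data is synthetic with weak effects built in only for sleep and heart rate.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 5000
df = pd.DataFrame({
    "Coffee_Intake": rng.gamma(2.0, 1.2, size=n),
    "BMI": rng.normal(24.0, 4.0, size=n),
    "Age": rng.integers(18, 65, size=n),
    "Stress_Level": rng.choice(["Low", "Medium", "High"], size=n, p=[0.75, 0.24, 0.01]),
})
# Weak built-in effects (illustrative parameters, not fitted to the real data).
df["Sleep_Hours"] = 7.0 - 0.15 * df["Coffee_Intake"] + rng.normal(0.0, 1.2, size=n)
df["Heart_Rate"] = 68.0 + 1.0 * df["Coffee_Intake"] + rng.normal(0.0, 9.0, size=n)

# Assumed ordinal encoding of the categorical stress level.
df["Stress_Score"] = df["Stress_Level"].map({"Low": 0, "Medium": 1, "High": 2})

cols = ["Sleep_Hours", "Heart_Rate", "BMI", "Stress_Score", "Age"]
corr = df[cols].corrwith(df["Coffee_Intake"]).sort_values(key=abs, ascending=False)
print(corr.round(3))
```

Sorting by absolute value puts the strongest associations first regardless of sign, which matches how the table above is read.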
---
## 6. Final Conclusion
This analysis shows that high coffee intake is associated with a consistent directional pattern across health indicators: less sleep, modestly elevated heart rate, and a higher share of high-stress reporters. The effect is clearest for **sleep** (r = −0.17, p < 0.001) and **heart rate**, while BMI shows no meaningful association.
No single metric shows a dramatic signal, but the directional consistency across independent outcomes — sleep, heart rate, and stress all pointing the same way — is the most meaningful finding in the dataset.
### Key Takeaways
| Health Metric | Finding | Signal Strength |
|---|---|---|
| Sleep Hours | Negative association — High group sleeps less per night | Weak (r = −0.17) |
| Heart Rate | Positive association — High group shows elevated BPM | Weak |
| BMI | No meaningful difference across groups | Negligible |
| Stress Level | High group has highest share of high-stress reporters | Small but consistent |
| Age | No meaningful relationship with intake level | Negligible |
---
## 7. Limitations
- The dataset is **synthetic** — relationships are plausible but may not reflect real-world causal mechanisms.
- The analysis is **observational** — no causal claims can be made. Directionality (does coffee cause poor sleep, or do poor sleepers drink more coffee?) cannot be determined here.
- `Coffee_Drinker_Type` groups are constructed from quantile splits; different binning schemes may shift group boundaries slightly.
- Self-reported variables (Stress_Level, Sleep_Quality) are subject to measurement noise.
- Confounders such as occupation, smoking, and alcohol consumption were not controlled for in the univariate analyses presented here.
---
## 8. Notebook & Plots
Full analysis with code: [Google Colab Notebook](notebook.ipynb)
Provider:
ADAMlam-16



