amitbenavraham/usda-nutrition-eda
Source: Hugging Face · Updated 2026-04-12 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/amitbenavraham/usda-nutrition-eda
# 🥗 USDA FoodData Central – Nutrition EDA
<video src="https://huggingface.co/datasets/amitbenavraham/usda-nutrition-eda/resolve/main/AmitEda.mp4" controls="controls" style="max-width: 720px;"></video>
## Overview
This project presents an end-to-end Exploratory Data Analysis (EDA) of nutritional data
from the USDA FoodData Central database. The goal is to uncover patterns in food nutrition,
compare food categories, and explore relationships between key nutritional features.
- **Source:** [omid5/usda-fdc-foods-cleaned](https://huggingface.co/datasets/omid5/usda-fdc-foods-cleaned)
- **Original size:** 501,887 rows × 22 columns
- **Final clean size:** 345,226 rows × 11 numeric features
- **Target Variable:** `data_type` (branded_food / sr_legacy_food)
---
## Features
| Column | Description | Unit |
|--------|-------------|------|
| Energy | Calories per 100 g | kcal |
| Protein | Protein content | g |
| Total lipid (fat) | Total fat | g |
| Fatty acids, total saturated | Saturated fat | g |
| Fatty acids, total trans | Trans fat | g |
| Carbohydrate, by difference | Carbohydrates | g |
| Fiber, total dietary | Dietary fiber | g |
| Cholesterol | Cholesterol | mg |
| Sodium, Na | Sodium | mg |
| Calcium, Ca | Calcium | mg |
| Iron, Fe | Iron | mg |
---
## Data Cleaning
Steps performed:
- Dropped irrelevant non-numeric columns (fdc_id, ingredients, serving info)
- Checked for duplicates → **0 found**
- Dropped columns with more than 50% missing values (Caffeine, Vitamin D, Potassium)
- Dropped remaining rows with any missing values
- Removed foundation_food category (only 1 row)
- **Final clean dataset: 345,226 rows × 11 numeric columns**
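
The cleaning steps above could be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `clean_nutrition`, its parameters, and the toy rows are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def clean_nutrition(df, drop_cols, target="data_type",
                    max_missing=0.5, drop_classes=("foundation_food",)):
    """Hypothetical helper mirroring the cleaning steps listed above."""
    # 1. Drop identifier / free-text columns that are not used in the analysis.
    out = df.drop(columns=[c for c in drop_cols if c in df.columns])
    # 2. Count fully duplicated rows (reported, not removed).
    n_duplicates = int(out.duplicated().sum())
    # 3. Drop columns whose share of missing values exceeds the threshold.
    missing_share = out.isna().mean()
    out = out.loc[:, missing_share <= max_missing]
    # 4. Drop rows that still contain any missing value.
    out = out.dropna()
    # 5. Remove classes that are too small to analyse.
    out = out[~out[target].isin(drop_classes)]
    return out.reset_index(drop=True), n_duplicates

# Toy rows for illustration only; these are not values from the dataset.
toy = pd.DataFrame({
    "fdc_id": [1, 2, 3, 4, 5, 6],
    "data_type": ["branded_food", "branded_food", "sr_legacy_food",
                  "sr_legacy_food", "sr_legacy_food", "foundation_food"],
    "Energy": [250.0, 120.0, 884.0, np.nan, 600.0, 60.0],
    "Protein": [8.0, 3.0, 0.0, 20.0, 15.0, 1.0],
    "Caffeine": [np.nan, np.nan, np.nan, 0.0, np.nan, np.nan],
})

clean, n_duplicates = clean_nutrition(toy, drop_cols=["fdc_id"])
print(clean.shape, n_duplicates)
```

On the toy frame, `Caffeine` is dropped for being mostly missing, one row is dropped for a missing `Energy` value, and the single `foundation_food` row is removed.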
---
## Outlier Detection
- Used box plots and the IQR method across all 11 numeric columns
- Outliers were found in every column, most often in Cholesterol (11.53%), Calcium (7.59%), and Fiber (6.83%)
- **Decision: retained all outliers.** In food data, extreme values are often legitimate
(e.g., pure oil has very high fat, table salt has very high sodium)
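
A minimal sketch of the IQR check, assuming the conventional 1.5 × IQR fences (the card does not state the multiplier that was used). The helper name `iqr_outlier_share` and the toy values are illustrative assumptions.

```python
import pandas as pd

def iqr_outlier_share(df, k=1.5):
    """Percentage of values per numeric column that fall outside the IQR fences."""
    num = df.select_dtypes("number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Compare each column against its own fences.
    is_outlier = num.lt(lower, axis="columns") | num.gt(upper, axis="columns")
    return (is_outlier.mean() * 100).sort_values(ascending=False)

# Toy values for illustration only: one salt-like row with very high sodium.
toy = pd.DataFrame({
    "Sodium, Na": [100, 120, 110, 130, 120, 115, 125, 38000],
    "Protein": [5, 6, 7, 8, 5, 6, 7, 8],
})

shares = iqr_outlier_share(toy)
print(shares)
```

Because the outliers are retained, a check like this only reports how many values are flagged; it does not filter any rows.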
---
## Research Question
**Can a food's category (branded vs. SR Legacy) be predicted from its nutritional profile?**

---
## Visualizations & Key Findings
### Q1: What is the average calorie content by food category?
**Finding:** SR Legacy foods average **518.6 kcal** vs. **291.6 kcal** for branded foods.
SR Legacy includes calorie-dense whole foods such as oils and nuts, while branded foods
span a much wider variety, including low-calorie options such as diet drinks and salads.
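
A per-category average like the one above can be computed with a single `groupby`. The sketch below uses toy rows, not values from the dataset; only the column names `data_type` and `Energy` come from the card.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "data_type": ["branded_food", "branded_food",
                  "sr_legacy_food", "sr_legacy_food"],
    "Energy": [50.0, 350.0, 884.0, 600.0],
})

# Mean energy (kcal per 100 g) for each category.
mean_kcal = toy.groupby("data_type")["Energy"].mean()
print(mean_kcal)
```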

### Q2: How is protein content distributed across food categories?
**Finding:** SR Legacy foods show a wider protein distribution (median ~10g, range 0–48g)
compared to branded foods (median ~6g, range 0–23g). SR Legacy includes more
protein-dense whole foods like meats, fish, and legumes.
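
One way to compare the two distributions numerically is a grouped summary of median, minimum, and maximum. This is a sketch on toy values. The ranges quoted above may have been read from box-plot whiskers rather than raw minima and maxima, so the two need not match exactly.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "data_type": ["branded_food"] * 3 + ["sr_legacy_food"] * 3,
    "Protein": [1.0, 2.0, 3.0, 2.0, 10.0, 30.0],
})

# Median and spread of protein (g) per category.
summary = toy.groupby("data_type")["Protein"].agg(["median", "min", "max"])
print(summary)
```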

### Q3: Is there a relationship between fat and calorie content?
**Finding:** Fat and calories show a clear positive relationship: as fat increases,
calories increase. This is consistent with fat providing about 9 kcal/g vs. about 4 kcal/g
for protein and carbohydrates. Both food categories follow the same trend.
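
The energy factors quoted above give a quick back-of-the-envelope estimate of calories from macronutrients. The function name `estimate_kcal` and the example food are hypothetical. Labelled energy values can differ from this estimate, for example because of fiber, alcohol, or rounding.

```python
def estimate_kcal(fat_g, protein_g, carb_g):
    """Rough energy estimate using 9 kcal/g for fat and 4 kcal/g for protein and carbs."""
    return 9 * fat_g + 4 * protein_g + 4 * carb_g

# 100 g of pure fat (e.g., an oil).
oil = estimate_kcal(fat_g=100, protein_g=0, carb_g=0)

# A hypothetical food with 10 g fat, 5 g protein, and 20 g carbohydrate per 100 g.
snack = estimate_kcal(fat_g=10, protein_g=5, carb_g=20)

print(oil, snack)
```

Since each gram of fat contributes more than twice the energy of a gram of protein or carbohydrate, fat-heavy foods sit at the top of the calorie range.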

### Q4: Which nutritional values are most correlated with each other?
**Key correlations:**
- Total Fat ↔ Energy: **0.72** (strongest)
- Saturated Fat ↔ Total Fat: **0.70**
- Carbohydrate ↔ Energy: **0.66**
- Saturated Fat ↔ Energy: **0.50**
- Fiber ↔ Carbohydrate: **0.38**
- Sodium ↔ all other features: **~0.00** (no linear correlation)
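
A ranked list of pairwise correlations like the one above can be pulled out of a correlation matrix as sketched below. This assumes Pearson correlation, which is the pandas default; the card does not name the method. The helper `top_correlations` and the toy values are illustrative.

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=5):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes("number").corr(method="pearson")
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy values for illustration only: energy is an exact linear function of fat here.
toy = pd.DataFrame({
    "Total lipid (fat)": [0.0, 10.0, 20.0, 30.0],
    "Energy": [10.0, 100.0, 190.0, 280.0],
    "Sodium, Na": [5.0, 1.0, 4.0, 2.0],
})

top = top_correlations(toy, n=3)
print(top)
```

Pearson correlation measures only linear association, so a value near zero does not by itself rule out a nonlinear relationship.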

---
## Summary & Conclusions
- SR Legacy foods are markedly more caloric on average than branded foods (518.6 vs. 291.6 kcal)
- Fat content is the feature most strongly correlated with calories (correlation = 0.72)
- SR Legacy foods show a higher median and a wider spread of protein content
- Sodium shows essentially no linear correlation with the other nutritional features
- Nutritional profiles differ meaningfully between commercial branded products
and USDA reference foods, which suggests that `data_type` may be predictable from nutritional features
---
## Tools & Technologies
- Python (Pandas, NumPy)
- Data Visualization (Matplotlib, Seaborn)
- Google Colab
- Hugging Face Datasets
---
## Author
**Amit Ben-Avraham**

Provided by: amitbenavraham



