
amitbenavraham/usda-nutrition-eda

Hugging Face | Updated 2026-04-12 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/amitbenavraham/usda-nutrition-eda
Resource description:
# 🥗 USDA FoodData Central – Nutrition EDA

<video src="https://huggingface.co/datasets/amitbenavraham/usda-nutrition-eda/resolve/main/AmitEda.mp4" controls="controls" style="max-width: 720px;"></video>

## Overview

This project presents an end-to-end Exploratory Data Analysis (EDA) of nutritional data from the USDA FoodData Central database. The goal is to uncover patterns in food nutrition, compare food categories, and explore relationships between key nutritional features.

---

## Dataset Description

- **Source:** [omid5/usda-fdc-foods-cleaned](https://huggingface.co/datasets/omid5/usda-fdc-foods-cleaned)
- **Original size:** 501,887 rows × 22 columns
- **Final clean size:** 345,226 rows × 11 numeric features
- **Target Variable:** `data_type` (branded_food / sr_legacy_food)

---

## Features

| Column | Description |
|--------|-------------|
| Energy | Calories per 100g (kcal) |
| Protein | Protein content (g) |
| Total lipid (fat) | Total fat (g) |
| Fatty acids, total saturated | Saturated fat (g) |
| Fatty acids, total trans | Trans fat (g) |
| Carbohydrate, by difference | Carbohydrates (g) |
| Fiber, total dietary | Dietary fiber (g) |
| Cholesterol | Cholesterol (mg) |
| Sodium, Na | Sodium (mg) |
| Calcium, Ca | Calcium (mg) |
| Iron, Fe | Iron (mg) |

---

## Data Cleaning

Steps performed:

- Dropped irrelevant non-numeric columns (fdc_id, ingredients, serving info)
- Checked for duplicates: **0 found**
- Dropped columns with more than 50% missing values (Caffeine, Vitamin D, Potassium)
- Dropped remaining rows with any missing values
- Removed the foundation_food category (only 1 row)
- **Final clean dataset: 345,226 rows × 11 numeric columns**

---

## Outlier Detection

- Used box plots and the IQR method across all 11 numeric columns
- Outliers were found in every column, with the highest shares in Cholesterol (11.53%), Calcium (7.59%), and Fiber (6.83%)
- **Decision: retained all outliers.** In food data, extreme values are legitimate (e.g., pure oil = very high fat, table salt = very high sodium)

---

## Research Question

**Can we predict the food category (branded vs. SR Legacy) based on its nutritional profile?**

---

## Visualizations & Key Findings

### Q1: What is the average calorie content by food category?

**Finding:** SR Legacy foods average **518.6 kcal** vs. **291.6 kcal** for branded foods. SR Legacy includes calorie-dense whole foods such as oils and nuts, while branded foods cover a much wider variety, including low-calorie options such as diet drinks and salads.

![Screenshot 2026-04-12 at 0.38.13](https://cdn-uploads.huggingface.co/production/uploads/69da2f3f6f0b4bf96e389dd5/ihOTy_KbY_-tPKs4mKUR4.png)

### Q2: How is protein content distributed across food categories?

**Finding:** SR Legacy foods show a wider protein distribution (median ~10g, range 0–48g) than branded foods (median ~6g, range 0–23g). SR Legacy includes more protein-dense whole foods such as meats, fish, and legumes.

![Screenshot 2026-04-12 at 0.39.03](https://cdn-uploads.huggingface.co/production/uploads/69da2f3f6f0b4bf96e389dd5/5jXRLCdVyBfX109w05iXy.png)

### Q3: Is there a relationship between fat and calorie content?

**Finding:** There is a clear positive relationship between fat and calories: as fat increases, calories increase. This is expected, since fat provides about 9 kcal/g vs. about 4 kcal/g for protein and carbohydrates. Both food categories follow the same trend.

![Screenshot 2026-04-12 at 0.40.05](https://cdn-uploads.huggingface.co/production/uploads/69da2f3f6f0b4bf96e389dd5/iyA-cGHZ59PbIwWDDpWtA.png)

### Q4: Which nutritional values are most correlated with each other?

**Key correlations:**

- Total Fat ↔ Energy: **0.72** (strongest)
- Saturated Fat ↔ Total Fat: **0.70**
- Carbohydrate ↔ Energy: **0.66**
- Saturated Fat ↔ Energy: **0.50**
- Fiber ↔ Carbohydrate: **0.38**
- Sodium ↔ all other features: **~0.00** (no linear correlation)

![Screenshot 2026-04-12 at 0.40.28](https://cdn-uploads.huggingface.co/production/uploads/69da2f3f6f0b4bf96e389dd5/eWdmiskpfbKsPPsrcb31t.png)

---

## Summary & Conclusions

- SR Legacy foods are substantially more caloric on average than branded foods (518.6 vs. 291.6 kcal)
- Total fat has the strongest correlation with energy (0.72)
- SR Legacy foods show a higher and wider protein distribution
- Sodium shows essentially no linear correlation with any other nutritional feature
- Nutritional profiles differ meaningfully between commercial branded products and USDA reference foods, supporting our research question

---

## Tools & Technologies

- Python (Pandas, NumPy)
- Data Visualization (Matplotlib, Seaborn)
- Google Colab
- HuggingFace Datasets

---

## Author

**Amit Ben-Avraham**
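
The missing-value filtering and IQR outlier check described under Data Cleaning and Outlier Detection could be sketched roughly as follows. This is a minimal illustration, not the author's notebook code: the function names `drop_sparse_columns` and `iqr_outlier_share` are hypothetical, the 1.5 × IQR fence is the conventional default and is assumed here (the card does not state the multiplier), and the small `toy` frame is synthetic data rather than rows from the dataset.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Drop columns whose missing share exceeds max_missing, then drop incomplete rows."""
    keep = df.columns[df.isna().mean() <= max_missing]
    return df[keep].dropna()


def iqr_outlier_share(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Percentage of rows per numeric column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series on column names.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.mean() * 100


# Synthetic example (assumed values, for illustration only).
toy = pd.DataFrame({
    "Energy": [50.0, 60.0, 55.0, 52.0, 58.0, 900.0],   # one extreme value
    "Protein": [1.0, 2.0, 1.5, 2.5, 1.8, 2.2],
    "Caffeine": [np.nan, np.nan, np.nan, np.nan, 1.0, 2.0],  # over 50% missing
})

cleaned = drop_sparse_columns(toy)    # Caffeine is dropped, all six rows remain
shares = iqr_outlier_share(cleaned)   # outlier percentage per remaining column
print(shares.round(2))
```

Reporting the outlier share per column, rather than removing the flagged rows, matches the decision above to retain extreme but legitimate values.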
Provided by:
amitbenavraham