kumbh/neurolab-health-nutrition
Source: Hugging Face | Updated 2026-03-17 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/kumbh/neurolab-health-nutrition
Resource description:
---
license: cc-by-4.0
language:
- en
task_categories:
- text-generation
- question-answering
task_ids:
- language-modeling
- open-domain-qa
tags:
- nutrition
- health
- food-safety
- dietetics
- india
- food-additives
- glycaemic-index
- nlp
- instruction-tuning
pretty_name: NeuroLab Health & Nutrition Knowledge Base
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train.jsonl
  - split: validation
    path: data/valid.jsonl
- config_name: parquet
  data_files:
  - split: train
    path: data/dataset.parquet
---
# NeuroLab Health & Nutrition Knowledge Base
A curated instruction-tuning dataset for health and nutrition AI assistants, with a focus on Indian dietary guidelines, food safety, and packaged food analysis.
## Dataset Description
This dataset provides question-answer pairs formatted for supervised fine-tuning (SFT) of large language models. It covers:
- **E-numbers / Food Additives** — Safety profiles, origins, regulatory status
- **Glycaemic Index (GI)** — GI values and glycaemic load for 40+ common foods including Indian staples
- **ICMR Dietary Guidelines** — Recommended Dietary Allowances for Indian population groups (ICMR-NIN 2020)
- **NOVA Classification** — Ultra-processed food identification (Groups 1-4)
- **Nutrient Deficiency Guide** — Symptoms, at-risk groups, food sources, and absorption tips for 8 key nutrients
- **Packaged Food Analysis** — Products from Open Food Facts with nutritional breakdowns
- **USDA Nutritional Composition** — Per-100g nutritional data for common whole foods
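The glycaemic load mentioned in the GI item combines a food's GI with the carbohydrate in a serving. A minimal sketch of the standard formula, GL = GI × available carbohydrate (g) / 100, is given here. The GI of 50 for brown rice comes from the example record in this card; the 40 g serving figure is an assumption used only for illustration and is not taken from the dataset. The GI bands use the commonly cited cut-offs (low ≤ 55, medium 56 to 69, high ≥ 70).

```python
def glycaemic_load(gi: float, available_carbs_g: float) -> float:
    """Glycaemic load of one serving: GI x available carbohydrate (g) / 100."""
    if gi < 0 or available_carbs_g < 0:
        raise ValueError("GI and carbohydrate must be non-negative")
    return gi * available_carbs_g / 100.0


def gi_band(gi: float) -> str:
    """Commonly cited GI bands: low (<= 55), medium (56-69), high (>= 70)."""
    if gi <= 55:
        return "low"
    if gi < 70:
        return "medium"
    return "high"


# GI of 50 is from the card's example record; 40 g of available carbohydrate
# per serving is an assumed value, not a figure from the dataset.
brown_rice_gl = glycaemic_load(gi=50, available_carbs_g=40)
print(brown_rice_gl, gi_band(50))
```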
## Data Splits
| Split | Rows |
|------------|----------|
| Train | 2,781 |
| Validation | 309 |
| **Total** | **3,090** |
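One way to load the splits and check them against the table above is sketched below with the Hugging Face `datasets` library. The repository ID and the `default` config name are taken from this card; the helper names (`load_default_config`, `row_count_mismatches`) are hypothetical. The download call sits inside a function so that nothing is fetched until it is called.

```python
REPO_ID = "kumbh/neurolab-health-nutrition"
EXPECTED_ROWS = {"train": 2781, "validation": 309}  # from the Data Splits table


def load_default_config():
    """Download the JSONL-backed `default` config (needs network access)."""
    # Imported lazily so the rest of this sketch runs without the package.
    from datasets import load_dataset
    return load_dataset(REPO_ID, "default")


def row_count_mismatches(actual: dict) -> dict:
    """Return {split: (expected, actual)} for every split whose size differs."""
    return {
        split: (expected, actual.get(split))
        for split, expected in EXPECTED_ROWS.items()
        if actual.get(split) != expected
    }


# Usage (not executed here because it downloads data):
#   ds = load_default_config()
#   counts = {name: split.num_rows for name, split in ds.items()}
#   print(row_count_mismatches(counts))  # an empty dict means the table matches
```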
## Data Fields
Each example contains:
```json
{
  "conversations": [
    {"from": "system", "value": "You are NeuroLab AI..."},
    {"from": "human", "value": "What is the GI of brown rice?"},
    {"from": "gpt", "value": "The glycaemic index of brown rice is 50..."}
  ]
}
```
The `parquet` config additionally includes `question`, `answer`, `source`, and `quality` fields for easy filtering.
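Many fine-tuning tools expect `role`/`content` messages instead of the `from`/`value` pairs shown above. The sketch below converts one record, assuming the three speaker tags in the example (`system`, `human`, `gpt`) are the only ones in the data. The role mapping and the function name `to_chat_messages` are illustrative choices and are not part of the dataset.

```python
# Assumed mapping from the card's speaker tags to common chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def to_chat_messages(example: dict) -> list:
    """Convert one `conversations` record into role/content messages."""
    messages = []
    for turn in example["conversations"]:
        sender = turn["from"]
        if sender not in ROLE_MAP:
            # Fail loudly if the data holds a tag this sketch did not expect.
            raise ValueError(f"unexpected speaker tag: {sender!r}")
        messages.append({"role": ROLE_MAP[sender], "content": turn["value"]})
    return messages


# The record below is the example shown in this card.
example = {
    "conversations": [
        {"from": "system", "value": "You are NeuroLab AI..."},
        {"from": "human", "value": "What is the GI of brown rice?"},
        {"from": "gpt", "value": "The glycaemic index of brown rice is 50..."},
    ]
}
for message in to_chat_messages(example):
    print(message["role"], "->", message["content"])
```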
## Sources
- ICMR-NIN 2020 Dietary Guidelines (India)
- Glycaemic Index Database (Atkinson et al., 2021)
- E-number / Food Additives Reference (EU Regulation 1333/2008)
- NOVA Food Processing Classification (Monteiro et al.)
- Open Food Facts (openfoodfacts.org)
- USDA FoodData Central
- Expert-curated Q&A pairs
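Because the Parquet data carries `source` and `quality` fields, rows can be tallied or filtered by origin. The sketch below works on plain dictionaries, so it applies to rows however they were loaded. The label values in the sample (`"usda"`, `"open_food_facts"`, `"high"`, `"medium"`) are hypothetical, since this card does not list the actual values stored in those fields; inspect the data before relying on any of them.

```python
from collections import Counter


def rows_per_source(rows) -> Counter:
    """Count rows by their `source` field."""
    return Counter(row["source"] for row in rows)


def filter_rows(rows, source=None, quality=None) -> list:
    """Keep rows matching the given `source` and/or `quality` (None = any)."""
    return [
        row for row in rows
        if (source is None or row["source"] == source)
        and (quality is None or row["quality"] == quality)
    ]


# Hypothetical rows: the field names come from the card, the values do not.
sample_rows = [
    {"question": "q1", "answer": "a1", "source": "usda", "quality": "high"},
    {"question": "q2", "answer": "a2", "source": "usda", "quality": "medium"},
    {"question": "q3", "answer": "a3", "source": "open_food_facts", "quality": "high"},
]
print(rows_per_source(sample_rows))
print(len(filter_rows(sample_rows, source="usda", quality="high")))
```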
## Intended Use
- Fine-tuning language models for health and nutrition question-answering
- Building RAG pipelines for food safety and dietary guidance
- Research on Indian dietary patterns and food labelling
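For the RAG use case, the retrieval step can be illustrated with a toy keyword-overlap ranker over question/answer pairs. This is only a sketch: a real pipeline would more likely use embeddings or BM25. The first question is the example from this card; the other two questions and all answer strings are hypothetical placeholders, not dataset content.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query: str, pairs: list, k: int = 1) -> list:
    """Return the k pairs whose question overlaps most with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        pairs,
        key=lambda pair: jaccard(query_tokens, tokenize(pair["question"])),
        reverse=True,
    )
    return ranked[:k]


# Placeholder corpus; answers are elided on purpose.
pairs = [
    {"question": "What is the GI of brown rice?", "answer": "<answer text>"},
    {"question": "Is E330 safe to eat?", "answer": "<answer text>"},
    {"question": "Which NOVA group covers ultra-processed food?", "answer": "<answer text>"},
]
best = retrieve("GI of brown rice", pairs, k=1)[0]
print(best["question"])
```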
## Limitations
- Nutritional values are reference averages; individual food products vary.
- ICMR RDAs are specific to the Indian population and may differ from WHO or USDA recommendations.
- Packaged food data from Open Food Facts may be incomplete or user-contributed.
- This dataset is for educational purposes and should not replace personalised medical advice.
## License
[Creative Commons Attribution 4.0 (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)
## Citation
```bibtex
@dataset{neurolab_nutrition_2024,
  title = {NeuroLab Health \& Nutrition Knowledge Base},
  author = {NeuroLab},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/datasets/kumbh/neurolab-health-nutrition}
}
```
Provider: kumbh



