MikeQian/Nutribench_subset_with_six_languages
Source: Hugging Face. Updated 2026-04-01; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/MikeQian/Nutribench_subset_with_six_languages
Resource description:
# NutriBench Multilingual Extension (English, German, Chinese, Lao, Japanese, Cantonese)
## Dataset Description
This dataset is a subset of roughly 1,000 samples from [NutriBench v2](https://huggingface.co/datasets/dongx1997/NutriBench) with a multilingual extension, designed for cross-lingual evaluation of nutrition estimation from meal descriptions. It retains the original English meal-description field and adds translated versions in German, Chinese, Lao, Japanese, and Cantonese.
The dataset was constructed to compare model performance under two settings:
1. Direct multilingual estimation, where the model predicts nutrition values from non-English meal descriptions directly.
2. Translation-based estimation, where translated meal descriptions are used to study how translation affects downstream performance.
The dataset preserves the original nutritional labels and metadata from NutriBench v2 while introducing additional multilingual text fields for research purposes.
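One way to compare the two settings above is to score predictions per setting and per language. The sketch below uses mean absolute error, which is an assumption: the card does not name a metric. The record keys (`setting`, `lang`, `true`, `pred`) and the numbers are hypothetical toy values, not results.

```python
from collections import defaultdict


def mae_by_setting_and_language(records):
    """Mean absolute error grouped by (setting, language).

    Each record is a dict with hypothetical keys:
      "setting": "direct" or "translated"
      "lang":    language code such as "en" or "lo"
      "true":    reference label (e.g. carbohydrate in grams)
      "pred":    model estimate in the same unit
    """
    errors = defaultdict(list)
    for record in records:
        key = (record["setting"], record["lang"])
        errors[key].append(abs(record["pred"] - record["true"]))
    return {key: sum(values) / len(values) for key, values in errors.items()}


# Toy values for illustration only.
records = [
    {"setting": "direct", "lang": "lo", "true": 30.0, "pred": 20.0},
    {"setting": "translated", "lang": "lo", "true": 30.0, "pred": 28.0},
    {"setting": "translated", "lang": "lo", "true": 10.0, "pred": 14.0},
]
scores = mae_by_setting_and_language(records)
```

Grouping by both keys keeps the direct and translation-based scores for the same language side by side.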
---
## Dataset Structure
| Column | Description |
|---|---|
| `sample_id` | Unique identifier for each sample |
| `carb` | Carbohydrate content in grams |
| `fat` | Fat content in grams |
| `energy` | Energy value for the meal sample |
| `protein` | Protein content in grams |
| `country` | Country associated with the meal sample |
| `serving_type` | Type of serving description, such as natural-language or metric-style expressions |
| `meal_description_en` | Original English meal description |
| `meal_description_de` | German translation of the meal description |
| `meal_description_zh` | Chinese translation of the meal description |
| `meal_description_lo` | Lao translation of the meal description |
| `meal_description_ja` | Japanese translation of the meal description |
| `meal_description_yue` | Cantonese translation of the meal description |
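The columns above can be checked programmatically before running an evaluation. This is a minimal sketch: the repository ID is taken from the page title, the split name `"train"` is an assumption, and `missing_columns` and `load_rows` are hypothetical helper names.

```python
EXPECTED_COLUMNS = [
    "sample_id", "carb", "fat", "energy", "protein", "country", "serving_type",
    "meal_description_en", "meal_description_de", "meal_description_zh",
    "meal_description_lo", "meal_description_ja", "meal_description_yue",
]


def missing_columns(row):
    """Return the expected column names that are absent from one record."""
    return [name for name in EXPECTED_COLUMNS if name not in row]


def load_rows(repo_id="MikeQian/Nutribench_subset_with_six_languages", split="train"):
    """Load the dataset from the Hugging Face Hub (requires `pip install datasets`).

    The split name "train" is an assumption; check the repository for the real one.
    """
    # Imported lazily so that missing_columns() works without the library.
    from datasets import load_dataset
    return load_dataset(repo_id, split=split)
```

A typical use would be `missing_columns(load_rows()[0])`, which should return an empty list if the table matches the published files.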
---
## Source Dataset
This dataset is derived from:
DongXzz, NutriBench, Hugging Face Datasets, v2
[https://huggingface.co/datasets/dongx1997/NutriBench](https://huggingface.co/datasets/dongx1997/NutriBench)
The nutritional labels and original meal descriptions originate from NutriBench v2. This multilingual extension preserves the original dataset structure while adding translated meal-description fields for cross-lingual evaluation.
---
## Translation Strategy
The multilingual meal-description fields were generated using ChatGPT 5.4. In addition, a limited human check was carried out on the Simplified Chinese translations to assess translation quality and consistency.
The goal of the translation process was to preserve the semantic content of the original meal descriptions as faithfully as possible, while minimizing distortion of quantities, units, and food-item specificity.
### Translation Rules
The following rules were used during translation:
1. Preserve all numbers, quantities, decimals, and measurement units exactly (for example, 26g, 375.0, ml, g).
2. Keep food names specific.
3. Do not add explanations.
4. Do not omit any food item.
5. Output only the translated text.
6. Do not use markdown, JSON, bullet points, or quotation marks unless they are part of the source text.
7. Up to five self-repair attempts were allowed if the output did not satisfy the formatting or fidelity requirements.
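Rules 1 and 7 lend themselves to an automated check. The sketch below is not the authors' pipeline. It assumes a hypothetical `translate(text, attempt)` callable standing in for the model call, and it checks only that numeric tokens survive translation, not units or food names.

```python
import re

# Matches integers and decimals such as "26" or "375.0".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def numbers_preserved(source, translation):
    """True if both texts contain the same numeric tokens with the same counts."""
    return sorted(NUMBER.findall(source)) == sorted(NUMBER.findall(translation))


def translate_with_repair(source, translate, max_attempts=5):
    """Retry a translation until it passes the number check.

    `translate` is a hypothetical callable (text, attempt) -> text.
    Returns the first accepted translation, or None if all attempts fail.
    """
    for attempt in range(1, max_attempts + 1):
        candidate = translate(source, attempt).strip()
        if candidate and numbers_preserved(source, candidate):
            return candidate
    return None
```

Because the comparison is on exact strings, a localized decimal such as `375,0` or a numeral in another script is rejected, which matches the "preserve exactly" wording of rule 1.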
---
## Dataset Statistics
- Number of samples: 1,001
- Number of columns: 13
- Languages: English, German, Chinese, Lao, Japanese, Cantonese
---
## Intended Use
This dataset is intended for:
- Evaluation of large language models on multilingual nutrition estimation
- Cross-lingual robustness analysis
- Comparison between direct multilingual inference and translation-based inference
- Research on the effect of language variation in food-description understanding
---
## Limitations
Users should keep the following limitations in mind:
- The multilingual meal-description fields are derived from the original English data and may contain translation artifacts.
- Language-specific phrasing differences may affect model performance independently of actual nutrition understanding.
- Country and serving-type distributions may be uneven, which can influence subgroup-level evaluation results.
- Some translation or localization noise may remain in the non-English text fields.
- Human verification was limited and was not applied equally across all languages.
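Before reporting subgroup-level results, it may help to count how many samples fall into each country and serving-type combination. This is a minimal sketch; `subgroup_counts` is a hypothetical helper, and the values in the toy rows are placeholders rather than actual dataset values.

```python
from collections import Counter


def subgroup_counts(rows, keys=("country", "serving_type")):
    """Count rows per combination of the given columns."""
    return Counter(tuple(row[key] for key in keys) for row in rows)


# Placeholder rows for illustration only.
rows = [
    {"country": "A", "serving_type": "metric"},
    {"country": "A", "serving_type": "metric"},
    {"country": "B", "serving_type": "natural"},
]
counts = subgroup_counts(rows)
```

Small counts flag subgroups whose scores rest on too few samples to compare reliably.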
---
## Licence
This dataset is released under [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).
Provider: MikeQian



