
MikeQian/Nutribench_subset_with_six_languages

Source: Hugging Face | Updated: 2026-04-01 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/MikeQian/Nutribench_subset_with_six_languages
Description:
# NutriBench Multilingual Extension (English, German, Chinese, Lao, Japanese, Cantonese)

## Dataset Description

This dataset is a subset of about 1,000 samples from [NutriBench v2](https://huggingface.co/datasets/dongx1997/NutriBench), extended with multilingual meal descriptions and designed for cross-lingual evaluation of nutrition estimation from meal descriptions. It retains the original English meal-description field and adds translated versions in German, Chinese, Lao, Japanese, and Cantonese.

The dataset was constructed to compare model performance under two settings:

1. Direct multilingual estimation, where the model predicts nutrition values directly from non-English meal descriptions.
2. Translation-based estimation, where translated meal descriptions are used to study how translation affects downstream performance.

The dataset preserves the original nutritional labels and metadata from NutriBench v2 while adding multilingual text fields for research purposes.

---

## Dataset Structure

| Column | Description |
|---|---|
| `sample_id` | Unique identifier for each sample |
| `carb` | Carbohydrate content in grams |
| `fat` | Fat content in grams |
| `energy` | Energy value for the meal sample |
| `protein` | Protein content in grams |
| `country` | Country associated with the meal sample |
| `serving_type` | Type of serving description, such as natural-language or metric-style expressions |
| `meal_description_en` | Original English meal description |
| `meal_description_de` | German translation of the meal description |
| `meal_description_zh` | Chinese translation of the meal description |
| `meal_description_lo` | Lao translation of the meal description |
| `meal_description_ja` | Japanese translation of the meal description |
| `meal_description_yue` | Cantonese translation of the meal description |

---

## Source Dataset

This dataset is derived from:

DongXzz, NutriBench, Hugging Face Datasets, v2
[https://huggingface.co/datasets/dongx1997/NutriBench](https://huggingface.co/datasets/dongx1997/NutriBench)

The nutritional labels and original meal descriptions originate from NutriBench v2. This multilingual extension preserves the original dataset structure while adding translated meal-description fields for cross-lingual evaluation.

---

## Translation Strategy

The multilingual meal-description fields were generated using ChatGPT 5.4. In addition, a small amount of human checking was carried out on the Simplified Chinese translations to assess translation quality and consistency.

The goal of the translation process was to preserve the semantic content of the original meal descriptions as faithfully as possible while minimizing distortion of quantities, units, and food-item specificity.

### Translation Rules

The following rules were used during translation:

1. Preserve all numbers, quantities, decimals, and measurement units exactly (for example, 26g, 375.0, ml, g).
2. Keep food names specific.
3. Do not add explanations.
4. Do not omit any food item.
5. Output only the translated text.
6. Do not use markdown, JSON, bullet points, or quotation marks unless they are part of the source text.
7. Up to five self-repair attempts were allowed if the output did not satisfy the formatting or fidelity requirements.

---

## Dataset Statistics

- Number of samples: 1,001
- Number of columns: 13
- Languages: English, German, Chinese, Lao, Japanese, Cantonese

## Intended Use

This dataset is intended for:

- Evaluation of large language models on multilingual nutrition estimation
- Cross-lingual robustness analysis
- Comparison between direct multilingual inference and translation-based inference
- Research on the effect of language variation in food-description understanding

---

## Limitations

Users should keep the following limitations in mind:

- The multilingual meal-description fields are derived from the original English data and may contain translation artifacts.
- Language-specific phrasing differences may affect model performance independently of actual nutrition understanding.
- Country and serving-type distributions may be uneven, which can influence subgroup-level evaluation results.
- Some translation or localization noise may remain in the non-English text fields.
- Human verification was limited and was not applied equally across all languages.

---

## Licence

This dataset is released under [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).
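As an illustration of the translation rules listed under Translation Strategy, the sketch below shows one way a fidelity check for rule 1 and a bounded retry loop for rule 7 might look. It is not the pipeline used to build this dataset. The function names `numbers_preserved` and `translate_with_repair` and the `translate` callable are hypothetical. The check covers numeric tokens only, not units or food names, and it assumes that "five attempts" means five calls in total.

```python
import re
from collections import Counter

# Integers and decimals such as "26" or "375.0".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def numbers_preserved(source: str, translation: str) -> bool:
    """Return True if both strings contain the same multiset of numeric tokens.

    Illustrative check for rule 1 only: units and food names are not verified.
    """
    return Counter(NUMBER.findall(source)) == Counter(NUMBER.findall(translation))


def translate_with_repair(source, translate, max_attempts=5):
    """Call the hypothetical `translate` callable until the numeric check passes.

    Assumption: `max_attempts` counts every call, including the first one.
    Returns the first passing translation, or None if every attempt fails.
    """
    for _ in range(max_attempts):
        candidate = translate(source)
        if numbers_preserved(source, candidate):
            return candidate
    return None
```

A row whose translation fails this check, for example one where `26g` has become `25g`, could be flagged for manual review before being used in evaluation.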
Provider:
MikeQian