
mrdbourke/FoodExtract-135k

Source: Hugging Face · Updated 2026-03-25 · Indexed 2026-03-29

Download link: https://hf-mirror.com/datasets/mrdbourke/FoodExtract-135k
---
license: mit
dataset_info:
  features:
  - name: sequence
    dtype: string
  - name: image_url
    dtype: string
  - name: class_label
    dtype: string
  - name: source
    dtype: string
  - name: char_len
    dtype: float64
  - name: word_count
    dtype: float64
  - name: syn_or_real
    dtype: string
  - name: uuid
    dtype: string
  - name: gpt-oss-120b-label
    dtype: string
  - name: target_food_names_to_use
    list: string
  - name: caption_detail_level
    dtype: string
  - name: cuisine
    dtype: string
  - name: num_foods
    dtype: float64
  - name: target_image_point_of_view
    dtype: string
  - name: gpt-oss-120b-label-condensed
    dtype: string
---

# FoodExtract-135k

A dataset designed for fine-tuning a small LLM (e.g. `gemma-3-270m`) to extract structured data from text in a way that replicates a much larger LLM (e.g. `gpt-oss-120b`).

The purpose is to enable a fine-tuned small LLM to filter a large text dataset for food and drink-like items.

For example, take the DataComp1B dataset and use the fine-tuned LLM to filter for food and drink related items.

## Example sample

```python
{'sequence': 'A mouth-watering photograph captures a delectable dish centered on a rectangular white porcelain plate, resting on a rustic wooden tabletop indoors. '
             'In the background, a wooden cutting board with a long handle subtly enhances the setting. '
             'The plate is adorned with several generously-sized, cheese-stuffed peppers that have been roasted to perfection, their blistered skins marked by charred black spots. '
             'Split down the middle, the peppers reveal a creamy white cheese filling, enriched with a blend of aromatic herbs. '
             'Once stuffed, the peppers have been closed and roasted, achieving a luscious, smoky flavor. '
             'The dish is elegantly garnished with vibrant cherry tomato halves, freshly chopped green herbs, and delicate sprinkles of small diced red onions. '
             'A light, possibly citrus-infused dressing, hinted by a sheen of oil or lime juice, gently coats the ensemble, adding an extra layer of freshness. '
             'The meticulous presentation and vivid colors make this image not only a feast for the stomach but also a feast for the eyes.',
 'image_url': 'http://i.imgur.com/X7cM9Df.jpg',
 'class_label': 'food',
 'source': 'pixmo_cap_dataset',
 'char_len': 1028,
 'word_count': 160,
 'syn_or_real': 'real',
 'uuid': '6720d6e0-5912-41e7-be50-85a2b63bfef9',
 'gpt-oss-120b-label': {'is_food_or_drink': True,
                        'tags': ['fi', 'fa'],
                        'food_items': ['cheese-stuffed peppers', 'cherry tomato halves', 'green herbs', 'diced red onions', 'citrus-infused dressing', 'oil', 'lime juice', 'cheese'],
                        'drink_items': []},
 'gpt-oss-120b-label-condensed': 'food_or_drink: 1 tags: fi, fa foods: cheese-stuffed peppers, cherry tomato halves, green herbs, diced red onions, citrus-infused dressing, oil, lime juice, cheese drinks:'}
```

Fields breakdown:

| Field | Type | Description |
|---|---|---|
| `sequence` | `str` | A detailed natural language caption describing the image; may or may not be food related. |
| `image_url` | `str` | URL pointing to the source image. |
| `class_label` | `str` | A high-level category label for the image (e.g. `"food"` or `"not_food"`). |
| `source` | `str` | The name of the dataset this sample originated from (e.g. `pixmo_cap_dataset`). |
| `char_len` | `int` | Character length of the `sequence` field (e.g. 1028 characters). The dataset metadata declares this column as `float64`. |
| `word_count` | `int` | Word count of the `sequence` field (e.g. 160 words). The dataset metadata declares this column as `float64`. |
| `syn_or_real` | `str` | Indicates whether the image is synthetic or real; `"real"` here means it's a real photograph. |
| `uuid` | `str` | A unique identifier (UUID v4) for this particular sample. |
| `gpt-oss-120b-label` | `dict` | A structured label produced by `gpt-oss-120b` (the dataset metadata declares this column as `string`): |
| ↳ `is_food_or_drink` | `bool` | Binary flag: `True` if the image contains food or drink. |
| ↳ `tags` | `list[str]` | Short tag codes, see `tags_dict` below. |
| ↳ `food_items` | `list[str]` | List of identified food items extracted from the caption/image. |
| ↳ `drink_items` | `list[str]` | List of identified drink items; empty here since no drinks are present. |
| `gpt-oss-120b-label-condensed` | `str` | A flattened, human-readable string version of `gpt-oss-120b-label`, used for compact generation labels. |

## Tags dictionary mapping

```python
tags_dict = {'np': 'nutrition_panel',
             'il': 'ingredient list',
             'me': 'menu',
             're': 'recipe',
             'fi': 'food_items',
             'di': 'drink_items',
             'fa': 'food_advertisement',
             'fp': 'food_packaging'}
```

## Datasets used

* Wikipedia cuisines + dishes extract as seed with Gemini 3 Flash captions - 57308 samples
* [pixmo_cap](https://huggingface.co/datasets/allenai/pixmo-cap) - 20000 samples
* [coyo700m](https://huggingface.co/datasets/kakaobrain/coyo-700m) - 22554 samples
* [qwen2vl_open](https://huggingface.co/datasets/weizhiwang/Open-Qwen2VL-Data) - 20000 samples
* manual_taken_photos - 215 samples
* random_string_generation - 5000 samples
* synthetic_generation - 10000 samples

## Steps to construct the dataset

1. Collect food/not_food samples from mixed sources (see [Datasets used](#datasets-used))
2. Label with [`gpt-oss-120b`](https://huggingface.co/openai/gpt-oss-120b) (large model) using the following prompt:

```python
BASE_PROMPT = """Given the following passage of text, please extract the following food and drink related items in the following structure:

{"is_food_or_drink": str - bool of true/false as to whether or not the passage of text is related to human edible food or drink items, if true, fill the rest of the items, if false, return the following keys as empty lists,
"tags": List[str] - list of string tags related to the text, see food_tags dictionary below for valid tags,
"food_items": List[str] - list of human edible food items mentioned in the text,
"drink_items": List[str] - list of human edible drink items mentioned in the text}

The following tag dictionary describes the valid tags available to tag a passage of text with.
Only use keys of the tags dictionary to annotate a passage of text with.

<food_tags>
{"np": "'nutrition panel' - use this tag if the text describes or mentions a nutrition panel or table",
"il": "'ingredient list' - use this tag if the text describes or mentions an ingredients list of human edible foods or drinks of any kind",
"me": "'menu' - use this tag if the text describes or mentions a menu of human edible foods or drinks of any kind",
"re": "'recipe' - use this tag if the text describes or mentions a recipe of human edible foods or drinks of any kind",
"fi": "'food items' - use this tag if the text describes or mentions any human edible food items",
"di": "'drink items' - use this tag if the text describes or mentions any human edible drink items",
"fa": "'food advertisement' - use this tag if the text describes, mentions or sounds like an advertisement for food or drink items",
"fp": "'food packaging' - use this tag if the text describes the look or visuals or mentions food or drink packaging"}
</food_tags>

A single sample can have multiple tags.
For example, if the input text describes a nutrition panel on the back of a package of food with a list of ingredients containing edible food items, the following tags would be used: ["np", "il", "fi", "fp"].
This highlights the presence of "nutrition panel", "ingredients list", "food items" and "food packing".

Further instructions:
- If the input text is not food or drink related, set is_food_or_drink to false and return empty lists for the rest of the fields.
- For food_items and drink_items, only return *exact* items mentioned in the text, do not modify their original spellings or naming attributes.
- For food_items and drink_items, limit adjectives to simple food-related descriptions, for example, "perfectly cooked steak" -> "cooked steak".
- Return only the valid JSON described in the structure and nothing else.

Input text:
'<input_text>'
"""
```

## Example usage

For an end-to-end example of loading the dataset, fine-tuning a small LLM, and comparing outputs to the original model, see the full notebook: **[Hugging Face LLM Full Fine-Tune Tutorial](https://www.learnhuggingface.com/notebooks/hugging_face_llm_full_fine_tune_tutorial)**

The notebook covers the following steps:

1. Load the dataset from Hugging Face
2. Extract samples (the ideal use case is to train an LLM to go from `item["sequence"]` → `item["gpt-oss-120b-label-condensed"]`)
3. Fine-tune a small LLM (e.g. `gemma-3-270m`)
4. Compare small LLM outputs to the original `gpt-oss-120b` labels
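
The training target in step 2 is the condensed label string, so it can help to see how the structured `gpt-oss-120b-label` relates to it. The following is a minimal sketch, not the author's own conversion code: `condense_label` and `parse_condensed` are hypothetical helper names, and the single-space separator between fields is an assumption taken from how the example sample appears on this page (the published dataset may separate fields differently, e.g. with newlines, which the parser tolerates).

```python
import re

# Structured label copied from the example sample in this card.
EXAMPLE_LABEL = {
    "is_food_or_drink": True,
    "tags": ["fi", "fa"],
    "food_items": ["cheese-stuffed peppers", "cherry tomato halves", "green herbs",
                   "diced red onions", "citrus-infused dressing", "oil",
                   "lime juice", "cheese"],
    "drink_items": [],
}


def condense_label(label: dict, sep: str = " ") -> str:
    """Flatten a structured label into the condensed string form.

    `sep` is an assumption: a single space reproduces the example as shown here.
    """
    parts = [
        f"food_or_drink: {int(bool(label['is_food_or_drink']))}",
        f"tags: {', '.join(label['tags'])}",
        f"foods: {', '.join(label['food_items'])}",
        f"drinks: {', '.join(label['drink_items'])}",
    ]
    # rstrip() removes the trailing space left behind by an empty list.
    return sep.join(part.rstrip() for part in parts)


# Field names act as delimiters; \s* accepts spaces or newlines between fields.
_CONDENSED_RE = re.compile(
    r"food_or_drink:\s*(?P<flag>[01])\s*tags:\s*(?P<tags>.*?)"
    r"\s*foods:\s*(?P<foods>.*?)\s*drinks:\s*(?P<drinks>.*)",
    re.DOTALL,
)


def parse_condensed(text: str) -> dict:
    """Parse a condensed label string back into the structured form."""
    match = _CONDENSED_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a condensed label: {text!r}")

    def split_items(raw: str) -> list:
        return [item.strip() for item in raw.split(",") if item.strip()]

    return {
        "is_food_or_drink": match["flag"] == "1",
        "tags": split_items(match["tags"]),
        "food_items": split_items(match["foods"]),
        "drink_items": split_items(match["drinks"]),
    }


condensed = condense_label(EXAMPLE_LABEL)
print(condensed)
print(parse_condensed(condensed) == EXAMPLE_LABEL)
```

A parser like this is mainly useful in step 4, where a small model's condensed output has to be compared field by field against the `gpt-oss-120b` label. Note that it splits items on commas, so an item that itself contains a comma would not round-trip.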

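The labelling prompt in "Steps to construct the dataset" contains literal JSON braces and an `'<input_text>'` placeholder, and it asks the model to return only JSON using keys of the tag dictionary. A minimal sketch of how such a prompt could be filled and how a response could be checked is shown below. The card does not describe the author's actual substitution or validation code, so `build_prompt`, `validate_label`, and the shortened `PROMPT_TEMPLATE` are hypothetical stand-ins.

```python
import json

# Keys of tags_dict from this card.
VALID_TAGS = {"np", "il", "me", "re", "fi", "di", "fa", "fp"}
EXPECTED_KEYS = {"is_food_or_drink", "tags", "food_items", "drink_items"}

# Hypothetical, shortened stand-in for BASE_PROMPT (the full prompt is in the card).
PROMPT_TEMPLATE = (
    'Return JSON shaped like {"is_food_or_drink": ..., "tags": [...], '
    '"food_items": [...], "drink_items": [...]}.\n'
    "Input text:\n'<input_text>'"
)


def build_prompt(text: str, template: str = PROMPT_TEMPLATE) -> str:
    """Insert the passage into the template.

    str.replace is used instead of str.format because the template contains
    literal braces, which str.format would try to interpret as fields.
    """
    return template.replace("<input_text>", text)


def validate_label(raw: str) -> dict:
    """Parse a model response and check it against the structure the prompt asks for."""
    label = json.loads(raw)
    if not isinstance(label, dict) or set(label) != EXPECTED_KEYS:
        raise ValueError("response does not have exactly the expected keys")

    # The prompt describes the flag as a string-encoded bool, so accept both forms.
    flag = label["is_food_or_drink"]
    if isinstance(flag, str):
        flag = flag.strip().lower() == "true"
    label["is_food_or_drink"] = bool(flag)

    unknown = set(label["tags"]) - VALID_TAGS
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")

    # Non-food passages must come back with empty lists.
    if not label["is_food_or_drink"] and any(
        label[key] for key in ("tags", "food_items", "drink_items")
    ):
        raise ValueError("non-food label must have empty lists")
    return label


prompt = build_prompt("Two flat whites and a croissant, please.")
response = ('{"is_food_or_drink": "true", "tags": ["fi", "di"], '
            '"food_items": ["croissant"], "drink_items": ["flat whites"]}')
print(validate_label(response))
```

Rejecting responses with unknown tags or non-empty lists on non-food passages is one way to keep malformed labels out of the fine-tuning targets; whether the original pipeline filtered this way is not stated in the card.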
Provider: mrdbourke