mrdbourke/food-drink-items-1B

Name: mrdbourke/food-drink-items-1B
Creator: mrdbourke
Published: 2026-03-21 06:31:51
License: No description provided

Source: Hugging Face. Updated 2026-03-21; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/mrdbourke/food-drink-items-1B


Description:

---
dataset_info:
  features:
  - name: item
    dtype: string
  - name: item_type
    dtype: string
  - name: count_re
    dtype: int64
  - name: count_org
    dtype: int64
  - name: count_total
    dtype: int64
  - name: selection
    dtype: string
  - name: is_human_edible
    dtype: bool
  - name: is_generic_label
    dtype: bool
  - name: is_branded_item
    dtype: bool
  - name: is_raw_ingredient
    dtype: bool
  - name: is_dish
    dtype: bool
  - name: is_container_or_utensil
    dtype: bool
  - name: food_categories
    dtype: string
  - name: canonical_form
    dtype: string
  splits:
  - name: train
    num_examples: 292783
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train/*.parquet
license: apache-2.0
task_categories:
- text-classification
- zero-shot-classification
language:
- en
tags:
- food
- drink
- food-classification
- food-extraction
- food-vocabulary
- datacomp
- billion-scale
size_categories:
- 100K<n<1M
source_datasets:
- UCSC-VLAA/Recap-DataComp-1B
---

# Food & Drink Items from 1 Billion Image Captions

A structured vocabulary of **292,783 unique food and drink items** extracted from [UCSC-VLAA/Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) (1 billion image captions), enriched with multi-label category tags and edibility classification.

## Overview

| | Count | Percentage |
|---|---|---|
| **Total items** | 292,783 | 100% |
| **Human edible** | 218,293 | 74.6% |
| **Non-edible** | 74,490 | 25.4% |
| **Food items** | 228,504 | |
| **Drink items** | 64,279 | |
| **Branded items** | 80,143 | |
| **Dishes** | 73,950 | |
| **Raw ingredients** | 38,195 | |
| **Containers/utensils** | 12,091 | |
| **Unique canonical forms** | 277,249 | |
| **Total mentions across 1B captions** | 253,522,547 | |

## How it was made

1. **Source:** [UCSC-VLAA/Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B), 1 billion image-caption pairs
2. **Text classification:** [mrdbourke/Recap-DataComp-1B-FoodOrDrink](https://huggingface.co/datasets/mrdbourke/Recap-DataComp-1B-FoodOrDrink), ~106M rows classified as food/drink
3. **Food extraction:** [mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2](https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2), which extracted specific food/drink item names from both `re_caption` (AI-generated) and `org_caption` (web alt-text)
4. **Counting:** Item frequencies computed across all 106M rows from both caption types
5. **Enrichment:** Each item classified by [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) with structured metadata (edibility, category tags, canonical form)

### Item selection

| Selection | Count | Description |
|---|---|---|
| `threshold` | 95,174 | All items appearing >= 100 times (covers ~84% of all mentions) |
| `random_sample` | 197,609 | 1% random sample of items below threshold (long-tail coverage) |

## Usage

```python
import json
from collections import defaultdict

from datasets import load_dataset

ds = load_dataset("mrdbourke/food-drink-items-1B", split="train")
print(f"Total items: {len(ds):,}")

# All human-edible items sorted by frequency
edible = ds.filter(lambda x: x["is_human_edible"]).sort("count_total", reverse=True)
print(f"Edible items: {len(edible):,}")

# Parse multi-label categories
row = ds[0]
categories = json.loads(row["food_categories"])
print(f"{row['item']}: {categories}")

# Find all dishes
dishes = ds.filter(lambda x: '"dish"' in x["food_categories"])

# Find all seafood items
seafood = ds.filter(lambda x: '"seafood"' in x["food_categories"])

# Get branded items
brands = ds.filter(lambda x: x["is_branded_item"])

# Non-food items (useful as training negatives)
non_food = ds.filter(lambda x: not x["is_human_edible"])

# Deduplicate by canonical form (merges "tomatoes" + "Tomatoes" + "tomato")
canonical = defaultdict(int)
for row in ds:
    canonical[row["canonical_form"]] += row["count_total"]
top_100 = sorted(canonical.items(), key=lambda x: -x[1])[:100]
```

## Fields

| Field | Type | Description |
|---|---|---|
| `item` | string | Original extracted item name (normalized to lowercase) |
| `item_type` | string | `"food"` or `"drink"`: which extraction list it came from |
| `count_re` | int | Number of times this item was extracted from `re_caption` (AI-generated captions) |
| `count_org` | int | Number of times this item was extracted from `org_caption` (web alt-text) |
| `count_total` | int | `count_re + count_org` |
| `selection` | string | `"threshold"` (count >= 100) or `"random_sample"` (1% of long tail) |
| `is_human_edible` | bool | Would a person eat or drink this? |
| `is_generic_label` | bool | Is this a vague descriptor? ("red liquid", "food", "beverage") |
| `is_branded_item` | bool | Is this a brand name? ("Coca-Cola", "Jack Daniel's") |
| `is_raw_ingredient` | bool | Single whole/unprocessed ingredient? ("apple" = true, "apple pie" = false) |
| `is_dish` | bool | Prepared/composed dish or recipe? ("pad thai" = true, "rice" = false) |
| `is_container_or_utensil` | bool | Container, vessel, or utensil? ("wine glass", "mug") |
| `food_categories` | string (JSON list) | Multi-label category tags (see below) |
| `canonical_form` | string | Normalized: lowercase, singular, stripped |

## Category tags

The `food_categories` field contains a JSON-encoded list of one or more tags. An item like "shrimp pad thai" would have `["dish", "seafood", "grain"]`.
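
Because the tags arrive as a JSON string, it can help to decode them once and then test membership on the resulting list rather than on the raw text. The sketch below is a minimal illustration using only the standard library. The three rows are hypothetical stand-ins shaped like dataset rows (only the "shrimp pad thai" tags come from the card), and `parse_tags` and `has_tag` are helper names invented for this example.

```python
import json
from collections import Counter

# Hypothetical rows shaped like the dataset; tag values are illustrative only.
rows = [
    {"item": "shrimp pad thai", "food_categories": '["dish", "seafood", "grain"]'},
    {"item": "french fries", "food_categories": '["side_dish", "vegetable"]'},
    {"item": "wine glass", "food_categories": '["non_food"]'},
]


def parse_tags(row):
    """Decode the JSON-encoded tag list into a Python list of strings."""
    return json.loads(row["food_categories"])


def has_tag(row, tag):
    """Exact tag membership on the decoded list (no substring matching)."""
    return tag in parse_tags(row)


# Tally how many items carry each tag.
tag_counts = Counter(tag for row in rows for tag in parse_tags(row))

# Items tagged exactly "dish"; "side_dish" does not count as "dish".
dishes = [row["item"] for row in rows if has_tag(row, "dish")]

print(tag_counts.most_common(3))
print(dishes)
```

On the real dataset the same helpers should work row by row, for example inside `ds.filter(lambda x: has_tag(x, "seafood"))`, though decoding every row is slower than a plain substring check.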

| Category | Count | % of items |
|---|---|---|
| `non_food` | 68,955 | 23.6% |
| `dish` | 56,216 | 19.2% |
| `drink` | 45,314 | 15.5% |
| `confectionary` | 32,440 | 11.1% |
| `baked_goods` | 30,878 | 10.5% |
| `meat` | 23,899 | 8.2% |
| `vegetable` | 23,113 | 7.9% |
| `fruit` | 21,802 | 7.4% |
| `grain` | 21,735 | 7.4% |
| `liquor` | 21,495 | 7.3% |
| `dairy` | 20,268 | 6.9% |
| `snack` | 16,267 | 5.6% |
| `other` | 15,718 | 5.4% |
| `condiments` | 15,471 | 5.3% |
| `supplement` | 9,374 | 3.2% |
| `seafood` | 9,023 | 3.1% |
| `additive` | 8,731 | 3.0% |
| `herbs_and_spices` | 7,106 | 2.4% |
| `nuts_and_seeds` | 6,097 | 2.1% |
| `sweetener` | 5,020 | 1.7% |
| `frozen_dessert` | 4,868 | 1.7% |
| `legume` | 4,377 | 1.5% |
| `spread` | 3,522 | 1.2% |
| `side_dish` | 2,297 | 0.8% |
| `eggs` | 2,291 | 0.8% |
| `oil` | 2,026 | 0.7% |
| `pet_food` | 1,678 | 0.6% |
| `cereals` | 1,466 | 0.5% |
| `fungi` | 1,368 | 0.5% |

## Caption types

Items were extracted from two different caption types, reflected in the count columns:

- **`count_re`**, from `re_caption`: AI-generated detailed captions (LLaVA-1.5-LLaMA3-8B). Produces generic visual descriptions ("meat", "vegetables", "sauce")
- **`count_org`**, from `org_caption`: Original web-crawled alt-text. Produces specific named items ("wagyu ribeye", "pad thai", "marinara sauce")

Using both gives complementary coverage. High `count_re` items are visually common; high `count_org` items are frequently named on the web.
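
One way to put the two count columns to work is to compute, per item, the share of mentions that came from web alt-text. The following is a rough sketch under stated assumptions: the counts are made-up illustrative numbers, `org_share` is a helper name invented here, and the 0.5 cut-off is an arbitrary choice rather than anything the dataset prescribes.

```python
def org_share(row):
    """Fraction of an item's mentions that came from org_caption (web alt-text)."""
    total = row["count_re"] + row["count_org"]
    return row["count_org"] / total if total else 0.0


# Hypothetical rows; the counts are invented for illustration.
rows = [
    {"item": "meat", "count_re": 900, "count_org": 100},
    {"item": "wagyu ribeye", "count_re": 5, "count_org": 95},
    {"item": "sauce", "count_re": 600, "count_org": 400},
]

# Rank items from "mostly named on the web" to "mostly described visually".
ranked = sorted(rows, key=org_share, reverse=True)

# Arbitrary 0.5 threshold separating the two groups (an assumption).
mostly_named = [r["item"] for r in ranked if org_share(r) >= 0.5]
mostly_visual = [r["item"] for r in ranked if org_share(r) < 0.5]

print(mostly_named)
print(mostly_visual)
```

A ratio like this is only meaningful for items with enough mentions; for rare items a handful of captions can swing it heavily.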

## Models used

| Stage | Model | Purpose |
|---|---|---|
| Text classification | [mrdbourke/ettin-150m-food-or-drink-classifier](https://huggingface.co/mrdbourke/ettin-150m-food-or-drink-classifier) | Binary food/not-food on 1B captions |
| Food extraction | [mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2](https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2) | Extract item names + tags from captions |
| Item enrichment | [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) | Classify items with structured metadata |

## Limitations

- Item names are extracted by a 270M parameter model and may include errors
- Enrichment labels are from a 9B parameter model, not human-annotated
- Long-tail items (count < 100) are sampled at 1%, not exhaustive
- `food_categories` is stored as a JSON string, not a native list type
- Some canonical forms may not perfectly deduplicate (e.g. regional spellings)
- Counts reflect caption frequency, not real-world food popularity

## License

Apache 2.0, consistent with the source dataset and all models used.

## Source

- **Source dataset**: [UCSC-VLAA/Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B)
- **Filtered dataset**: [mrdbourke/Recap-DataComp-1B-FoodOrDrink](https://huggingface.co/datasets/mrdbourke/Recap-DataComp-1B-FoodOrDrink)
- **FoodExtract model**: [mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2](https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2)
- **Enrichment model**: [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B)
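
As the Limitations note, rows with `selection == "random_sample"` are a 1% sample of the long tail, so aggregate statistics over the whole file under-represent rare items. A common workaround is inverse-probability weighting: let each sampled row stand in for 100 items. The sketch below assumes the 1% sample was uniform over all below-threshold items, which the card does not state explicitly. The rows are hypothetical, and `SAMPLE_WEIGHT` and `item_weight` are names invented for this example.

```python
# Assumption: every below-threshold item had the same 1% chance of being kept,
# so each sampled row represents roughly 100 long-tail items.
SAMPLE_WEIGHT = 100


def item_weight(row):
    """Number of distinct source items one row is assumed to stand in for."""
    return 1 if row["selection"] == "threshold" else SAMPLE_WEIGHT


# Hypothetical rows; names and counts are invented for illustration.
rows = [
    {"item": "pizza", "selection": "threshold", "count_total": 50000},
    {"item": "yuzu kosho aioli", "selection": "random_sample", "count_total": 12},
    {"item": "smoked eel toast", "selection": "random_sample", "count_total": 3},
]

# Weighted estimates of how many distinct items and mentions the rows represent.
estimated_distinct_items = sum(item_weight(r) for r in rows)
estimated_mentions = sum(item_weight(r) * r["count_total"] for r in rows)

print(estimated_distinct_items)
print(estimated_mentions)
```

Under the same assumption, the 197,609 `random_sample` rows would correspond to roughly 19.8 million distinct long-tail items. Treat any such figure as a rough estimate, not a measured count.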

Provider:

mrdbourke
