mrdbourke/food-drink-items-1B
Hugging Face · Updated 2026-03-21 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/mrdbourke/food-drink-items-1B
Resource description:
---
dataset_info:
  features:
  - name: item
    dtype: string
  - name: item_type
    dtype: string
  - name: count_re
    dtype: int64
  - name: count_org
    dtype: int64
  - name: count_total
    dtype: int64
  - name: selection
    dtype: string
  - name: is_human_edible
    dtype: bool
  - name: is_generic_label
    dtype: bool
  - name: is_branded_item
    dtype: bool
  - name: is_raw_ingredient
    dtype: bool
  - name: is_dish
    dtype: bool
  - name: is_container_or_utensil
    dtype: bool
  - name: food_categories
    dtype: string
  - name: canonical_form
    dtype: string
  splits:
  - name: train
    num_examples: 292783
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train/*.parquet
license: apache-2.0
task_categories:
- text-classification
- zero-shot-classification
language:
- en
tags:
- food
- drink
- food-classification
- food-extraction
- food-vocabulary
- datacomp
- billion-scale
size_categories:
- 100K<n<1M
source_datasets:
- UCSC-VLAA/Recap-DataComp-1B
---
# Food & Drink Items from 1 Billion Image Captions
A structured vocabulary of **292,783 unique food and drink items** extracted from [UCSC-VLAA/Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) (1 billion image captions), enriched with multi-label category tags and edibility classification.
## Overview
| | Count | Percentage |
|---|---|---|
| **Total items** | 292,783 | 100% |
| **Human edible** | 218,293 | 74.6% |
| **Non-edible** | 74,490 | 25.4% |
| **Food items** | 228,504 | |
| **Drink items** | 64,279 | |
| **Branded items** | 80,143 | |
| **Dishes** | 73,950 | |
| **Raw ingredients** | 38,195 | |
| **Containers/utensils** | 12,091 | |
| **Unique canonical forms** | 277,249 | |
| **Total mentions across 1B captions** | 253,522,547 | |
## How it was made
1. **Source:** [UCSC-VLAA/Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) — 1 billion image-caption pairs
2. **Text classification:** [mrdbourke/Recap-DataComp-1B-FoodOrDrink](https://huggingface.co/datasets/mrdbourke/Recap-DataComp-1B-FoodOrDrink) — ~106M rows classified as food/drink
3. **Food extraction:** [mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2](https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2) — extracted specific food/drink item names from both `re_caption` (AI-generated) and `org_caption` (web alt-text)
4. **Counting:** Item frequencies computed across all 106M rows from both caption types
5. **Enrichment:** Each item classified by [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) with structured metadata (edibility, category tags, canonical form)
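
The counting step (step 4) can be pictured with a small sketch. This is not the author's pipeline code: the per-caption item lists below are hypothetical stand-ins for extractor output, and the sketch assumes each extracted item counts once per appearance.

```python
from collections import Counter

# Hypothetical extractor output: one list of item names per caption.
re_items = [["meat", "sauce"], ["pizza"], ["meat"]]   # from re_caption
org_items = [["wagyu ribeye"], ["pizza"], []]         # from org_caption


def count_items(extractions):
    """Count how often each lowercased item name appears across captions."""
    counter = Counter()
    for items in extractions:
        counter.update(item.lower() for item in items)
    return counter


count_re = count_items(re_items)
count_org = count_items(org_items)
count_total = count_re + count_org  # per-item sum of both caption types

for item, n in count_total.most_common():
    print(item, count_re[item], count_org[item], n)
```

The three counters correspond to the `count_re`, `count_org`, and `count_total` columns described in the Fields section.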
### Item selection
| Selection | Count | Description |
|---|---|---|
| `threshold` | 95,174 | All items appearing >= 100 times (covers ~84% of all mentions) |
| `random_sample` | 197,609 | 1% random sample of items below threshold (long-tail coverage) |
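
The selection rule in the table above can be sketched as follows. The card does not say how the 1% sample was drawn, so the per-item random draw, the `seed` parameter, and the function name `select_items` are all assumptions made for illustration.

```python
import random


def select_items(counts, threshold=100, sample_rate=0.01, seed=0):
    """Label items 'threshold' or 'random_sample'; unsampled long-tail items are dropped.

    Assumption: each below-threshold item is kept independently with
    probability `sample_rate`. The real sampling procedure is not documented.
    """
    rng = random.Random(seed)
    selected = {}
    for item, count in counts.items():
        if count >= threshold:
            selected[item] = "threshold"
        elif rng.random() < sample_rate:
            selected[item] = "random_sample"
    return selected


# Hypothetical counts, for illustration only.
counts = {"pizza": 5000, "coffee": 100, "obscure regional stew": 3}
print(select_items(counts))
```

Under this reading, each `random_sample` row stands in for roughly 100 long-tail items, which is worth keeping in mind when aggregating counts over the whole dataset.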
## Usage
```python
import json
from collections import defaultdict

from datasets import load_dataset

ds = load_dataset("mrdbourke/food-drink-items-1B", split="train")
print(f"Total items: {len(ds):,}")

# All human-edible items, sorted by frequency (most frequent first)
edible = ds.filter(lambda x: x["is_human_edible"]).sort("count_total", reverse=True)
print(f"Edible items: {len(edible):,}")

# Parse multi-label categories (stored as a JSON-encoded string)
row = ds[0]
categories = json.loads(row["food_categories"])
print(f"{row['item']}: {categories}")

# Find all items tagged "dish" (the surrounding quotes keep "side_dish" from matching)
dishes = ds.filter(lambda x: '"dish"' in x["food_categories"])

# Find all seafood items
seafood = ds.filter(lambda x: '"seafood"' in x["food_categories"])

# Get branded items
brands = ds.filter(lambda x: x["is_branded_item"])

# Non-edible items (useful as training negatives)
non_food = ds.filter(lambda x: not x["is_human_edible"])

# Aggregate counts by canonical form (e.g. merges "tomatoes" and "tomato")
canonical = defaultdict(int)
for row in ds:
    canonical[row["canonical_form"]] += row["count_total"]
top_100 = sorted(canonical.items(), key=lambda x: -x[1])[:100]
```
## Fields
| Field | Type | Description |
|---|---|---|
| `item` | string | Original extracted item name (normalized to lowercase) |
| `item_type` | string | `"food"` or `"drink"` — which extraction list it came from |
| `count_re` | int | Number of times this item was extracted from `re_caption` (AI-generated captions) |
| `count_org` | int | Number of times this item was extracted from `org_caption` (web alt-text) |
| `count_total` | int | `count_re + count_org` |
| `selection` | string | `"threshold"` (count >= 100) or `"random_sample"` (1% of long tail) |
| `is_human_edible` | bool | Would a person eat or drink this? |
| `is_generic_label` | bool | Is this a vague descriptor? ("red liquid", "food", "beverage") |
| `is_branded_item` | bool | Is this a brand name? ("Coca-Cola", "Jack Daniel's") |
| `is_raw_ingredient` | bool | Single whole/unprocessed ingredient? ("apple" = true, "apple pie" = false) |
| `is_dish` | bool | Prepared/composed dish or recipe? ("pad thai" = true, "rice" = false) |
| `is_container_or_utensil` | bool | Container, vessel, or utensil? ("wine glass", "mug") |
| `food_categories` | string (JSON list) | Multi-label category tags (see below) |
| `canonical_form` | string | Normalized: lowercase, singular, stripped |
## Category tags
The `food_categories` field contains a JSON-encoded list of one or more tags. An item like "shrimp pad thai" would have `["dish", "seafood", "grain"]`.
| Category | Count | % of items |
|---|---|---|
| `non_food` | 68,955 | 23.6% |
| `dish` | 56,216 | 19.2% |
| `drink` | 45,314 | 15.5% |
| `confectionary` | 32,440 | 11.1% |
| `baked_goods` | 30,878 | 10.5% |
| `meat` | 23,899 | 8.2% |
| `vegetable` | 23,113 | 7.9% |
| `fruit` | 21,802 | 7.4% |
| `grain` | 21,735 | 7.4% |
| `liquor` | 21,495 | 7.3% |
| `dairy` | 20,268 | 6.9% |
| `snack` | 16,267 | 5.6% |
| `other` | 15,718 | 5.4% |
| `condiments` | 15,471 | 5.3% |
| `supplement` | 9,374 | 3.2% |
| `seafood` | 9,023 | 3.1% |
| `additive` | 8,731 | 3.0% |
| `herbs_and_spices` | 7,106 | 2.4% |
| `nuts_and_seeds` | 6,097 | 2.1% |
| `sweetener` | 5,020 | 1.7% |
| `frozen_dessert` | 4,868 | 1.7% |
| `legume` | 4,377 | 1.5% |
| `spread` | 3,522 | 1.2% |
| `side_dish` | 2,297 | 0.8% |
| `eggs` | 2,291 | 0.8% |
| `oil` | 2,026 | 0.7% |
| `pet_food` | 1,678 | 0.6% |
| `cereals` | 1,466 | 0.5% |
| `fungi` | 1,368 | 0.5% |
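
A table like the one above could be recomputed by decoding each row's tag list and tallying the tags. The sketch below uses three hypothetical rows; only the "shrimp pad thai" tags come from this card, and the other two rows are invented for illustration.

```python
import json
from collections import Counter

# Hypothetical rows shaped like the dataset's `item` / `food_categories` columns.
rows = [
    {"item": "shrimp pad thai", "food_categories": '["dish", "seafood", "grain"]'},
    {"item": "wine glass", "food_categories": '["non_food"]'},
    {"item": "salmon", "food_categories": '["seafood"]'},
]


def tag_counts(rows):
    """Count how many rows carry each category tag."""
    counts = Counter()
    for row in rows:
        counts.update(json.loads(row["food_categories"]))
    return counts


counts = tag_counts(rows)
shares = {tag: n / len(rows) for tag, n in counts.items()}

for tag, n in counts.most_common():
    print(f"{tag}: {n} ({shares[tag]:.1%})")
```

Because an item can carry several tags, the per-tag shares are computed against the number of items and do not need to sum to 100%.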
## Caption types
Items were extracted from two caption types, reflected in the count columns:
- **`count_re`**, from `re_caption`: detailed AI-generated captions (LLaVA-1.5-LLaMA3-8B). These tend to yield generic visual descriptions ("meat", "vegetables", "sauce").
- **`count_org`**, from `org_caption`: the original web-crawled alt-text. These tend to yield specific named items ("wagyu ribeye", "pad thai", "marinara sauce").

Using both gives complementary coverage: items with a high `count_re` are visually common, while items with a high `count_org` are frequently named on the web.
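
One way to use the two count columns is to look at what fraction of an item's mentions come from web alt-text. This is a minimal sketch: the helper name `org_share` is hypothetical and the counts in the example rows are invented, not taken from the dataset.

```python
# Hypothetical rows; the counts are invented for illustration only.
rows = [
    {"item": "meat", "count_re": 900, "count_org": 100},
    {"item": "wagyu ribeye", "count_re": 5, "count_org": 95},
]


def org_share(row):
    """Fraction of an item's mentions that come from web alt-text (org_caption)."""
    total = row["count_re"] + row["count_org"]
    return row["count_org"] / total if total else 0.0


for row in rows:
    print(f"{row['item']}: {org_share(row):.2f}")
```

A share near 0 suggests a generic visual term that mostly appears in AI-generated captions, while a share near 1 suggests a specific name that mostly appears in alt-text.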
## Models used
| Stage | Model | Purpose |
|---|---|---|
| Text classification | [mrdbourke/ettin-150m-food-or-drink-classifier](https://huggingface.co/mrdbourke/ettin-150m-food-or-drink-classifier) | Binary food/not-food on 1B captions |
| Food extraction | [mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2](https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2) | Extract item names + tags from captions |
| Item enrichment | [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) | Classify items with structured metadata |
## Limitations
- Item names are extracted by a 270M parameter model and may include errors
- Enrichment labels are from a 9B parameter model, not human-annotated
- Long-tail items (count < 100) are sampled at 1%, not exhaustive
- `food_categories` is stored as a JSON string, not a native list type
- Some canonical forms may not perfectly deduplicate (e.g. regional spellings)
- Counts reflect caption frequency, not real-world food popularity
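
Since `food_categories` is a JSON string, exact tag matching is safer after decoding than with substring checks. The helpers below are a sketch under that assumption: `decode_categories` and `has_tag` are hypothetical names, and treating malformed JSON as an empty tag list is a design choice of this example, not documented behavior of the dataset.

```python
import json


def decode_categories(row):
    """Return the row's category tags as a Python list ([] if the JSON is malformed)."""
    try:
        tags = json.loads(row["food_categories"])
    except (json.JSONDecodeError, TypeError):
        return []
    return tags if isinstance(tags, list) else []


def has_tag(row, tag):
    """True if `tag` is exactly one of the row's category tags."""
    return tag in decode_categories(row)


# Hypothetical rows, for illustration only.
side = {"item": "coleslaw", "food_categories": '["side_dish", "vegetable"]'}
broken = {"item": "???", "food_categories": "not json"}

print(has_tag(side, "side_dish"))   # exact match on the decoded list
print(has_tag(side, "dish"))        # "dish" is not among this row's tags
print(decode_categories(broken))
```

With the `datasets` library, a predicate like `has_tag` could be passed to `Dataset.filter`, for example `ds.filter(lambda x: has_tag(x, "dish"))`.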
## License
Apache 2.0 — consistent with source dataset and all models used.
## Source
- **Source dataset**: [UCSC-VLAA/Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B)
- **Filtered dataset**: [mrdbourke/Recap-DataComp-1B-FoodOrDrink](https://huggingface.co/datasets/mrdbourke/Recap-DataComp-1B-FoodOrDrink)
- **FoodExtract model**: [mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2](https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2)
- **Enrichment model**: [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B)
Provider:
mrdbourke



