mrdbourke/Recap-DataComp-1B-FoodOrDrink
Source: Hugging Face · Updated: 2026-03-20 · Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/mrdbourke/Recap-DataComp-1B-FoodOrDrink
---
dataset_info:
  features:
  - name: label
    dtype: string
  - name: score
    dtype: float64
  - name: url
    dtype: string
  - name: re_caption
    dtype: string
  - name: org_caption
    dtype: string
  - name: sha256
    dtype: string
  - name: key
    dtype: string
  - name: re_clip_score
    dtype: float64
  - name: org_clip_score
    dtype: float64
  - name: re_length
    dtype: int64
  - name: org_length
    dtype: int64
  - name: re_gpt4v_score
    dtype: int64
  - name: org_gpt4v_score
    dtype: int64
  - name: re_caption_condition_diverse_topk
    dtype: string
  - name: re_condition_length
    dtype: int64
  - name: food_extract_is_food_or_drink_re
    dtype: bool
  - name: food_extract_raw_re
    dtype: string
  - name: food_extract_json_re
    dtype: string
  - name: food_extract_is_food_or_drink_org
    dtype: bool
  - name: food_extract_raw_org
    dtype: string
  - name: food_extract_json_org
    dtype: string
  splits:
  - name: train
    num_examples: 106230157
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train/*.parquet
license: apache-2.0
task_categories:
- text-classification
- image-classification
- zero-shot-classification
language:
- en
tags:
- food
- drink
- food-classification
- food-extraction
- datacomp
- image-text
- caption-classification
- filtered-dataset
size_categories:
- 100M<n<1B
source_datasets:
- UCSC-VLAA/Recap-DataComp-1B
---
# Recap-DataComp-1B: Food or Drink
A filtered subset of [Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) containing **106,230,157 rows** classified as food/drink content, enriched with structured food/drink extraction from [FoodExtract-v2](https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2).
## Overview
| | Count | Percentage |
|---|---|---|
| **Total rows** | 106,230,157 | 100% |
| **Food/drink** (Stage 5 label) | 96,618,895 | 91.0% |
| **Not food/drink** (Stage 5 label) | 9,611,262 | 9.0% |
| **FoodExtract (re_caption): food/drink** | 79,519,489 | 74.9% |
| **FoodExtract (re_caption): not food/drink** | 26,710,156 | 25.1% |
| **FoodExtract (re_caption): null (failed)** | 512 | 0.00% |
| **Label vs FoodExtract (re_caption) agree** | 88,114,651 | 82.9% |
| **Label vs FoodExtract (re_caption) disagree** | 18,114,994 | 17.1% |
| **FoodExtract (org_caption): food/drink** | 67,807,188 | 63.8% |
| **FoodExtract (org_caption): not food/drink** | 38,409,170 | 36.2% |
| **FoodExtract (org_caption): null (failed)** | 13,799 | 0.01% |
| **Source shards processed** | 4,627 / 4,627 | 100.0% |
| **Files** | 4,627 parquet files | |
| **Size on disk** | 59.43 GB | |
## How it was made
This dataset was created through a multi-stage knowledge distillation and enrichment pipeline:
1. **Teacher labeling** (Stage 1): [ModernBERT-large-zeroshot-v2.0](https://huggingface.co/MoritzLaurer/ModernBERT-large-zeroshot-v2.0) (400M params) classified 10M captions from Recap-DataComp-1B as `food or drink` / `not food or drink` using zero-shot NLI
2. **Training data** (Stage 2): The labeled 10M rows were balanced 50/50 and uploaded as [mrdbourke/food-or-drink-10m](https://huggingface.co/datasets/mrdbourke/food-or-drink-10m) (1.57M rows)
3. **Student fine-tuning** (Stage 3): [jhu-clsp/ettin-encoder-150m](https://huggingface.co/jhu-clsp/ettin-encoder-150m) was fine-tuned on the training data + [mrdbourke/FoodExtract-135k](https://huggingface.co/datasets/mrdbourke/FoodExtract-135k) to create [mrdbourke/ettin-150m-food-or-drink-classifier](https://huggingface.co/mrdbourke/ettin-150m-food-or-drink-classifier) (94.6% accuracy, 0.9475 F1)
4. **Full inference** (Stage 5): The fine-tuned classifier processed all **1 billion rows** of Recap-DataComp-1B (`condition_diverse_topk` subset) at ~4,900 rows/s on an RTX 4090
5. **Filtering** (Stage 5): All rows labeled food/drink were kept (100%), along with a 10% random sample of the rows labeled not food/drink
6. **FoodExtract enrichment** (Stage 7): Every row was processed through [FoodExtract-v2](https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2) (Gemma 3 270M fine-tuned on 135K samples) to extract structured food/drink items, tags, and an independent food/drink classification
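
The Stage 5 filtering rule (step 5 above) can be sketched as a keep-all-positives, sample-the-negatives pass. This is a minimal illustration rather than the author's actual script: the function names, the `seed` parameter, and the use of Python's `random` module are assumptions made for the example.

```python
import random

FOOD_LABEL = "food_or_drink"  # label string as used in this dataset


def keep_row(label: str, rng: random.Random, not_food_rate: float = 0.10) -> bool:
    """Keep every food/drink row; keep any other row with probability `not_food_rate`."""
    if label == FOOD_LABEL:
        return True
    return rng.random() < not_food_rate


def filter_rows(rows, seed: int = 0, not_food_rate: float = 0.10):
    """Yield the rows that survive the keep-all-food / sample-the-rest rule."""
    rng = random.Random(seed)  # hypothetical seed; the original seed is not documented
    for row in rows:
        if keep_row(row["label"], rng, not_food_rate):
            yield row


# Stand-in rows so the sketch runs without any download.
demo_rows = [{"label": "food_or_drink"}] * 1_000 + [{"label": "not_food_or_drink"}] * 10_000
kept = list(filter_rows(demo_rows))
kept_food = sum(r["label"] == FOOD_LABEL for r in kept)
kept_not_food = len(kept) - kept_food
print(kept_food, kept_not_food)
```

Sampling row by row keeps the pass streaming-friendly, at the cost of the negative sample being only approximately 10% of the input.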
### Pipeline diagram
```
Recap-DataComp-1B (1B rows, condition_diverse_topk subset)
    │
    ├── 10M sample ──→ ModernBERT-large zero-shot NLI (teacher)
    │                        │
    │                        ▼
    │                  food-or-drink-10m (1.57M balanced labels)
    │                        │
    │                        ▼
    │                  Fine-tune Ettin-encoder-150m (student)
    │                        │
    ▼                        ▼
Full 1B rows ──→ Ettin classifier (4,900 rows/s, fp16)
    │
    ▼
~110M classified rows (label + score)
    │
    ▼
FoodExtract-v2 (Gemma 3 270M, ~1,300 rows/s)
    │
    ▼
This dataset (106.2M rows, enriched)
```
## FoodExtract enrichment (Stage 7 + 7.5)
After the initial text classification (Stage 5), every row was processed through [FoodExtract-gemma-3-270m-fine-tune-v2](https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2), a 270M parameter language model fine-tuned on [mrdbourke/FoodExtract-135k](https://huggingface.co/datasets/mrdbourke/FoodExtract-135k) (135K samples labeled by `gpt-oss-120b`).
FoodExtract does three things for each caption:
1. **Re-classifies** the text as food/drink or not (independent of the Stage 5 label)
2. **Tags** the text with category labels (e.g. `fi` = food items, `di` = drink items, `re` = recipe, `me` = menu, `il` = ingredient list, `np` = nutrition panel, `fa` = food advertisement, `fp` = food packaging)
3. **Extracts** specific food and drink item names as structured lists
FoodExtract was run on **both** caption types:
- **`re_caption`** (AI-generated detailed captions from LLaVA) → `_re` suffix columns — produces generic visual descriptions (e.g. "meat", "vegetables", "sauce")
- **`org_caption`** (original web-crawled alt-text) → `_org` suffix columns — produces specific named items (e.g. "wagyu ribeye", "pad thai", "marinara")
Using both gives complementary coverage: `re_caption` for visual content and `org_caption` for specific named items.
This enables fine-grained filtering beyond binary classification — for example, finding all rows that contain recipes, or searching for specific ingredients across the dataset.
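
To make the "complementary coverage" idea concrete, the sketch below merges the food items extracted from both caption types for a single row. The helper names are hypothetical, and lower-casing before de-duplication is an assumption of this example, not something the dataset does for you.

```python
import json


def items_from(json_str, key):
    """Return the list stored under `key`, or [] if the JSON is missing or malformed."""
    if not json_str:
        return []
    try:
        parsed = json.loads(json_str)
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, dict):
        return []
    value = parsed.get(key)
    return value if isinstance(value, list) else []


def merged_food_items(row):
    """Union of food items from both caption types, lower-cased and sorted."""
    items = items_from(row.get("food_extract_json_re"), "food_items")
    items = items + items_from(row.get("food_extract_json_org"), "food_items")
    return sorted({item.strip().lower() for item in items if isinstance(item, str)})


# Stand-in row that mimics the two JSON columns.
demo_row = {
    "food_extract_json_re": '{"is_food_or_drink": true, "tags": ["fi"], "food_items": ["meat", "sauce"], "drink_items": []}',
    "food_extract_json_org": '{"is_food_or_drink": true, "tags": ["fi"], "food_items": ["Wagyu Ribeye", "sauce"], "drink_items": []}',
}
print(merged_food_items(demo_row))
```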
### FoodExtract (re_caption) vs Stage 5 label agreement
| | Count | Percentage |
|---|---|---|
| **Agree** (both say same) | 88,114,651 | 82.9% |
| **Disagree** (different) | 18,114,994 | 17.1% |
| label=food, FoodExtract=not food | 17,606,960 | 16.6% |
| label=not_food, FoodExtract=food | 508,034 | 0.5% |
The `label=food, FoodExtract=not food` rows are likely Stage 5 false positives: captions that mention food-adjacent items (jars, packaging, kitchenware) but do not actually describe food or drink. FoodExtract can also be wrong, so treat this as a heuristic. These rows can be filtered out using `food_extract_is_food_or_drink_re`.
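
The breakdown in the table above could be reproduced on any sample of rows with a tally like the following. It is a sketch under the assumption that rows where FoodExtract returned `null` are excluded from the agree/disagree counts, which is what the totals in the table suggest; the bucket names are made up for this example.

```python
from collections import Counter


def agreement_bucket(row):
    """Classify one row by how the Stage 5 label and FoodExtract (re_caption) relate."""
    fe_is_food = row["food_extract_is_food_or_drink_re"]
    if fe_is_food is None:
        return "foodextract_null"  # extraction failed; counted separately
    label_is_food = row["label"] == "food_or_drink"
    if label_is_food == fe_is_food:
        return "agree"
    return "label_food_fe_not" if label_is_food else "label_not_fe_food"


def tally(rows):
    """Count rows per agreement bucket."""
    return Counter(agreement_bucket(row) for row in rows)


# Stand-in rows covering each case.
demo_rows = [
    {"label": "food_or_drink", "food_extract_is_food_or_drink_re": True},
    {"label": "food_or_drink", "food_extract_is_food_or_drink_re": False},
    {"label": "not_food_or_drink", "food_extract_is_food_or_drink_re": True},
    {"label": "not_food_or_drink", "food_extract_is_food_or_drink_re": False},
    {"label": "food_or_drink", "food_extract_is_food_or_drink_re": None},
]
counts = tally(demo_rows)
print(dict(counts))
```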
### Tag dictionary
| Tag | Meaning |
|---|---|
| `fi` | Food items |
| `di` | Drink items |
| `re` | Recipe |
| `me` | Menu |
| `il` | Ingredient list |
| `np` | Nutrition panel |
| `fa` | Food advertisement |
| `fp` | Food packaging |
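
For readability in downstream code, the tag codes can be expanded with a small lookup. The mapping below is copied from the table above; the function name and the "unknown tag" fallback are assumptions of this sketch.

```python
import json

# Tag codes and meanings, copied from the tag dictionary table.
TAG_MEANINGS = {
    "fi": "food items",
    "di": "drink items",
    "re": "recipe",
    "me": "menu",
    "il": "ingredient list",
    "np": "nutrition panel",
    "fa": "food advertisement",
    "fp": "food packaging",
}


def describe_tags(food_extract_json):
    """Map the tag codes in a FoodExtract JSON string to readable names."""
    if not food_extract_json:
        return []  # null column value, e.g. when extraction failed
    tags = json.loads(food_extract_json).get("tags") or []
    return [TAG_MEANINGS.get(tag, f"unknown tag: {tag}") for tag in tags]


example = '{"is_food_or_drink": true, "tags": ["fi", "re"], "food_items": ["flour"], "drink_items": []}'
print(describe_tags(example))
```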
### FoodExtract JSON schema
Both `food_extract_json_re` and `food_extract_json_org` contain a JSON string with this structure:
```json
{
  "is_food_or_drink": true,
  "tags": ["fi", "di"],
  "food_items": ["bacon", "eggs", "toast"],
  "drink_items": ["orange juice"]
}
```
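
Because the JSON is stored as a string column, it has to be parsed per row. One way to do that defensively is sketched below; the `FoodExtractResult` dataclass and `parse_food_extract` function are hypothetical names, and returning `None` for null or malformed values is a design choice of this example.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FoodExtractResult:
    """Typed view of the four keys in the FoodExtract JSON schema."""
    is_food_or_drink: bool
    tags: List[str] = field(default_factory=list)
    food_items: List[str] = field(default_factory=list)
    drink_items: List[str] = field(default_factory=list)


def parse_food_extract(raw: Optional[str]) -> Optional[FoodExtractResult]:
    """Parse a food_extract_json_* value; return None for null or malformed input."""
    if not raw:
        return None
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not isinstance(data.get("is_food_or_drink"), bool):
        return None
    return FoodExtractResult(
        is_food_or_drink=data["is_food_or_drink"],
        tags=list(data.get("tags") or []),
        food_items=list(data.get("food_items") or []),
        drink_items=list(data.get("drink_items") or []),
    )


parsed = parse_food_extract(
    '{"is_food_or_drink": true, "tags": ["fi", "di"], '
    '"food_items": ["bacon", "eggs", "toast"], "drink_items": ["orange juice"]}'
)
print(parsed)
```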
## Usage
```python
import json

from datasets import load_dataset

REPO_ID = "mrdbourke/Recap-DataComp-1B-FoodOrDrink"

# Stream the dataset (recommended for this size)
ds = load_dataset(REPO_ID, split="train", streaming=True)
for row in ds:
    print(row["label"], row["score"], row["re_caption"][:80])
    break

# Download the full split (~59.43 GB of parquet files). `datasets` memory-maps
# its on-disk Arrow cache, so this mainly needs disk space rather than RAM.
ds = load_dataset(REPO_ID, split="train")
print(f"Total rows: {len(ds):,}")

# Filter to food/drink only (Stage 5 label)
food = ds.filter(lambda x: x["label"] == "food_or_drink")

# Filter by confidence (recommended for high-precision applications)
high_conf_food = ds.filter(lambda x: x["label"] == "food_or_drink" and x["score"] >= 0.95)

# Get the not-food/drink evaluation sample
not_food = ds.filter(lambda x: x["label"] == "not_food_or_drink")

# Use FoodExtract columns for fine-grained filtering.
# Filter to rows where BOTH classifiers agree it's food (re_caption)
high_conf = ds.filter(
    lambda x: x["label"] == "food_or_drink" and x["food_extract_is_food_or_drink_re"] is True
)

# Extract specific food items from re_caption.
# The JSON columns can be null when extraction failed, hence the fallbacks.
row = ds[0]
fe_re = json.loads(row["food_extract_json_re"] or "{}")
print(f"Foods (re_caption): {fe_re.get('food_items', [])}")
print(f"Drinks (re_caption): {fe_re.get('drink_items', [])}")

# Extract specific food items from org_caption (often more specific names)
fe_org = json.loads(row["food_extract_json_org"] or "{}")
print(f"Foods (org_caption): {fe_org.get('food_items', [])}")
print(f"Drinks (org_caption): {fe_org.get('drink_items', [])}")

# Combine both caption types for maximum coverage
all_foods = set(fe_re.get("food_items", []) + fe_org.get("food_items", []))

# Find all rows tagged as recipes (cheap substring match on the raw JSON string;
# parse the JSON and inspect "tags" if an exact match is needed)
recipes = ds.filter(lambda x: '"re"' in (x["food_extract_json_re"] or ""))

# Keep only rows FoodExtract classifies as food/drink on re_caption. This drops
# likely Stage 5 false positives (label=food but FoodExtract says no), and also
# every row where FoodExtract returned false or null.
cleaned = ds.filter(lambda x: x["food_extract_is_food_or_drink_re"] is True)
```
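
When only a handful of matching rows is needed, scanning the whole split is wasteful. The helper below is a small sketch (the name `take_matching` is hypothetical) that lazily pulls the first `n` rows satisfying a predicate from any iterable of row dicts; a streaming dataset is one such iterable, while the example here uses stand-in rows so it runs offline.

```python
from itertools import islice


def take_matching(rows, predicate, n):
    """Lazily collect the first `n` rows for which `predicate` is true."""
    return list(islice(filter(predicate, rows), n))


# Stand-in rows; in practice `rows` could be the streaming dataset.
sample_rows = [
    {"label": "not_food_or_drink", "score": 0.91},
    {"label": "food_or_drink", "score": 0.97},
    {"label": "food_or_drink", "score": 0.62},
    {"label": "food_or_drink", "score": 0.99},
]
first_two_food = take_matching(sample_rows, lambda r: r["label"] == "food_or_drink", 2)
print([r["score"] for r in first_two_food])
```

Iteration stops as soon as `n` matches are found, so with a streamed source only as many rows as necessary are read.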
## Fields
### Stage 5 columns (text classification)
| Field | Type | Description |
|---|---|---|
| `label` | string | `food_or_drink` or `not_food_or_drink` — from Stage 5 Ettin classifier |
| `score` | float | Classifier confidence (0.5–1.0) |
| `url` | string | Original image URL from DataComp-1B |
| `re_caption` | string | AI-generated detailed caption (LLaVA-1.5-LLaMA3-8B) |
| `org_caption` | string | Original web-crawled caption (often noisy alt-text) |
| `re_caption_condition_diverse_topk` | string | Condition-diverse caption variant (v2) |
| `sha256` | string | Image content hash |
| `key` | string | Row key from source dataset |
| `re_clip_score` / `org_clip_score` | float | CLIP alignment scores for re/original captions |
| `re_length` / `org_length` | int | Caption token lengths |
| `re_gpt4v_score` / `org_gpt4v_score` | int | GPT-4V quality scores |
| `re_condition_length` | int | Condition caption token length |
### Stage 7 columns (FoodExtract enrichment)
| Field | Type | Description |
|---|---|---|
| `food_extract_is_food_or_drink_re` | bool | FoodExtract classification on `re_caption` — `true` if food/drink, `false` otherwise, `null` on failure |
| `food_extract_raw_re` | string | Raw condensed FoodExtract output from `re_caption` |
| `food_extract_json_re` | string | Parsed JSON from `re_caption` with keys: `is_food_or_drink`, `tags`, `food_items`, `drink_items` |
| `food_extract_is_food_or_drink_org` | bool | FoodExtract classification on `org_caption` — `true` if food/drink, `false` otherwise, `null` on failure |
| `food_extract_raw_org` | string | Raw condensed FoodExtract output from `org_caption` |
| `food_extract_json_org` | string | Parsed JSON from `org_caption` with keys: `is_food_or_drink`, `tags`, `food_items`, `drink_items` |
## Confidence distribution (Stage 5 score)
| Score range | Count | Percentage |
|---|---|---|
| 0.50-0.60 | 3,842,281 | 3.6% |
| 0.60-0.70 | 4,173,921 | 3.9% |
| 0.70-0.80 | 5,070,591 | 4.8% |
| 0.80-0.90 | 7,534,240 | 7.1% |
| 0.90-0.95 | 6,681,439 | 6.3% |
| 0.95-1.00 | 78,927,685 | 74.3% |
| **Average** | **0.9396** | |
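
As a sanity check, the percentages in the table can be recomputed from the counts, along with the share of rows that a given score threshold would retain. The counts are copied from the table; which side of each bucket boundary is inclusive is not stated, so the threshold share is approximate.

```python
# Row counts per Stage 5 score bucket, copied from the table above.
BUCKET_COUNTS = {
    "0.50-0.60": 3_842_281,
    "0.60-0.70": 4_173_921,
    "0.70-0.80": 5_070_591,
    "0.80-0.90": 7_534_240,
    "0.90-0.95": 6_681_439,
    "0.95-1.00": 78_927_685,
}

total = sum(BUCKET_COUNTS.values())
shares = {bucket: round(100 * count / total, 1) for bucket, count in BUCKET_COUNTS.items()}

# Approximate share of rows kept by a `score >= 0.80` filter.
kept_at_080 = sum(BUCKET_COUNTS[b] for b in ("0.80-0.90", "0.90-0.95", "0.95-1.00"))
share_at_080 = round(100 * kept_at_080 / total, 1)

print(total, shares, share_at_080)
```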
Higher scores indicate more confident predictions. Recommendations:
- **General use:** `score >= 0.80` (good balance of coverage and precision)
- **High precision:** `score >= 0.95` (removes most edge cases)
- **Maximum recall:** no filter (includes borderline cases)
- **Best quality:** combine `score >= 0.95` with `food_extract_is_food_or_drink_re == True`
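
The recommendations above can be written down as named presets. This is one possible encoding, not part of the dataset: the preset names are hypothetical, and the sketch assumes every preset (including "maximum recall") still requires the Stage 5 label to be `food_or_drink`.

```python
def passes(row, preset="general"):
    """Return True if `row` satisfies the named filtering preset."""
    is_food = row["label"] == "food_or_drink"
    if preset == "max_recall":
        return is_food  # no score threshold
    if preset == "general":
        return is_food and row["score"] >= 0.80
    if preset == "high_precision":
        return is_food and row["score"] >= 0.95
    if preset == "best_quality":
        return (
            is_food
            and row["score"] >= 0.95
            and row["food_extract_is_food_or_drink_re"] is True
        )
    raise ValueError(f"unknown preset: {preset}")


# Stand-in row: confident Stage 5 label, but FoodExtract disagrees.
demo_row = {
    "label": "food_or_drink",
    "score": 0.97,
    "food_extract_is_food_or_drink_re": False,
}
print({p: passes(demo_row, p) for p in ("max_recall", "general", "high_precision", "best_quality")})
```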
## Model details
### Stage 5: Text classifier
| | |
|---|---|
| **Model** | [mrdbourke/ettin-150m-food-or-drink-classifier](https://huggingface.co/mrdbourke/ettin-150m-food-or-drink-classifier) |
| **Architecture** | Ettin-encoder-150m (ModernBERT-based, 150M params) |
| **Training data** | [mrdbourke/food-or-drink-10m](https://huggingface.co/datasets/mrdbourke/food-or-drink-10m) + [mrdbourke/FoodExtract-135k](https://huggingface.co/datasets/mrdbourke/FoodExtract-135k) |
| **Inference** | fp16, ~4,900 rows/s (NVIDIA RTX 4090) |
| **Test accuracy** | 94.6% |
| **Test F1** | 0.9475 |
### Stage 7: FoodExtract
| | |
|---|---|
| **Model** | [mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2](https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2) |
| **Architecture** | Gemma 3 270M (fine-tuned with SFT via TRL) |
| **Training data** | [mrdbourke/FoodExtract-135k](https://huggingface.co/datasets/mrdbourke/FoodExtract-135k) (135K samples from gpt-oss-120b) |
| **Inference** | vLLM server, ~1,300 rows/s (NVIDIA RTX 4090) |
| **Capabilities** | Binary classification, category tagging, food/drink item extraction |
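
The quoted throughput figures give a rough sense of compute time. The back-of-envelope calculation below assumes the rates are sustained on a single GPU, that "1 billion rows" means exactly 1,000,000,000, and that the ~1,300 rows/s FoodExtract rate applies to one pass over one caption type; none of these are confirmed by the card.

```python
def hours_needed(rows: int, rows_per_second: float) -> float:
    """Wall-clock hours to process `rows` at a constant throughput."""
    return rows / rows_per_second / 3600


# Stage 5: Ettin classifier over the full source dataset.
stage5_hours = hours_needed(1_000_000_000, 4_900)

# Stage 7: one FoodExtract pass over this dataset (one caption type).
stage7_hours_per_pass = hours_needed(106_230_157, 1_300)

print(f"Stage 5: ~{stage5_hours:.1f} h, Stage 7: ~{stage7_hours_per_pass:.1f} h per pass")
```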
## What's classified as food or drink?
The classifier detects captions describing:
- **Food**: meals, dishes, ingredients, recipes, snacks, desserts, baked goods, produce, meat, seafood
- **Drinks**: coffee, tea, juice, wine, beer, cocktails, smoothies, soda, water, milk
- **Food scenes**: restaurant tables, grocery stores, kitchen cooking, food photography, menus
Common edge cases that may be included:
- Food packaging and product labels
- Food-themed art and illustrations
- Kitchenware and dining settings (plates, cups, teapots)
The FoodExtract columns help identify and filter these edge cases — rows where `label == "food_or_drink"` but `food_extract_is_food_or_drink_re == False` are likely false positives from Stage 5.
## Limitations
- Stage 5 labels are from an automated classifier (94.6% test accuracy), not human-annotated
- FoodExtract classifications are from a 270M parameter model and may contain errors
- ~5% of Stage 5 rows may be misclassified (food-shaped objects, food packaging, food-themed art)
- The `not_food_or_drink` rows are a 10% random sample, not the complete set
- Image URLs may become stale over time (images removed from source)
- Captions describe images but have not been verified for factual accuracy
- Food item extraction may miss items or include non-food items in complex captions
## Source & License
- **Source dataset**: [UCSC-VLAA/Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) (`condition_diverse_topk` subset)
- **Stage 5 classifier**: [mrdbourke/ettin-150m-food-or-drink-classifier](https://huggingface.co/mrdbourke/ettin-150m-food-or-drink-classifier)
- **Stage 7 FoodExtract model**: [mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2](https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2)
- **Training datasets**: [mrdbourke/food-or-drink-10m](https://huggingface.co/datasets/mrdbourke/food-or-drink-10m), [mrdbourke/FoodExtract-135k](https://huggingface.co/datasets/mrdbourke/FoodExtract-135k)
- **License**: Apache 2.0 (consistent with source dataset and all models used)
Provided by: mrdbourke