mrdbourke/DataComp-1B-food-and-drink-3M
Source: Hugging Face. Updated 2026-03-25; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/mrdbourke/DataComp-1B-food-and-drink-3M
---
dataset_info:
  features:
  - name: image
    dtype: image
  - name: key
    dtype: string
  - name: url
    dtype: string
  - name: caption
    dtype: string
  - name: sha256
    dtype: string
  - name: image_width
    dtype: int32
  - name: image_height
    dtype: int32
  - name: text_label
    dtype: string
  - name: text_score
    dtype: float32
  - name: food_extract_is_food_or_drink_re
    dtype: bool
  - name: siglip2_label
    dtype: string
  - name: siglip2_is_food_or_drink
    dtype: bool
  - name: siglip2_score
    dtype: float32
  - name: siglip2_task
    dtype: string
  - name: siglip2_top_prompt
    dtype: string
  - name: siglip2_food_score
    dtype: float32
  - name: siglip2_not_food_score
    dtype: float32
  - name: siglip2_top5_prompts
    dtype: string
  - name: text_siglip2_agreement
    dtype: bool
  - name: food_extract_siglip2_agreement
    dtype: bool
  - name: quality_tier
    dtype: string
  - name: siglip2_embedding
    sequence: float32
  - name: shard_id
    dtype: int32
configs:
- config_name: default
  data_files:
  - split: train
    path: data/*.parquet
license: cc-by-4.0
task_categories:
- image-classification
- zero-shot-image-classification
tags:
- food
- drink
- nutrition
- siglip2
- datacomp
- food-classification
- embeddings
- nutrify
size_categories:
- 1M<n<10M
---
# DataComp-1B Food and Drink 3M
3,108,047 food and not-food images extracted from [Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B), each classified by **three independent signals** and accompanied by a 1,152-dimensional [SigLIP2](https://huggingface.co/google/siglip2-so400m-patch16-512) embedding. Built for training food/drink classifiers, building FAISS search indices, and as a foundation for the [Nutrify VLM](https://nutrify.app), an on-device vision-language model for nutrition tracking.
## How this dataset was made
### The problem
[Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) contains **1 billion** image-caption pairs from web crawls. Somewhere in there are tens of millions of food and drink images — but which ones? There are no labels, the captions are noisy, and downloading all 1B images to check visually isn't practical.
### The approach: knowledge distillation + multi-signal agreement
We used a cascading pipeline in which each stage is fast and cheap, progressively narrowing 1B rows down to ~3.1M images with verified food / not-food labels:
```
1B captions ──→ text classifier (fast) ──→ 106M food rows
    ──→ FoodExtract (LLM)
    ──→ 5M URLs sampled
    ──→ 3.1M images downloaded
    ──→ SigLIP2 zero-shot (92 prompts)
    ──→ 3 signals compared
    ──→ this dataset
```
### Stage-by-stage breakdown
**Stage 1-4: Text classification (1B → 106M rows)**
A large zero-shot text classifier ([ModernBERT-large-zeroshot](https://huggingface.co/MoritzLaurer/ModernBERT-large-zeroshot-v2.0), 400M params) labeled 10M sample captions as "food or drink" / "not food or drink". A smaller student model ([ettin-encoder-150m](https://huggingface.co/jhu-clsp/ettin-encoder-150m)) was fine-tuned on those labels, achieving 94.6% accuracy at ~4,900 rows/sec on an RTX 4090. The student then classified all 1 billion rows, producing 106M candidate food/drink rows.
**Stage 7: FoodExtract (second signal)**
A fine-tuned [Gemma 3 270M model](https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2) performed structured food/drink extraction on each caption, independently determining whether the caption describes food or drink and extracting specific items. This gave us a second, independent signal for each of the 106M rows.
**Stage 10: Image download (106M → 3.1M images)**
We sampled 5M URLs (2.5M high-confidence food + 2.5M not-food) from the 106M rows, prioritizing rows where the text classifier (score ≥ 0.95) and FoodExtract agreed. Images were downloaded at 512px using [img2dataset](https://github.com/rom1504/img2dataset) with its default opt-out handling left on, so images served with an `X-Robots-Tag: noai` or `noimageai` header were excluded. At a 61.7% URL success rate (typical for years-old web crawl URLs), this yielded ~3.1M images.
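The selection rule described above can be sketched as a simple row predicate. This is an illustration rather than the pipeline's actual code: the function name `is_high_confidence` and the sample rows are hypothetical, "agreed" is read as "both caption signals give the same food / not-food verdict", and the rule is applied as a hard filter even though the text only says such rows were prioritized. Only the 0.95 threshold and the column names come from this card.

```python
# Hypothetical sketch of the Stage 10 candidate filter; names are illustrative.
TEXT_SCORE_THRESHOLD = 0.95  # threshold stated in the text


def is_high_confidence(row: dict) -> bool:
    """Return True when both caption-based signals agree with high confidence.

    Expects the dataset's own column names:
      text_label                        "food_or_drink" / "not_food_or_drink"
      text_score                        text classifier confidence
      food_extract_is_food_or_drink_re  FoodExtract verdict (bool)
    """
    text_says_food = row["text_label"] == "food_or_drink"
    signals_agree = text_says_food == row["food_extract_is_food_or_drink_re"]
    return signals_agree and row["text_score"] >= TEXT_SCORE_THRESHOLD


# Made-up rows for illustration.
rows = [
    {"text_label": "food_or_drink", "text_score": 0.99,
     "food_extract_is_food_or_drink_re": True},   # kept
    {"text_label": "food_or_drink", "text_score": 0.80,
     "food_extract_is_food_or_drink_re": True},   # score below threshold
    {"text_label": "not_food_or_drink", "text_score": 0.97,
     "food_extract_is_food_or_drink_re": True},   # signals disagree
]
candidates = [row for row in rows if is_high_confidence(row)]
print(len(candidates))  # 1
```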
**Stage 11: SigLIP2 zero-shot classification (third signal)**
Every downloaded image was classified by [SigLIP2-so400m-patch16-512](https://huggingface.co/google/siglip2-so400m-patch16-512) (878M params) using **92 carefully designed zero-shot prompts** — 44 food/drink prompts and 48 not-food prompts. The prompts were crafted to:
- Cover all Nutrify VLM target tasks: general food photos, nutrition panels, ingredient lists, recipes, menus, food packaging
- Include hard negatives that commonly confuse food classifiers: cleaning products, bathroom products, automotive fluids, pet food, empty containers, cosmetics
- Avoid food-related words in negative prompts (e.g. "kitchen" or "cooking") that would pull food images toward the wrong class
Each image also had its **1,152-dimensional embedding** saved for downstream similarity search and FAISS indexing.
See `prompt_taxonomy.json` in this repo for the full prompt list with task mappings.
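As a rough sketch of how per-prompt similarities could be reduced to the `siglip2_*` columns, the function below follows the column descriptions in this card: the label comes from the top-scoring prompt, and the food and not-food scores are the maxima within each prompt group. The function name, the example prompts, and the scores are made up for illustration; the real prompts are in `prompt_taxonomy.json`, and the exact aggregation used in the pipeline is not shown here.

```python
def summarize_prompt_scores(scores: dict, food_prompts: set) -> dict:
    """Reduce {prompt: similarity} to the per-image summary fields.

    `food_prompts` is the subset of prompts that count as food/drink;
    every other prompt is treated as not-food.
    """
    top_prompt = max(scores, key=scores.get)
    food_scores = [s for p, s in scores.items() if p in food_prompts]
    not_food_scores = [s for p, s in scores.items() if p not in food_prompts]
    return {
        "siglip2_top_prompt": top_prompt,
        "siglip2_score": scores[top_prompt],
        "siglip2_food_score": max(food_scores),
        "siglip2_not_food_score": max(not_food_scores),
        "siglip2_is_food_or_drink": top_prompt in food_prompts,
    }


# Hypothetical prompts and scores, not taken from prompt_taxonomy.json.
food_prompts = {"a photo of a plate of food", "a photo of a restaurant menu"}
scores = {
    "a photo of a plate of food": 0.31,
    "a photo of a restaurant menu": 0.12,
    "a photo of a bottle of shampoo": 0.05,
    "a photo of a car engine": 0.02,
}
summary = summarize_prompt_scores(scores, food_prompts)
print(summary["siglip2_top_prompt"], summary["siglip2_is_food_or_drink"])
```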
**Stage 11.5: Multi-signal agreement**
With three independent classifications per image (text classifier on caption, FoodExtract on caption, SigLIP2 on image), we computed agreement. The `quality_tier` column reflects how many signals agree:
| Tier | Criteria | Count | % |
|------|----------|-------|---|
| gold | All 3 signals agree | 2,741,057 | 88.2% |
| silver | 2 of 3 agree | 366,990 | 11.8% |
| bronze | Max disagreement | 0 | 0.0% |
The **gold tier** rows (88.2%) are high-confidence training data where text understanding and visual understanding independently reached the same conclusion.
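The tier logic can be sketched in a few lines, assuming each signal is reduced to a binary food / not-food verdict (the function name and signature are illustrative, not the pipeline's code). Enumerating all eight combinations also shows why the bronze tier is empty: with three binary votes, at least two always agree.

```python
from collections import Counter
from itertools import product


def quality_tier(text_is_food: bool,
                 food_extract_is_food: bool,
                 siglip2_is_food: bool) -> str:
    """Map three binary verdicts to a tier name."""
    votes_for_food = sum([text_is_food, food_extract_is_food, siglip2_is_food])
    # 0 or 3 votes for "food" means all three signals agree.
    return "gold" if votes_for_food in (0, 3) else "silver"


# Every possible combination of three binary signals.
tiers = Counter(
    quality_tier(*combo) for combo in product([False, True], repeat=3)
)
print(tiers["gold"], tiers["silver"], tiers["bronze"])  # 2 6 0
```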
### Why three signals?
Each signal has different strengths and failure modes:
| Signal | Strength | Weakness |
|--------|----------|----------|
| Text classifier | Fast, runs on captions (no images needed) | Misses visual context (a "jar" caption could be food or candles) |
| FoodExtract | Understands food items specifically | Also caption-only, misses when captions are wrong |
| SigLIP2 | Sees the actual image | Zero-shot = lower precision than fine-tuned, struggles with ambiguous images |
When all three agree, we're confident. When they disagree, those are the genuinely hard cases — food packaging that looks like cleaning products, food art that looks like paintings, blurry images of ambiguous objects. The `quality_tier` column lets you choose your confidence threshold.
## Key numbers
| Metric | Value |
|--------|-------|
| Total images | 3,108,047 |
| Food / drink (SigLIP2) | ~1.4M (45.0%) |
| Not food / drink (SigLIP2) | ~1.7M (55.0%) |
| Image resolution | 512px (shortest edge, keep aspect ratio) |
| Text↔SigLIP2 agreement | 90.2% |
| FoodExtract↔SigLIP2 agreement | 88.7% |
| Triple agreement (gold tier) | 88.2% |
| Embedding model | google/siglip2-so400m-patch16-512 |
| Embedding dim | 1,152 |
## Task distribution
The `siglip2_task` column pre-sorts food images by the type of content, mapped to [Nutrify VLM](https://nutrify.app) training tasks:
| Task tag | Description | Count | Use case |
|----------|-------------|-------|----------|
| task2_food_image | General food/drink photos | ~1.06M | Train food recognition |
| food_adjacent | Food packaging, grocery, cooking | ~257K | Product photo understanding |
| task6_ingredients_list | Ingredients on packaging | ~27K | Ingredient extraction |
| task7_recipe | Recipes in cookbooks/cards | ~24K | Recipe parsing |
| task8_menu | Restaurant menus | ~23K | Menu extraction |
| task5_nutrition_panel | Nutrition facts labels | ~4.7K | Nutrition panel reading |
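To check the task distribution on a sample of your own, a small tally over rows is enough. The helper below is a sketch with a hypothetical name, `count_tasks`; it only assumes rows are dict-like and carry the `siglip2_task` and `siglip2_is_food_or_drink` columns. The sample rows are invented.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_tasks(rows: Iterable[Mapping]) -> Counter:
    """Tally `siglip2_task` values for rows SigLIP2 labeled as food/drink."""
    return Counter(
        row["siglip2_task"] for row in rows if row["siglip2_is_food_or_drink"]
    )


# Made-up rows standing in for a streamed sample.
sample = [
    {"siglip2_task": "task2_food_image", "siglip2_is_food_or_drink": True},
    {"siglip2_task": "task8_menu", "siglip2_is_food_or_drink": True},
    {"siglip2_task": "task2_food_image", "siglip2_is_food_or_drink": True},
    {"siglip2_task": "not_food", "siglip2_is_food_or_drink": False},
]
task_counts = count_tasks(sample)
print(task_counts.most_common())
```

With the streamed dataset, the same function can be fed `ds["train"].take(10_000)`. If your installed `datasets` version provides `IterableDataset.select_columns`, selecting only the two columns first should avoid decoding images.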
## Column reference
| Column | Type | Description |
|--------|------|-------------|
| `image` | Image | The image (512px, loaded as PIL) |
| `key` | string | img2dataset sample key |
| `url` | string | Original source URL from DataComp-1B |
| `caption` | string | AI-generated caption (LLaVA-1.5-LLaMA3-8B) |
| `sha256` | string | Image content hash (for deduplication) |
| `image_width` | int32 | Downloaded image width in pixels |
| `image_height` | int32 | Downloaded image height in pixels |
| `text_label` | string | Stage 5 text classifier: food_or_drink / not_food_or_drink |
| `text_score` | float32 | Stage 5 classifier confidence (0.5-1.0) |
| `food_extract_is_food_or_drink_re` | bool | FoodExtract-v2 verdict on the recaptioned text (`re_caption`); True if it describes food or drink |
| `siglip2_label` | string | SigLIP2 predicted label (from top-scoring prompt) |
| `siglip2_is_food_or_drink` | bool | True if SigLIP2 says food/drink |
| `siglip2_score` | float32 | Similarity score of the winning prompt |
| `siglip2_task` | string | VLM task tag of winning prompt |
| `siglip2_top_prompt` | string | The winning prompt text |
| `siglip2_food_score` | float32 | Max similarity across all 44 food prompts |
| `siglip2_not_food_score` | float32 | Max similarity across all 48 not-food prompts |
| `siglip2_top5_prompts` | string (JSON) | Top 5 prompts with scores |
| `text_siglip2_agreement` | bool | text_label matches siglip2_label |
| `food_extract_siglip2_agreement` | bool | FoodExtract matches SigLIP2 |
| `quality_tier` | string | gold / silver / bronze (signal agreement) |
| `siglip2_embedding` | list[float32] | 1,152-dim normalized image embedding |
| `shard_id` | int32 | Source webdataset shard number |
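Since `sha256` is described as a content hash for deduplication, one way to use it is a streaming filter that keeps the first row per hash. This is a sketch with a hypothetical helper name; whether the dataset actually contains duplicate hashes is not stated in this card.

```python
from typing import Iterable, Iterator, Mapping


def dedupe_by_sha256(rows: Iterable[Mapping]) -> Iterator[Mapping]:
    """Yield rows in order, skipping any whose `sha256` was already seen."""
    seen = set()
    for row in rows:
        digest = row["sha256"]
        if digest in seen:
            continue
        seen.add(digest)
        yield row


# Made-up rows; the second and third share a hash.
rows = [
    {"key": "a", "sha256": "111"},
    {"key": "b", "sha256": "222"},
    {"key": "c", "sha256": "222"},
]
unique_keys = [row["key"] for row in dedupe_by_sha256(rows)]
print(unique_keys)  # ['a', 'b']
```

Because the helper is a generator, it works on a streamed split without materializing it. The `seen` set grows with the number of unique hashes, which is manageable at a few million rows.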
## Usage examples
```python
import numpy as np
from datasets import load_dataset

# Stream the dataset (no full download needed)
ds = load_dataset("mrdbourke/DataComp-1B-food-and-drink-3M", streaming=True)

# High-confidence food images only (gold tier)
gold_food = ds["train"].filter(
    lambda x: x["quality_tier"] == "gold" and x["siglip2_is_food_or_drink"]
)

# Get nutrition panel images for VLM training
panels = ds["train"].filter(
    lambda x: x["siglip2_task"] == "task5_nutrition_panel"
)

# Get all menu images
menus = ds["train"].filter(
    lambda x: x["siglip2_task"] == "task8_menu"
)

# Access embeddings for FAISS similarity search
embeddings = []
for row in ds["train"].take(1000):
    embeddings.append(row["siglip2_embedding"])
embeddings = np.array(embeddings, dtype=np.float32)  # shape: (1000, 1152)

# Find disagreement cases (interesting edge cases)
disagree = ds["train"].filter(
    lambda x: not x["text_siglip2_agreement"]
)
```
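Because the stored embeddings are described as normalized, a dot product between two of them equals their cosine similarity, so a brute-force nearest-neighbour search needs only NumPy. The sketch below uses random unit vectors as stand-ins for real `siglip2_embedding` rows; the function name `top_k_similar` is hypothetical.

```python
import numpy as np


def top_k_similar(embeddings: np.ndarray, query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k rows most similar to `query`, best first.

    Assumes every row of `embeddings` and `query` has unit length, so the
    dot product is the cosine similarity.
    """
    scores = embeddings @ query
    return np.argsort(-scores)[:k]


# Random unit vectors standing in for real 1,152-dim image embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.standard_normal((100, 1152), dtype=np.float32)
fake_embeddings /= np.linalg.norm(fake_embeddings, axis=1, keepdims=True)

# Querying with row 7 should return row 7 itself as the best match.
neighbours = top_k_similar(fake_embeddings, fake_embeddings[7], k=3)
print(neighbours[0])  # 7
```

For the full 3.1M rows, the same inner-product search can be handed to an exact FAISS index such as `faiss.IndexFlatIP`, which avoids sorting every score per query.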
## Ethical considerations
- **Opt-out respected**: Images were downloaded using [img2dataset](https://github.com/rom1504/img2dataset), which respects `X-Robots-Tag: noai`, `noimageai`, and `noimageindex` HTTP headers by default. Images from servers that opted out of AI training were excluded.
- **Source attribution**: Every image retains its original `url` and `sha256` for provenance tracking.
- **Web crawl data**: Images originate from public web pages indexed in [DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B). The dataset inherits any biases present in web crawl data.
## Hardware
All processing was done on a single machine:
- NVIDIA RTX 4090 (24GB VRAM)
- Intel i9-14900KF
- 94GB RAM
- Ubuntu Linux
Total compute for this dataset: ~9.3 hours for SigLIP2 inference, ~12 hours for image download.
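These figures imply rough throughput numbers, worked out below. The inputs are the approximate values quoted in this card, so the results are estimates only. The last line is a back-of-the-envelope projection that assumes the student text classifier sustained its quoted ~4,900 rows/sec across all 1B captions, which the card does not state.

```python
TOTAL_IMAGES = 3_108_047        # images in this dataset
SIGLIP2_HOURS = 9.3             # approximate, from the text
DOWNLOAD_HOURS = 12             # approximate, from the text
TEXT_ROWS = 1_000_000_000       # captions classified by the student model
TEXT_ROWS_PER_SEC = 4_900       # quoted student throughput on an RTX 4090

siglip2_rate = TOTAL_IMAGES / (SIGLIP2_HOURS * 3600)
download_rate = TOTAL_IMAGES / (DOWNLOAD_HOURS * 3600)
# Assumption: the quoted rate held for the full pass over 1B rows.
student_hours = TEXT_ROWS / TEXT_ROWS_PER_SEC / 3600

print(f"SigLIP2 inference: ~{siglip2_rate:.0f} images/sec")              # ~93
print(f"Image download:    ~{download_rate:.0f} successful images/sec")  # ~72
print(f"Text classifier:   ~{student_hours:.0f} hours for 1B rows")      # ~57
```

The download rate counts only successfully retrieved images; failed URL attempts also consumed part of those ~12 hours.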
## Prompt taxonomy
SigLIP2 classification used 92 zero-shot prompts (44 food, 48 not-food). The full list with task mappings is in `prompt_taxonomy.json`.
Key design decisions:
- Food prompts cover all Nutrify VLM tasks: plates of food, drinks, nutrition panels, menus, recipes, ingredients, food packaging/products
- Not-food prompts target common false positives: cleaning products, cosmetics, automotive fluids, pet supplies, empty containers, kitchenware
- Negative prompts avoid food-related words that could confuse SigLIP2's semantic similarity (e.g. "utensils on a shelf" instead of "cooking utensils without food")
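The last design decision lends itself to a quick automated check: scan the not-food prompts for food-related words before running classification. The sketch below is one way to do that. The word list is hypothetical (the card names only "kitchen" and "cooking" as examples), and the function name is illustrative.

```python
import re

# Hypothetical word list; extend it to match your own prompt set.
FOOD_RELATED_WORDS = frozenset({"kitchen", "cooking", "food"})


def leaky_negative_prompts(prompts, banned=FOOD_RELATED_WORDS):
    """Return the not-food prompts that contain a food-related word."""
    leaky = []
    for prompt in prompts:
        words = set(re.findall(r"[a-z]+", prompt.lower()))
        if words & banned:
            leaky.append(prompt)
    return leaky


# The two phrasings contrasted in the design notes above.
negatives = ["utensils on a shelf", "cooking utensils without food"]
print(leaky_negative_prompts(negatives))  # ['cooking utensils without food']
```

Matching on whole words rather than substrings keeps a word like "seafoods" or "foodie" from being flagged by accident; whether that is the right trade-off depends on the prompt set.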
## Related resources
| Resource | Description |
|----------|-------------|
| [mrdbourke/Recap-DataComp-1B-FoodOrDrink](https://huggingface.co/datasets/mrdbourke/Recap-DataComp-1B-FoodOrDrink) | 106M text-filtered rows (parent dataset) |
| [mrdbourke/food-drink-items-1B](https://huggingface.co/datasets/mrdbourke/food-drink-items-1B) | 293K enriched food/drink items with categories |
| [mrdbourke/ettin-150m-food-or-drink-classifier](https://huggingface.co/mrdbourke/ettin-150m-food-or-drink-classifier) | Text classifier used for Stage 5 |
| [mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2](https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2) | FoodExtract model used for Stage 7 |
| [google/siglip2-so400m-patch16-512](https://huggingface.co/google/siglip2-so400m-patch16-512) | Vision model used for Stage 11 |
| [nutrify.app](https://nutrify.app) | The food tracking app this pipeline supports |
## Citation
```bibtex
@dataset{bourke2026datacomp1b_food_drink,
author = {Bourke, Daniel},
title = {DataComp-1B Food and Drink 3M},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/mrdbourke/DataComp-1B-food-and-drink-3M}
}
```
## License
We distribute our annotations, labels, embeddings, and metadata under a [Creative Commons CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license. The individual images are under their own copyrights. The original image URL-text samples and metadata were released by DataComp under CC-BY-4.0 ([source](https://huggingface.co/datasets/mlfoundations/datacomp_1b)).
By using this dataset, you assume all risks related to the use of the images, including but not limited to copyright limitations accompanying such content. Each image retains its original `url` and `sha256` for provenance tracking.
Provider: mrdbourke



