mrdbourke/DataComp-1B-food-and-drink-3M

Name: mrdbourke/DataComp-1B-food-and-drink-3M
Creator: mrdbourke
Published: 2026-03-25 08:17:36
License: (no description provided)

Hugging Face · Updated 2026-03-25 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/mrdbourke/DataComp-1B-food-and-drink-3M

Resource overview:

---
dataset_info:
  features:
  - name: image
    dtype: image
  - name: key
    dtype: string
  - name: url
    dtype: string
  - name: caption
    dtype: string
  - name: sha256
    dtype: string
  - name: image_width
    dtype: int32
  - name: image_height
    dtype: int32
  - name: text_label
    dtype: string
  - name: text_score
    dtype: float32
  - name: food_extract_is_food_or_drink_re
    dtype: bool
  - name: siglip2_label
    dtype: string
  - name: siglip2_is_food_or_drink
    dtype: bool
  - name: siglip2_score
    dtype: float32
  - name: siglip2_task
    dtype: string
  - name: siglip2_top_prompt
    dtype: string
  - name: siglip2_food_score
    dtype: float32
  - name: siglip2_not_food_score
    dtype: float32
  - name: siglip2_top5_prompts
    dtype: string
  - name: text_siglip2_agreement
    dtype: bool
  - name: food_extract_siglip2_agreement
    dtype: bool
  - name: quality_tier
    dtype: string
  - name: siglip2_embedding
    sequence: float32
  - name: shard_id
    dtype: int32
configs:
- config_name: default
  data_files:
  - split: train
    path: data/*.parquet
license: cc-by-4.0
task_categories:
- image-classification
- zero-shot-image-classification
tags:
- food
- drink
- nutrition
- siglip2
- datacomp
- food-classification
- embeddings
- nutrify
size_categories:
- 1M<n<10M
---

# DataComp-1B Food and Drink 3M

3,108,047 food and not-food images extracted from [Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B), each classified by **three independent signals** and accompanied by [SigLIP2](https://huggingface.co/google/siglip2-so400m-patch16-512) embeddings (1,152-dim).

Built for training food/drink classifiers, building FAISS search indices, and as a foundation for the [Nutrify VLM](https://nutrify.app), an on-device vision-language model for nutrition tracking.

## How this dataset was made

### The problem

[Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) contains **1 billion** image-caption pairs from web crawls. Somewhere in there are tens of millions of food and drink images, but which ones? There are no labels, the captions are noisy, and downloading all 1B images to check visually isn't practical.

### The approach: knowledge distillation + multi-signal agreement

We used a cascading pipeline in which each stage is fast and cheap, progressively filtering 1B rows down to ~3M images with verified food / not-food labels:

```
1B captions ──→ text classifier (fast) ──→ 106M food rows
                                                  │
                            FoodExtract (LLM) ────┘
                                                  │
                                 5M URLs sampled ─┘
                                                  │
                       3.1M images downloaded ────┘
                                                  │
                   SigLIP2 zero-shot (92 prompts) ┘
                                                  │
                       3 signals compared ────────┘
                                                  │
                             this dataset ────────┘
```

### Stage-by-stage breakdown

**Stage 1-4: Text classification (1B → 106M rows)**

A large zero-shot text classifier ([ModernBERT-large-zeroshot](https://huggingface.co/MoritzLaurer/ModernBERT-large-zeroshot-v2.0), 400M params) labeled 10M sample captions as "food or drink" / "not food or drink". A smaller student model ([ettin-encoder-150m](https://huggingface.co/jhu-clsp/ettin-encoder-150m)) was fine-tuned on those labels, achieving 94.6% accuracy at ~4,900 rows/sec on an RTX 4090. The student then classified all 1 billion rows, producing 106M candidate food/drink rows.

**Stage 7: FoodExtract (second signal)**

A fine-tuned [Gemma 3 270M model](https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2) performed structured food/drink extraction on each caption, independently determining whether the caption describes food or drink and extracting specific items. This gave us a second, independent signal for each of the 106M rows.

**Stage 10: Image download (106M → 3.1M images)**

We sampled 5M URLs (2.5M high-confidence food + 2.5M not-food) from the 106M rows, prioritizing rows where both the text classifier (score ≥ 0.95) and FoodExtract agreed. Images were downloaded at 512px using [img2dataset](https://github.com/rom1504/img2dataset) with its default opt-out handling (URLs whose responses carried `X-Robots-Tag: noai` or `noimageai` were excluded). At a 61.7% URL success rate (typical for years-old web crawl URLs), this yielded ~3.1M images.
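
The Stage 10 sampling rule can be sketched roughly as follows. This is a minimal illustration on toy data, not the pipeline's actual code: the helper names `is_high_confidence` and `sample_balanced` are hypothetical, the row fields reuse this dataset's column names, and treating "prioritizing" as "take high-confidence rows first, then fill with the rest" is an assumption.

```python
import random


def is_high_confidence(row, min_score=0.95):
    """True when both caption signals agree and the text classifier is confident."""
    text_says_food = row["text_label"] == "food_or_drink"
    signals_agree = text_says_food == row["food_extract_is_food_or_drink_re"]
    return signals_agree and row["text_score"] >= min_score


def sample_balanced(rows, per_class, min_score=0.95, seed=0):
    """Pick up to `per_class` rows per text label, high-confidence rows first."""
    rng = random.Random(seed)
    picked = []
    for label in ("food_or_drink", "not_food_or_drink"):
        pool = [r for r in rows if r["text_label"] == label]
        rng.shuffle(pool)
        # Stable sort: high-confidence rows move to the front,
        # and the shuffled order is kept within each group.
        pool.sort(key=lambda r: not is_high_confidence(r, min_score))
        picked.extend(pool[:per_class])
    return picked


def make_row(name, label, score, food_extract):
    """Build a toy caption-level row (example.com URLs are placeholders)."""
    return {
        "url": f"https://example.com/{name}.jpg",
        "text_label": label,
        "text_score": score,
        "food_extract_is_food_or_drink_re": food_extract,
    }


toy_rows = [
    make_row("a", "food_or_drink", 0.99, True),       # confident, signals agree
    make_row("b", "food_or_drink", 0.97, False),      # signals disagree
    make_row("c", "food_or_drink", 0.80, True),       # below the 0.95 threshold
    make_row("d", "not_food_or_drink", 0.98, False),  # confident, signals agree
    make_row("e", "not_food_or_drink", 0.60, False),  # below the 0.95 threshold
]

sampled_urls = [r["url"] for r in sample_balanced(toy_rows, per_class=1)]
print(sampled_urls)
```

In the real run `per_class` would be 2.5M per label; shuffling before the stable sort keeps the choice random within each confidence group.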

**Stage 11: SigLIP2 zero-shot classification (third signal)**

Every downloaded image was classified by [SigLIP2-so400m-patch16-512](https://huggingface.co/google/siglip2-so400m-patch16-512) (878M params) using **92 carefully designed zero-shot prompts**: 44 food/drink prompts and 48 not-food prompts. The prompts were crafted to:

- Cover all Nutrify VLM target tasks: general food photos, nutrition panels, ingredient lists, recipes, menus, food packaging
- Include hard negatives that commonly confuse food classifiers: cleaning products, bathroom products, automotive fluids, pet food, empty containers, cosmetics
- Avoid food-related words in negative prompts (e.g. "kitchen" or "cooking") that would pull food images toward the wrong class

Each image also had its **1,152-dimensional embedding** saved for downstream similarity search and FAISS indexing. See `prompt_taxonomy.json` in this repo for the full prompt list with task mappings.

**Stage 11.5: Multi-signal agreement**

With three independent classifications per image (text classifier on caption, FoodExtract on caption, SigLIP2 on image), we computed agreement. The `quality_tier` column reflects how many signals agree:

| Tier | Criteria | Count | % |
|------|----------|-------|---|
| gold | All 3 signals agree | 2,741,057 | 88.2% |
| silver | 2 of 3 agree | 366,990 | 11.8% |
| bronze | Max disagreement | 0 | 0.0% |

The **gold tier** rows (88.2%) are high-confidence training data where text understanding and visual understanding independently reached the same conclusion. Because each of the three signals is a binary food / not-food decision, at least two of them always agree, so the bronze tier is empty by construction.

### Why three signals?

Each signal has different strengths and failure modes:

| Signal | Strength | Weakness |
|--------|----------|----------|
| Text classifier | Fast, runs on captions (no images needed) | Misses visual context (a "jar" caption could be food or candles) |
| FoodExtract | Understands food items specifically | Also caption-only, misses when captions are wrong |
| SigLIP2 | Sees the actual image | Zero-shot = lower precision than fine-tuned, struggles with ambiguous images |

When all three agree, we're confident. When they disagree, those are the genuinely hard cases: food packaging that looks like cleaning products, food art that looks like paintings, blurry images of ambiguous objects. The `quality_tier` column lets you choose your confidence threshold.

## Key numbers

| Metric | Value |
|--------|-------|
| Total images | 3,108,047 |
| Food / drink (SigLIP2) | ~1.4M (45.0%) |
| Not food / drink (SigLIP2) | ~1.7M (55.0%) |
| Image resolution | 512px (shortest edge, keep aspect ratio) |
| Text↔SigLIP2 agreement | 90.2% |
| FoodExtract↔SigLIP2 agreement | 88.7% |
| Triple agreement (gold tier) | 88.2% |
| Embedding model | google/siglip2-so400m-patch16-512 |
| Embedding dim | 1,152 |

## Task distribution

The `siglip2_task` column pre-sorts food images by the type of content, mapped to [Nutrify VLM](https://nutrify.app) training tasks:

| Task tag | Description | Count | Use case |
|----------|-------------|-------|----------|
| task2_food_image | General food/drink photos | ~1.06M | Train food recognition |
| food_adjacent | Food packaging, grocery, cooking | ~257K | Product photo understanding |
| task6_ingredients_list | Ingredients on packaging | ~27K | Ingredient extraction |
| task7_recipe | Recipes in cookbooks/cards | ~24K | Recipe parsing |
| task8_menu | Restaurant menus | ~23K | Menu extraction |
| task5_nutrition_panel | Nutrition facts labels | ~4.7K | Nutrition panel reading |

## Column reference

| Column | Type | Description |
|--------|------|-------------|
| `image` | Image | The image (512px, loaded as PIL) |
| `key` | string | img2dataset sample key |
| `url` | string | Original source URL from DataComp-1B |
| `caption` | string | AI-generated caption (LLaVA-1.5-LLaMA3-8B) |
| `sha256` | string | Image content hash (for deduplication) |
| `image_width` | int32 | Downloaded image width in pixels |
| `image_height` | int32 | Downloaded image height in pixels |
| `text_label` | string | Stage 5 text classifier: food_or_drink / not_food_or_drink |
| `text_score` | float32 | Stage 5 classifier confidence (0.5-1.0) |
| `food_extract_is_food_or_drink_re` | bool | FoodExtract-v2 on re_caption |
| `siglip2_label` | string | SigLIP2 predicted label (from top-scoring prompt) |
| `siglip2_is_food_or_drink` | bool | True if SigLIP2 says food/drink |
| `siglip2_score` | float32 | Similarity score of the winning prompt |
| `siglip2_task` | string | VLM task tag of winning prompt |
| `siglip2_top_prompt` | string | The winning prompt text |
| `siglip2_food_score` | float32 | Max similarity across all 44 food prompts |
| `siglip2_not_food_score` | float32 | Max similarity across all 48 not-food prompts |
| `siglip2_top5_prompts` | string (JSON) | Top 5 prompts with scores |
| `text_siglip2_agreement` | bool | text_label matches siglip2_label |
| `food_extract_siglip2_agreement` | bool | FoodExtract matches SigLIP2 |
| `quality_tier` | string | gold / silver / bronze (signal agreement) |
| `siglip2_embedding` | list[float32] | 1,152-dim normalized image embedding |
| `shard_id` | int32 | Source webdataset shard number |

## Usage examples

```python
import numpy as np
from datasets import load_dataset

# Stream the dataset (no full download needed)
ds = load_dataset("mrdbourke/DataComp-1B-food-and-drink-3M", streaming=True)

# High-confidence food images only (gold tier)
gold_food = ds["train"].filter(
    lambda x: x["quality_tier"] == "gold" and x["siglip2_is_food_or_drink"]
)

# Get nutrition panel images for VLM training
panels = ds["train"].filter(
    lambda x: x["siglip2_task"] == "task5_nutrition_panel"
)

# Get all menu images
menus = ds["train"].filter(
    lambda x: x["siglip2_task"] == "task8_menu"
)

# Access embeddings for FAISS similarity search
embeddings = []
for row in ds["train"].take(1000):
    embeddings.append(row["siglip2_embedding"])
embeddings = np.array(embeddings)  # shape: (1000, 1152)

# Find disagreement cases (interesting edge cases)
disagree = ds["train"].filter(
    lambda x: not x["text_siglip2_agreement"]
)
```

## Ethical considerations

- **Opt-out respected**: Images were downloaded using [img2dataset](https://github.com/rom1504/img2dataset) which respects `X-Robots-Tag: noai`, `noimageai`, and `noimageindex` HTTP headers by default. Images from servers that opted out of AI training were excluded.
- **Source attribution**: Every image retains its original `url` and `sha256` for provenance tracking.
- **Web crawl data**: Images originate from public web pages indexed in [DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B). The dataset inherits any biases present in web crawl data.

## Hardware

All processing was done on a single machine:

- NVIDIA RTX 4090 (24GB VRAM)
- Intel i9-14900KF
- 94GB RAM
- Ubuntu Linux

Total compute for this dataset: ~9.3 hours for SigLIP2 inference, ~12 hours for image download.

## Prompt taxonomy

SigLIP2 classification used 92 zero-shot prompts (44 food, 48 not-food). The full list with task mappings is in `prompt_taxonomy.json`.

Key design decisions:

- Food prompts cover all Nutrify VLM tasks: plates of food, drinks, nutrition panels, menus, recipes, ingredients, food packaging/products
- Not-food prompts target common false positives: cleaning products, cosmetics, automotive fluids, pet supplies, empty containers, kitchenware
- Negative prompts avoid food-related words that could confuse SigLIP2's semantic similarity (e.g. "utensils on a shelf" instead of "cooking utensils without food")

## Related resources

| Resource | Description |
|----------|-------------|
| [mrdbourke/Recap-DataComp-1B-FoodOrDrink](https://huggingface.co/datasets/mrdbourke/Recap-DataComp-1B-FoodOrDrink) | 106M text-filtered rows (parent dataset) |
| [mrdbourke/food-drink-items-1B](https://huggingface.co/datasets/mrdbourke/food-drink-items-1B) | 293K enriched food/drink items with categories |
| [mrdbourke/ettin-150m-food-or-drink-classifier](https://huggingface.co/mrdbourke/ettin-150m-food-or-drink-classifier) | Text classifier used for Stage 5 |
| [mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2](https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2) | FoodExtract model used for Stage 7 |
| [google/siglip2-so400m-patch16-512](https://huggingface.co/google/siglip2-so400m-patch16-512) | Vision model used for Stage 11 |
| [nutrify.app](https://nutrify.app) | The food tracking app this pipeline supports |

## Citation

```bibtex
@dataset{bourke2026datacomp1b_food_drink,
  author    = {Bourke, Daniel},
  title     = {DataComp-1B Food and Drink 3M},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/mrdbourke/DataComp-1B-food-and-drink-3M}
}
```

## License

We distribute our annotations, labels, embeddings, and metadata under a [Creative Commons CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license. The individual images are under their own copyrights. The original image URL-text samples and metadata were released by DataComp under CC-BY-4.0 ([source](https://huggingface.co/datasets/mlfoundations/datacomp_1b)).

By using this dataset, you assume all risks related to the use of the images, including but not limited to copyright limitations accompanying such content. Each image retains its original `url` and `sha256` for provenance tracking.

Provider:

mrdbourke
