
mrdbourke/Recap-DataComp-1B-FoodOrDrink

Source: Hugging Face | Updated: 2026-03-20 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/mrdbourke/Recap-DataComp-1B-FoodOrDrink
Resource description:
---
dataset_info:
  features:
  - name: label
    dtype: string
  - name: score
    dtype: float64
  - name: url
    dtype: string
  - name: re_caption
    dtype: string
  - name: org_caption
    dtype: string
  - name: sha256
    dtype: string
  - name: key
    dtype: string
  - name: re_clip_score
    dtype: float64
  - name: org_clip_score
    dtype: float64
  - name: re_length
    dtype: int64
  - name: org_length
    dtype: int64
  - name: re_gpt4v_score
    dtype: int64
  - name: org_gpt4v_score
    dtype: int64
  - name: re_caption_condition_diverse_topk
    dtype: string
  - name: re_condition_length
    dtype: int64
  - name: food_extract_is_food_or_drink_re
    dtype: bool
  - name: food_extract_raw_re
    dtype: string
  - name: food_extract_json_re
    dtype: string
  - name: food_extract_is_food_or_drink_org
    dtype: bool
  - name: food_extract_raw_org
    dtype: string
  - name: food_extract_json_org
    dtype: string
  splits:
  - name: train
    num_examples: 106230157
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train/*.parquet
license: apache-2.0
task_categories:
- text-classification
- image-classification
- zero-shot-classification
language:
- en
tags:
- food
- drink
- food-classification
- food-extraction
- datacomp
- image-text
- caption-classification
- filtered-dataset
size_categories:
- 100M<n<1B
source_datasets:
- UCSC-VLAA/Recap-DataComp-1B
---

# Recap-DataComp-1B: Food or Drink

A filtered subset of [Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) containing **106,230,157 rows** classified as food/drink content, enriched with structured food/drink extraction from [FoodExtract-v2](https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2).

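As a quick sanity check when working with individual rows, the feature list declared in the metadata can be mirrored as a plain Python mapping. The sketch below is illustrative only: `EXPECTED_FEATURES` and `check_row` are hypothetical names, the dtype-to-Python-type mapping (`float64` to `float`, `int64` to `int`) is an assumption about how rows are decoded, and treating `None` as acceptable in every column is also an assumption (the card only documents nulls for failed FoodExtract runs).

```python
from typing import Any, Mapping

# Column names and dtypes copied from the dataset_info block.
# The Python types are an assumed decoding of the declared dtypes.
EXPECTED_FEATURES: dict[str, type] = {
    "label": str,
    "score": float,
    "url": str,
    "re_caption": str,
    "org_caption": str,
    "sha256": str,
    "key": str,
    "re_clip_score": float,
    "org_clip_score": float,
    "re_length": int,
    "org_length": int,
    "re_gpt4v_score": int,
    "org_gpt4v_score": int,
    "re_caption_condition_diverse_topk": str,
    "re_condition_length": int,
    "food_extract_is_food_or_drink_re": bool,
    "food_extract_raw_re": str,
    "food_extract_json_re": str,
    "food_extract_is_food_or_drink_org": bool,
    "food_extract_raw_org": str,
    "food_extract_json_org": str,
}


def check_row(row: Mapping[str, Any]) -> list[str]:
    """Return a list of problems found in one row (empty list means OK)."""
    problems: list[str] = []
    for name, expected in EXPECTED_FEATURES.items():
        if name not in row:
            problems.append(f"missing column: {name}")
            continue
        value = row[name]
        if value is None:
            # Assumption: nulls are tolerated everywhere, not only in the
            # FoodExtract columns where the card documents them.
            continue
        if not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems
```

A helper like this can be applied to the first few streamed rows before starting a long filtering job, so that a schema mismatch surfaces early.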
## Overview

| Metric | Count | Percentage |
|---|---|---|
| **Total rows** | 106,230,157 | 100% |
| **Food/drink** (Stage 5 label) | 96,618,895 | 91.0% |
| **Not food/drink** (Stage 5 label) | 9,611,262 | 9.0% |
| **FoodExtract (re_caption): food/drink** | 79,519,489 | 74.9% |
| **FoodExtract (re_caption): not food/drink** | 26,710,156 | 25.1% |
| **FoodExtract (re_caption): null (failed)** | 512 | 0.00% |
| **Label vs FoodExtract (re_caption) agree** | 88,114,651 | 82.9% |
| **Label vs FoodExtract (re_caption) disagree** | 18,114,994 | 17.1% |
| **FoodExtract (org_caption): food/drink** | 67,807,188 | 63.8% |
| **FoodExtract (org_caption): not food/drink** | 38,409,170 | 36.2% |
| **FoodExtract (org_caption): null (failed)** | 13,799 | 0.01% |
| **Source shards processed** | 4,627 / 4,627 | 100.0% |
| **Files** | 4,627 parquet files | |
| **Size on disk** | 59.43 GB | |

## How it was made

This dataset was created through a multi-stage knowledge distillation and enrichment pipeline:

1. **Teacher labeling** (Stage 1): [ModernBERT-large-zeroshot-v2.0](https://huggingface.co/MoritzLaurer/ModernBERT-large-zeroshot-v2.0) (400M params) classified 10M captions from Recap-DataComp-1B as `food or drink` / `not food or drink` using zero-shot NLI
2. **Training data** (Stage 2): The labeled 10M rows were balanced 50/50 and uploaded as [mrdbourke/food-or-drink-10m](https://huggingface.co/datasets/mrdbourke/food-or-drink-10m) (1.57M rows)
3. **Student fine-tuning** (Stage 3): [jhu-clsp/ettin-encoder-150m](https://huggingface.co/jhu-clsp/ettin-encoder-150m) was fine-tuned on the training data + [mrdbourke/FoodExtract-135k](https://huggingface.co/datasets/mrdbourke/FoodExtract-135k) to create [mrdbourke/ettin-150m-food-or-drink-classifier](https://huggingface.co/mrdbourke/ettin-150m-food-or-drink-classifier) (94.6% accuracy, 0.9475 F1)
4. **Full inference** (Stage 5): The fine-tuned classifier processed all **1 billion rows** of Recap-DataComp-1B (`condition_diverse_topk` subset) at ~4,900 rows/s on an RTX 4090
5. **Filtering** (Stage 5): All food/drink rows saved (100%) + 10% random sample of not-food/drink rows
6. **FoodExtract enrichment** (Stage 7): Every row was processed through [FoodExtract-v2](https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2) (Gemma 3 270M fine-tuned on 135K samples) to extract structured food/drink items, tags, and an independent food/drink classification

### Pipeline diagram

```
Recap-DataComp-1B (1B rows, condition_diverse_topk subset)
    │
    ├── 10M sample ──→ ModernBERT-large zero-shot NLI (teacher)
    │                        │
    │                        ▼
    │                  food-or-drink-10m (1.57M balanced labels)
    │                        │
    │                        ▼
    │                  Fine-tune Ettin-encoder-150m (student)
    │                        │
    ▼                        ▼
Full 1B rows ──→ Ettin classifier (4,900 rows/s, fp16)
                        │
                        ▼
              ~110M classified rows (label + score)
                        │
                        ▼
              FoodExtract-v2 (Gemma 3 270M, ~1,300 rows/s)
                        │
                        ▼
              This dataset (106.2M rows, enriched)
```

## FoodExtract enrichment (Stage 7 + 7.5)

After the initial text classification (Stage 5), every row was processed through [FoodExtract-gemma-3-270m-fine-tune-v2](https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2), a 270M parameter language model fine-tuned on [mrdbourke/FoodExtract-135k](https://huggingface.co/datasets/mrdbourke/FoodExtract-135k) (135K samples labeled by `gpt-oss-120b`).

FoodExtract does three things for each caption:

1. **Re-classifies** the text as food/drink or not (independent of the Stage 5 label)
2. **Tags** the text with category labels (e.g. `fi` = food items, `di` = drink items, `re` = recipe, `me` = menu, `il` = ingredient list, `np` = nutrition panel, `fa` = food advertisement, `fp` = food packaging)
3. **Extracts** specific food and drink item names as structured lists

FoodExtract was run on **both** caption types:

- **`re_caption`** (AI-generated detailed captions from LLaVA) → `_re` suffix columns. Produces generic visual descriptions (e.g. "meat", "vegetables", "sauce")
- **`org_caption`** (original web-crawled alt-text) → `_org` suffix columns. Produces specific named items (e.g. "wagyu ribeye", "pad thai", "marinara")

Using both gives complementary coverage: `re_caption` for visual content and `org_caption` for specific named items. This enables fine-grained filtering beyond binary classification, for example finding all rows that contain recipes, or searching for specific ingredients across the dataset.

### FoodExtract (re_caption) vs Stage 5 label agreement

| Outcome | Count | Percentage |
|---|---|---|
| **Agree** (both say same) | 88,114,651 | 82.9% |
| **Disagree** (different) | 18,114,994 | 17.1% |
| label=food, FoodExtract=not food | 17,606,960 | 16.6% |
| label=not_food, FoodExtract=food | 508,034 | 0.5% |

The `label=food, FoodExtract=not food` rows are Stage 5 false positives: captions that mention food-adjacent items (jars, packaging, kitchenware) but don't actually describe food or drink. These can be filtered out using `food_extract_is_food_or_drink_re`.

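The agreement breakdown above can be reproduced on any sample of rows with a small cross-tabulation. The following is a minimal sketch, not the author's pipeline code: `agreement_counts` and the bucket names are hypothetical, and it assumes the Stage 5 label strings are `food_or_drink` / `not_food_or_drink` and that a failed FoodExtract run is stored as `None`.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def agreement_counts(rows: Iterable[Mapping[str, Any]]) -> Counter:
    """Cross-tabulate the Stage 5 label against the FoodExtract re_caption flag."""
    counts: Counter = Counter()
    for row in rows:
        fe_is_food = row.get("food_extract_is_food_or_drink_re")
        if fe_is_food is None:
            # FoodExtract failed for this row; it cannot agree or disagree.
            counts["foodextract_null"] += 1
            continue
        label_is_food = row["label"] == "food_or_drink"
        if label_is_food == fe_is_food:
            counts["agree"] += 1
        elif label_is_food:
            counts["label_food_fe_not"] += 1
        else:
            counts["label_not_fe_food"] += 1
    return counts


# Consistency check on the published figures: agree + disagree + null
# should equal the total row count, and the two disagreement buckets
# should sum to the disagreement total.
assert 88_114_651 + 18_114_994 + 512 == 106_230_157
assert 17_606_960 + 508_034 == 18_114_994
```

Because the function accepts any iterable of row mappings, it can be run over a bounded slice of a streamed dataset rather than the full 106M rows.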
### Tag dictionary

| Tag | Meaning |
|---|---|
| `fi` | Food items |
| `di` | Drink items |
| `re` | Recipe |
| `me` | Menu |
| `il` | Ingredient list |
| `np` | Nutrition panel |
| `fa` | Food advertisement |
| `fp` | Food packaging |

### FoodExtract JSON schema

Both `food_extract_json_re` and `food_extract_json_org` contain a JSON string with this structure:

```json
{
  "is_food_or_drink": true,
  "tags": ["fi", "di"],
  "food_items": ["bacon", "eggs", "toast"],
  "drink_items": ["orange juice"]
}
```

## Usage

```python
import json

from datasets import load_dataset

# Stream the dataset (recommended for this size)
ds_stream = load_dataset("mrdbourke/Recap-DataComp-1B-FoodOrDrink", split="train", streaming=True)
for row in ds_stream:
    print(row["label"], row["score"], row["re_caption"][:80])
    break

# Load the full dataset (~59.43 GB on disk)
ds = load_dataset("mrdbourke/Recap-DataComp-1B-FoodOrDrink", split="train")
print(f"Total rows: {len(ds):,}")

# Filter to food/drink only (Stage 5 label)
food = ds.filter(lambda x: x["label"] == "food_or_drink")

# Filter by confidence (recommended for high-precision applications)
high_conf_food = ds.filter(lambda x: x["label"] == "food_or_drink" and x["score"] >= 0.95)

# Get the not-food/drink evaluation sample
not_food = ds.filter(lambda x: x["label"] == "not_food_or_drink")

# Use FoodExtract columns for fine-grained filtering.
# Filter to rows where BOTH classifiers agree it's food (re_caption)
high_conf = ds.filter(
    lambda x: x["label"] == "food_or_drink" and x["food_extract_is_food_or_drink_re"] is True
)

# Extract specific food items from re_caption.
# The JSON column can be null when extraction failed, so fall back to "{}".
row = ds[0]
fe_re = json.loads(row["food_extract_json_re"] or "{}")
print(f"Foods (re_caption): {fe_re.get('food_items', [])}")
print(f"Drinks (re_caption): {fe_re.get('drink_items', [])}")

# Extract specific food items from org_caption (often more specific names)
fe_org = json.loads(row["food_extract_json_org"] or "{}")
print(f"Foods (org_caption): {fe_org.get('food_items', [])}")
print(f"Drinks (org_caption): {fe_org.get('drink_items', [])}")

# Combine both caption types for maximum coverage
all_foods = set(fe_re.get("food_items", []) + fe_org.get("food_items", []))

# Find all rows tagged as recipes (substring match on the raw JSON string)
recipes = ds.filter(lambda x: '"re"' in (x["food_extract_json_re"] or ""))

# Remove Stage 5 false positives (label=food but FoodExtract says no)
cleaned = ds.filter(lambda x: x["food_extract_is_food_or_drink_re"] is True)
```

## Fields

### Stage 5 columns (text classification)

| Field | Type | Description |
|---|---|---|
| `label` | string | `food_or_drink` or `not_food_or_drink`, from the Stage 5 Ettin classifier |
| `score` | float | Classifier confidence (0.5–1.0) |
| `url` | string | Original image URL from DataComp-1B |
| `re_caption` | string | AI-generated detailed caption (LLaVA-1.5-LLaMA3-8B) |
| `org_caption` | string | Original web-crawled caption (often noisy alt-text) |
| `re_caption_condition_diverse_topk` | string | Condition-diverse caption variant (v2) |
| `sha256` | string | Image content hash |
| `key` | string | Row key from source dataset |
| `re_clip_score` / `org_clip_score` | float | CLIP alignment scores for re/original captions |
| `re_length` / `org_length` | int | Caption token lengths |
| `re_gpt4v_score` / `org_gpt4v_score` | int | GPT-4V quality scores |
| `re_condition_length` | int | Condition caption token length |

### Stage 7 columns (FoodExtract enrichment)

| Field | Type | Description |
|---|---|---|
| `food_extract_is_food_or_drink_re` | bool | FoodExtract classification on `re_caption`: `true` if food/drink, `false` otherwise, `null` on failure |
| `food_extract_raw_re` | string | Raw condensed FoodExtract output from `re_caption` |
| `food_extract_json_re` | string | Parsed JSON from `re_caption` with keys: `is_food_or_drink`, `tags`, `food_items`, `drink_items` |
| `food_extract_is_food_or_drink_org` | bool | FoodExtract classification on `org_caption`: `true` if food/drink, `false` otherwise, `null` on failure |
| `food_extract_raw_org` | string | Raw condensed FoodExtract output from `org_caption` |
| `food_extract_json_org` | string | Parsed JSON from `org_caption` with keys: `is_food_or_drink`, `tags`, `food_items`, `drink_items` |

## Confidence distribution (Stage 5 score)

| Score range | Count | Percentage |
|---|---|---|
| 0.50-0.60 | 3,842,281 | 3.6% |
| 0.60-0.70 | 4,173,921 | 3.9% |
| 0.70-0.80 | 5,070,591 | 4.8% |
| 0.80-0.90 | 7,534,240 | 7.1% |
| 0.90-0.95 | 6,681,439 | 6.3% |
| 0.95-1.00 | 78,927,685 | 74.3% |
| **Average** | **0.9396** | |

Higher scores indicate more confident predictions. Recommendations:

- **General use:** `score >= 0.80` (good balance of coverage and precision)
- **High precision:** `score >= 0.95` (removes most edge cases)
- **Maximum recall:** no filter (includes borderline cases)
- **Best quality:** combine `score >= 0.95` with `food_extract_is_food_or_drink_re == True`

## Model details

### Stage 5: Text classifier

| Property | Value |
|---|---|
| **Model** | [mrdbourke/ettin-150m-food-or-drink-classifier](https://huggingface.co/mrdbourke/ettin-150m-food-or-drink-classifier) |
| **Architecture** | Ettin-encoder-150m (ModernBERT-based, 150M params) |
| **Training data** | [mrdbourke/food-or-drink-10m](https://huggingface.co/datasets/mrdbourke/food-or-drink-10m) + [mrdbourke/FoodExtract-135k](https://huggingface.co/datasets/mrdbourke/FoodExtract-135k) |
| **Inference** | fp16, ~4,900 rows/s (NVIDIA RTX 4090) |
| **Test accuracy** | 94.6% |
| **Test F1** | 0.9475 |

### Stage 7: FoodExtract

| Property | Value |
|---|---|
| **Model** | [mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2](https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2) |
| **Architecture** | Gemma 3 270M (fine-tuned with SFT via TRL) |
| **Training data** | [mrdbourke/FoodExtract-135k](https://huggingface.co/datasets/mrdbourke/FoodExtract-135k) (135K samples from gpt-oss-120b) |
| **Inference** | vLLM server, ~1,300 rows/s (NVIDIA RTX 4090) |
| **Capabilities** | Binary classification, category tagging, food/drink item extraction |

## What's classified as food or drink?

The classifier detects captions describing:

- **Food**: meals, dishes, ingredients, recipes, snacks, desserts, baked goods, produce, meat, seafood
- **Drinks**: coffee, tea, juice, wine, beer, cocktails, smoothies, soda, water, milk
- **Food scenes**: restaurant tables, grocery stores, kitchen cooking, food photography, menus

Common edge cases that may be included:

- Food packaging and product labels
- Food-themed art and illustrations
- Kitchenware and dining settings (plates, cups, teapots)

The FoodExtract columns help identify and filter these edge cases: rows where `label == "food_or_drink"` but `food_extract_is_food_or_drink_re == False` are likely false positives from Stage 5.

## Limitations

- Stage 5 labels are from an automated classifier (~95% accuracy), not human-annotated
- FoodExtract classifications are from a 270M parameter model and may contain errors
- ~5% of Stage 5 rows may be misclassified (food-shaped objects, food packaging, food-themed art)
- The `not_food_or_drink` rows are a 10% random sample, not the complete set
- Image URLs may become stale over time (images removed from source)
- Captions describe images but have not been verified for factual accuracy
- Food item extraction may miss items or include non-food items in complex captions

## Source & License

- **Source dataset**: [UCSC-VLAA/Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) (`condition_diverse_topk` subset)
- **Stage 5 classifier**: [mrdbourke/ettin-150m-food-or-drink-classifier](https://huggingface.co/mrdbourke/ettin-150m-food-or-drink-classifier)
- **Stage 7 FoodExtract model**: [mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2](https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v2)
- **Training datasets**: [mrdbourke/food-or-drink-10m](https://huggingface.co/datasets/mrdbourke/food-or-drink-10m), [mrdbourke/FoodExtract-135k](https://huggingface.co/datasets/mrdbourke/FoodExtract-135k)
- **License**: Apache 2.0 (consistent with source dataset and all models used)

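The tag dictionary, JSON schema, and score recommendations described in this card can be combined into small row-level predicates. The sketch below is one possible way to do that under stated assumptions: `parse_food_extract`, `has_tag`, and `is_best_quality` are hypothetical helper names, and the fallback to an empty record for null or unparseable JSON is a design choice of this example rather than documented behaviour of the dataset.

```python
import json
from typing import Any, Mapping, Optional

# Tag codes copied from the tag dictionary in this card.
TAGS = {
    "fi": "Food items",
    "di": "Drink items",
    "re": "Recipe",
    "me": "Menu",
    "il": "Ingredient list",
    "np": "Nutrition panel",
    "fa": "Food advertisement",
    "fp": "Food packaging",
}


def parse_food_extract(raw: Optional[str]) -> dict:
    """Parse a food_extract_json_* string into a dict with all four keys present.

    Null, empty, or invalid input yields an empty record (assumed fallback).
    """
    record: dict = {
        "is_food_or_drink": None,
        "tags": [],
        "food_items": [],
        "drink_items": [],
    }
    if not raw:
        return record
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return record
    if isinstance(parsed, dict):
        for key in record:
            if parsed.get(key) is not None:
                record[key] = parsed[key]
    return record


def has_tag(row: Mapping[str, Any], tag: str, suffix: str = "re") -> bool:
    """True if the FoodExtract output for the given caption type carries `tag`."""
    if tag not in TAGS:
        raise ValueError(f"unknown tag: {tag!r}")
    return tag in parse_food_extract(row.get(f"food_extract_json_{suffix}"))["tags"]


def is_best_quality(row: Mapping[str, Any], min_score: float = 0.95) -> bool:
    """Apply the 'best quality' recommendation: high score AND FoodExtract agrees."""
    return (
        row["label"] == "food_or_drink"
        and row["score"] >= min_score
        and row.get("food_extract_is_food_or_drink_re") is True
    )
```

Parsing the `tags` list is slower than a substring match on the raw JSON string, but it avoids matching a tag code that happens to appear inside an extracted item name. Either predicate can be passed to a dataset `filter` call as `lambda x: has_tag(x, "re")` or `is_best_quality`.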
Provider:
mrdbourke