
mrdbourke/food-or-drink-10m

Source: Hugging Face. Updated 2026-03-04; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/mrdbourke/food-or-drink-10m
Description:
---
dataset_info:
  features:
  - name: url
    dtype: string
  - name: re_caption
    dtype: string
  - name: org_caption
    dtype: string
  - name: sha256
    dtype: string
  - name: key
    dtype: string
  - name: re_clip_score
    dtype: float64
  - name: org_clip_score
    dtype: float64
  - name: re_length
    dtype: int64
  - name: org_length
    dtype: int64
  - name: re_gpt4v_score
    dtype: int64
  - name: org_gpt4v_score
    dtype: int64
  - name: re_caption_condition_diverse_topk
    dtype: string
  - name: re_condition_length
    dtype: int64
  - name: label
    dtype:
      class_label:
        names:
          '0': food_or_drink
          '1': not_food_or_drink
  - name: score
    dtype: float64
  splits:
  - name: train
    num_examples: 1415619
  - name: test
    num_examples: 157292
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: test
    path: data/test-*.parquet
license: apache-2.0
task_categories:
- text-classification
- zero-shot-classification
language:
- en
tags:
- food
- drink
- food-classification
- caption-classification
- modernbert
- knowledge-distillation
size_categories:
- 1M<n<10M
---

# Food or Drink 10M

A balanced binary classification dataset for detecting **food or drink** content in image captions.

## Overview

| | Count |
|---|---|
| **Total rows** | 1,572,911 |
| **Train split** | 1,415,619 (90%) |
| **Test split** | 157,292 (10%) |
| **Food/drink rows** | 785,980 (50.0%) |
| **Not food/drink rows** | 786,931 (50.0%) |

## How it was made

1. **Source**: [UCSC-VLAA/Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) (`condition_diverse_topk` subset)
2. **Teacher model**: [MoritzLaurer/ModernBERT-large-zeroshot-v2.0](https://huggingface.co/MoritzLaurer/ModernBERT-large-zeroshot-v2.0), a 400M-parameter zero-shot NLI classifier
3. **Classification**: 10M captions streamed and classified as `food_or_drink` / `not_food_or_drink` using zero-shot NLI with candidate labels `["food or drink", "not food or drink"]`
4. **Balancing**: All food/drink rows saved (100%), not-food/drink rows dynamically sampled to match, resulting in a ~50/50 balanced dataset

## Labels

- **`food_or_drink`**: caption describes food, beverages, meals, ingredients, drinks, or food/drink items
- **`not_food_or_drink`**: caption describes anything else (objects, scenes, people, animals, etc.)

## Fields

| Field | Description |
|---|---|
| `url` | Original image URL from DataComp-1B |
| `re_caption` | AI-generated re-caption (detailed, descriptive) |
| `org_caption` | Original caption (often noisy alt-text) |
| `re_caption_condition_diverse_topk` | Condition-diverse re-caption variant |
| `label` | Classification label: `food_or_drink` or `not_food_or_drink` |
| `score` | Teacher model confidence score (0.5–1.0) |
| `sha256` | Image content hash |
| `key` | Row key from source dataset |
| `re_clip_score` / `org_clip_score` | CLIP alignment scores |
| `re_length` / `org_length` | Caption token lengths |
| `re_gpt4v_score` / `org_gpt4v_score` | GPT-4V quality scores |
| `re_condition_length` | Condition caption token length |

## Intended use

This dataset is designed for:

- **Knowledge distillation**: Fine-tuning smaller encoders (e.g., [Ettin-encoder-150m](https://huggingface.co/jhu-clsp/ettin-encoder-150m)) to replicate the teacher model's classification, then running the fine-tuned model on the full 1B-row dataset
- **Food/drink content filtering**: Filtering large-scale image-text datasets for food and drink related content
- **Caption classification research**: Studying food/drink detection in image captions

## Caption types

The dataset contains multiple caption styles per image, useful for training robust classifiers:

- **`re_caption`**: Clean, AI-generated descriptions (e.g., *"A glass of red wine next to a cheese board on a wooden table"*)
- **`org_caption`**: Original noisy alt-text (e.g., *"IMG_2847 wine night"*)
- **`re_caption_condition_diverse_topk`**: Longer, more detailed AI captions

## Example samples

### Food or drink

| label | score | re_caption | org_caption |
|---|---|---|---|
| `food_or_drink` | 0.9988 | A glass of minty chocolate latte with a straw is placed on a yellow surface next to a bag of coffee beans. | Mint Chocolate Coffee Frappe Recipe |
| `food_or_drink` | 0.9978 | A plate of food with a mix of lettuce, meat, and sauce. The plate is on a dining table with a glass of orange juice and a bowl of salad. | Summer lunch spread with fresh OJ |

### Not food or drink

| label | score | re_caption | org_caption |
|---|---|---|---|
| `not_food_or_drink` | 0.9998 | A row of identical figures in black suits and ties is standing in a line against a white background. | O 6-Car Flat End Offset Hopper Car Set, B&O 727022 |
| `not_food_or_drink` | 0.9997 | A white wainscoting panel with a decorative molding at the top and a plain lower section is attached to a wall. | How to Install Board and Batten Wainscoting (White Painted Square over Rectangle Pattern) |

## Usage

```python
from datasets import load_dataset

# Load the full dataset
ds = load_dataset("mrdbourke/food-or-drink-10m")

# Access splits
train = ds["train"]
test = ds["test"]
print(f"Train: {len(train):,} rows")
print(f"Test: {len(test):,} rows")

# Look at a sample
print(train[0])

# Filter to food/drink only
food_only = train.filter(lambda x: x["label"] == 0)  # 0 = food_or_drink
print(f"Food/drink rows: {len(food_only):,}")

# Stream instead of downloading
ds_stream = load_dataset("mrdbourke/food-or-drink-10m", split="train", streaming=True)
for row in ds_stream:
    print(row["re_caption"], row["label"])
    break
```

## Confidence scores

The `score` field contains the teacher model's confidence. Higher scores indicate more certain classifications:

| Score range | Meaning |
|---|---|
| 0.95–1.0 | Very confident |
| 0.80–0.95 | Confident |
| 0.60–0.80 | Moderate confidence |
| 0.50–0.60 | Low confidence (borderline) |

## License

Apache 2.0, same as the source dataset and teacher model.
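
## Example: filtering by confidence band

The score ranges in the confidence table can be turned into a small helper for dropping borderline teacher labels before training. The following is a minimal sketch, not part of the dataset's tooling: the names `confidence_band` and `keep_confident` are hypothetical, the 0.80 default threshold is an arbitrary choice, and treating each band's lower bound as inclusive is an assumption, since the table does not say which band a boundary value such as 0.95 belongs to. The first four demo rows reuse scores from the example samples; the last row is made up to show a borderline case.

```python
from collections import Counter

# (lower bound, band name) pairs taken from the confidence table.
# Assumption: each lower bound is inclusive; the table leaves this open.
BANDS = [
    (0.95, "very confident"),
    (0.80, "confident"),
    (0.60, "moderate confidence"),
    (0.50, "low confidence (borderline)"),
]


def confidence_band(score: float) -> str:
    """Map a teacher score in [0.5, 1.0] to its band name."""
    if not 0.5 <= score <= 1.0:
        raise ValueError(f"score {score} is outside the documented 0.5-1.0 range")
    for lower, name in BANDS:
        if score >= lower:
            return name
    raise AssertionError("unreachable: 0.5 is the lowest band bound")


def keep_confident(rows, min_score: float = 0.80):
    """Return only the rows whose teacher score is at least min_score."""
    return [row for row in rows if row["score"] >= min_score]


# Demo rows shaped like dataset rows (only the fields used here).
rows = [
    {"label": 0, "score": 0.9988},
    {"label": 0, "score": 0.9978},
    {"label": 1, "score": 0.9998},
    {"label": 1, "score": 0.9997},
    {"label": 1, "score": 0.55},  # hypothetical borderline row, not from the dataset
]

band_counts = Counter(confidence_band(row["score"]) for row in rows)
confident_rows = keep_confident(rows)
print(band_counts)
print(f"Kept {len(confident_rows)} of {len(rows)} rows")
```

On the real dataset the same predicate could be passed to `Dataset.filter`, for example `train.filter(lambda x: x["score"] >= 0.80)`. Whether discarding low-confidence rows actually helps a distilled student model is an empirical question that this sketch does not answer.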
Provider:
mrdbourke