mrdbourke/food-or-drink-10m
Source: Hugging Face. Updated 2026-03-04; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/mrdbourke/food-or-drink-10m
Description:
---
dataset_info:
  features:
  - name: url
    dtype: string
  - name: re_caption
    dtype: string
  - name: org_caption
    dtype: string
  - name: sha256
    dtype: string
  - name: key
    dtype: string
  - name: re_clip_score
    dtype: float64
  - name: org_clip_score
    dtype: float64
  - name: re_length
    dtype: int64
  - name: org_length
    dtype: int64
  - name: re_gpt4v_score
    dtype: int64
  - name: org_gpt4v_score
    dtype: int64
  - name: re_caption_condition_diverse_topk
    dtype: string
  - name: re_condition_length
    dtype: int64
  - name: label
    dtype:
      class_label:
        names:
          '0': food_or_drink
          '1': not_food_or_drink
  - name: score
    dtype: float64
  splits:
  - name: train
    num_examples: 1415619
  - name: test
    num_examples: 157292
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: test
    path: data/test-*.parquet
license: apache-2.0
task_categories:
- text-classification
- zero-shot-classification
language:
- en
tags:
- food
- drink
- food-classification
- caption-classification
- modernbert
- knowledge-distillation
size_categories:
- 1M<n<10M
---
# Food or Drink 10M
A balanced binary classification dataset for detecting **food or drink** content in image captions.
## Overview
| Metric | Count (share of total) |
|---|---|
| **Total rows** | 1,572,911 |
| **Train split** | 1,415,619 (90%) |
| **Test split** | 157,292 (10%) |
| **Food/drink rows** | 785,980 (50.0%) |
| **Not food/drink rows** | 786,931 (50.0%) |
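As a quick sanity check, the counts in the table can be recomputed by hand. The sketch below only uses the numbers listed above; the helper name `share` is introduced here for illustration.

```python
# Counts copied from the Overview table.
train_rows = 1_415_619
test_rows = 157_292
food_rows = 785_980
not_food_rows = 786_931

total_rows = train_rows + test_rows

# The split breakdown and the class breakdown should give the same total.
assert total_rows == food_rows + not_food_rows == 1_572_911


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(f"Train split:    {share(train_rows, total_rows)}%")
print(f"Test split:     {share(test_rows, total_rows)}%")
print(f"Food/drink:     {share(food_rows, total_rows)}%")
print(f"Not food/drink: {share(not_food_rows, total_rows)}%")
```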
## How it was made
1. **Source**: [UCSC-VLAA/Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) (`condition_diverse_topk` subset)
2. **Teacher model**: [MoritzLaurer/ModernBERT-large-zeroshot-v2.0](https://huggingface.co/MoritzLaurer/ModernBERT-large-zeroshot-v2.0), a 400M-parameter zero-shot NLI classifier
3. **Classification**: 10M captions were streamed and classified as `food_or_drink` / `not_food_or_drink` using zero-shot NLI with candidate labels `["food or drink", "not food or drink"]`
4. **Balancing**: All food/drink rows were kept (100%), and not-food/drink rows were sampled dynamically to match their count, giving a roughly 50/50 balanced dataset
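The classification and balancing steps above might look roughly like the sketch below. It is not the script used to build this dataset: `classify_caption` and `balance_stream` are hypothetical helpers, the `classifier` argument is assumed to behave like a Hugging Face zero-shot-classification pipeline (returning `labels` and `scores` sorted from most to least likely), and the balancing rule is one simple guess at what "dynamically sampled to match" could mean.

```python
from typing import Callable, Iterable, Iterator

# Candidate labels quoted in step 3, mapped to the dataset's label names.
CANDIDATE_TO_LABEL = {
    "food or drink": "food_or_drink",
    "not food or drink": "not_food_or_drink",
}


def classify_caption(classifier: Callable, caption: str) -> dict:
    """Label one caption with a zero-shot classifier.

    Assumption: `classifier` returns a dict whose "labels" and "scores"
    lists are sorted from most to least likely.
    """
    result = classifier(caption, candidate_labels=list(CANDIDATE_TO_LABEL))
    return {
        "re_caption": caption,
        "label": CANDIDATE_TO_LABEL[result["labels"][0]],
        "score": result["scores"][0],
    }


def balance_stream(rows: Iterable[dict]) -> Iterator[dict]:
    """Keep every food/drink row; keep a negative only while negatives lag.

    Hypothetical rule: the card does not describe the exact sampling scheme.
    """
    kept_food = 0
    kept_not_food = 0
    for row in rows:
        if row["label"] == "food_or_drink":
            kept_food += 1
            yield row
        elif kept_not_food < kept_food:
            kept_not_food += 1
            yield row
```

With `transformers` installed, `pipeline("zero-shot-classification", model="MoritzLaurer/ModernBERT-large-zeroshot-v2.0")` should give a compatible `classifier`. Because the balancing is done on the fly, the kept negatives never outnumber the kept positives at any point in the stream.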
## Labels
- **`food_or_drink`** — caption describes food, beverages, meals, ingredients, drinks, or food/drink items
- **`not_food_or_drink`** — caption describes anything else (objects, scenes, people, animals, etc.)
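In the data files the label is stored as an integer class ID rather than a string (`0` and `1` in the card metadata). A minimal sketch of the mapping, with helper names that are illustrative only:

```python
# Order taken from the `class_label` names in the card metadata.
LABEL_NAMES = ["food_or_drink", "not_food_or_drink"]
LABEL_IDS = {name: idx for idx, name in enumerate(LABEL_NAMES)}


def label_name(label_id: int) -> str:
    """Convert a stored class ID (0 or 1) to its label name."""
    return LABEL_NAMES[label_id]


def is_food_or_drink(row: dict) -> bool:
    """Return True if a dataset row carries the food_or_drink label."""
    return row["label"] == LABEL_IDS["food_or_drink"]


print(label_name(0))
print(label_name(1))
```

When the dataset is loaded with the `datasets` library, `ds["train"].features["label"].int2str(0)` should perform the same conversion without a hand-written table.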
## Fields
| Field | Description |
|---|---|
| `url` | Original image URL from DataComp-1B |
| `re_caption` | AI-generated re-caption (detailed, descriptive) |
| `org_caption` | Original caption (often noisy alt-text) |
| `re_caption_condition_diverse_topk` | Condition-diverse re-caption variant |
| `label` | Classification label, stored as a class ID: `0` = `food_or_drink`, `1` = `not_food_or_drink` |
| `score` | Teacher model confidence score (0.5–1.0) |
| `sha256` | Image content hash |
| `key` | Row key from source dataset |
| `re_clip_score` / `org_clip_score` | CLIP alignment scores |
| `re_length` / `org_length` | Caption token lengths |
| `re_gpt4v_score` / `org_gpt4v_score` | GPT-4V quality scores |
| `re_condition_length` | Condition caption token length |
## Intended use
This dataset is designed for:
- **Knowledge distillation**: Fine-tuning smaller encoders (e.g., [Ettin-encoder-150m](https://huggingface.co/jhu-clsp/ettin-encoder-150m)) to replicate the teacher model's classification, then running the fine-tuned model on the full 1B-row dataset
- **Food/drink content filtering**: Filtering large-scale image-text datasets for food and drink related content
- **Caption classification research**: Studying food/drink detection in image captions
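For the distillation use case, one option is to turn each row's `label` and `score` into a soft target for the student model. The sketch below assumes that `score` is the teacher's probability for the stored label and that the two class probabilities sum to 1; the card does not state this explicitly, and `teacher_probs` is a hypothetical helper.

```python
from typing import List


def teacher_probs(label_id: int, score: float) -> List[float]:
    """Return [P(food_or_drink), P(not_food_or_drink)] for one row.

    Assumption: `score` is the probability of the stored label, so the
    other class receives the remaining 1 - score.
    """
    if label_id not in (0, 1):
        raise ValueError(f"label_id must be 0 or 1, got {label_id}")
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    probs = [1.0 - score, 1.0 - score]
    probs[label_id] = score
    return probs


# A not_food_or_drink row (ID 1) with teacher confidence 0.75.
print(teacher_probs(1, 0.75))
```

Training on these soft targets, instead of the hard 0/1 labels, lets the student see how confident the teacher was on each caption.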
## Caption types
The dataset contains multiple caption styles per image, useful for training robust classifiers:
- **`re_caption`**: Clean, AI-generated descriptions (e.g., *"A glass of red wine next to a cheese board on a wooden table"*)
- **`org_caption`**: Original noisy alt-text (e.g., *"IMG_2847 wine night"*)
- **`re_caption_condition_diverse_topk`**: Longer, more detailed AI captions
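One way to use the three caption styles is to expand each row into one training example per non-empty caption. This is a sketch, not the author's pipeline: `expand_captions` is a hypothetical helper, and reusing the row's label for every caption style is an assumption, since the card does not say which caption the teacher scored.

```python
from typing import List

CAPTION_FIELDS = (
    "re_caption",
    "org_caption",
    "re_caption_condition_diverse_topk",
)


def expand_captions(row: dict) -> List[dict]:
    """Build one {text, label, caption_type} example per non-empty caption."""
    examples = []
    for field in CAPTION_FIELDS:
        text = (row.get(field) or "").strip()
        if text:  # skip missing or whitespace-only captions
            examples.append(
                {"text": text, "label": row["label"], "caption_type": field}
            )
    return examples


row = {
    "re_caption": "A glass of red wine next to a cheese board on a wooden table",
    "org_caption": "IMG_2847 wine night",
    "re_caption_condition_diverse_topk": "",
    "label": 0,
}
for example in expand_captions(row):
    print(example["caption_type"], "->", example["text"])
```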
## Example samples
### Food or drink
| label | score | re_caption | org_caption |
|---|---|---|---|
| `food_or_drink` | 0.9988 | A glass of minty chocolate latte with a straw is placed on a yellow surface next to a bag of coffee beans. | Mint Chocolate Coffee Frappe Recipe |
| `food_or_drink` | 0.9978 | A plate of food with a mix of lettuce, meat, and sauce. The plate is on a dining table with a glass of orange juice and a bowl of salad. | Summer lunch spread with fresh OJ |
### Not food or drink
| label | score | re_caption | org_caption |
|---|---|---|---|
| `not_food_or_drink` | 0.9998 | A row of identical figures in black suits and ties is standing in a line against a white background. | O 6-Car Flat End Offset Hopper Car Set, B&O 727022 |
| `not_food_or_drink` | 0.9997 | A white wainscoting panel with a decorative molding at the top and a plain lower section is attached to a wall. | How to Install Board and Batten Wainscoting (White Painted Square over Rectangle Pattern) |
## Usage
```python
from datasets import load_dataset

# Load the full dataset
ds = load_dataset("mrdbourke/food-or-drink-10m")

# Access splits
train = ds["train"]
test = ds["test"]
print(f"Train: {len(train):,} rows")
print(f"Test: {len(test):,} rows")

# Look at a sample
print(train[0])

# Filter to food/drink only
food_only = train.filter(lambda x: x["label"] == 0)  # 0 = food_or_drink
print(f"Food/drink rows: {len(food_only):,}")

# Stream instead of downloading
ds_stream = load_dataset("mrdbourke/food-or-drink-10m", split="train", streaming=True)
for row in ds_stream:
    print(row["re_caption"], row["label"])
    break
```
## Confidence scores
The `score` field contains the teacher model's confidence. Higher scores indicate more certain classifications:
| Score range | Meaning |
|---|---|
| 0.95–1.0 | Very confident |
| 0.80–0.95 | Confident |
| 0.60–0.80 | Moderate confidence |
| 0.50–0.60 | Low confidence (borderline) |
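The ranges in the table can be applied to a row's `score` with a small helper like the one below. The table's boundaries overlap (0.95 appears in two rows), so this sketch assumes each range includes its lower bound; `confidence_bucket` is a hypothetical name.

```python
def confidence_bucket(score: float) -> str:
    """Map a teacher score in [0.5, 1.0] to the bucket names used above.

    Assumption: each range includes its lower bound, so 0.95 counts as
    "very confident" and 0.80 as "confident".
    """
    if not 0.5 <= score <= 1.0:
        raise ValueError(f"score must be in [0.5, 1.0], got {score}")
    if score >= 0.95:
        return "very confident"
    if score >= 0.80:
        return "confident"
    if score >= 0.60:
        return "moderate confidence"
    return "low confidence (borderline)"


for s in (0.9988, 0.85, 0.7, 0.55):
    print(s, "->", confidence_bucket(s))
```

A helper like this could be used, for example, to drop borderline rows before training or to inspect them separately.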
## License
Apache 2.0 — same as the source dataset and teacher model.
Provider: mrdbourke



