OGC_Cooking_Recipes
Source: ModelScope community. Updated: 2026-01-08. Indexed: 2025-11-29.
Download link:
https://modelscope.cn/datasets/racineai/OGC_Cooking_Recipes
Description:
# VDR_Cooking_Recipes - Overview
## Dataset Summary
**VDR_Cooking_Recipes is a curated multimodal dataset focused on cooking recipe documents, culinary guides, and food preparation instructions. It combines text and image data extracted from real culinary PDFs to support tasks such as RAG DSE, question answering, document search, and vision-language model training.**
## Dataset Details
### Dataset Creation
This dataset was created using our open-source tool [VDR_pdf-to-parquet](https://github.com/RacineAIOS/VDR_pdf-to-parquet).
Cooking recipe PDFs were collected from public online sources, primarily cookbooks, culinary guides, and recipe collections from a range of cuisines and cooking traditions. Each document underwent **manual cleaning and curation** before processing: blank pages, title pages, tables of contents, and other off-topic content were removed to keep the dataset focused on recipe material.
The cleaned documents were then processed page-by-page to extract text, convert pages into high-resolution images, and generate synthetic culinary queries with corresponding answers.
We used Google's **Gemini 2.5 Flash** model in a custom pipeline to generate diverse, expert-level questions and comprehensive answers that align with the content of each page.
### Data Fields
Each entry in the dataset contains:
- **`id`** (string): A unique identifier for the sample
- **`query`** (string): A synthetic culinary question generated from that page
- **`answer`** (string): A comprehensive answer to the corresponding query
- **`image`** (PIL.Image): A visual rendering of a PDF page
- **`language`** (string): The detected language of the query
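As a minimal sketch of how the fields above might be checked before use, the helpers below validate one entry and filter entries by language. The function names are hypothetical and not part of the dataset or its tooling, and the `"en"` language code in the usage note is an assumption, since the card does not state how language values are encoded. The `image` field is only checked for presence so that the sketch needs nothing beyond the standard library.

```python
from typing import Any, Dict, Iterable, List

# Field names and string types are taken from the field list above.
# `image` is only checked for presence, so this sketch does not need Pillow.
STRING_FIELDS = ("id", "query", "answer", "language")
ALL_FIELDS = STRING_FIELDS + ("image",)


def validate_entry(entry: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one entry (empty if none)."""
    problems = []
    for name in ALL_FIELDS:
        if name not in entry:
            problems.append(f"missing field: {name}")
    for name in STRING_FIELDS:
        if name in entry and not isinstance(entry[name], str):
            problems.append(f"field {name} should be a string")
    return problems


def filter_by_language(
    entries: Iterable[Dict[str, Any]], language: str
) -> List[Dict[str, Any]]:
    """Keep only entries whose `language` field equals the given value."""
    return [e for e in entries if e.get("language") == language]
```

For example, `filter_by_language(entries, "en")` would keep English queries if the dataset stores two-letter codes; inspect a few samples first to confirm the actual values.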
### Data Generation
Each page produces 4 unique entries: a main culinary query, a secondary one, a visual-based question, and a multimodal semantic query, all with their corresponding answers.
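The four-entries-per-page structure can be illustrated with a short sketch. This is not the authors' pipeline: the kind labels, the id scheme, and the function names are hypothetical, and the published entries carry no field recording which of the four kinds a query belongs to. The `image` field is omitted to keep the example dependency-free.

```python
from typing import Dict, List, Tuple

# The four query kinds named in the text. These short labels are hypothetical;
# the dataset itself has no field that stores them.
QUERY_KINDS = ("main", "secondary", "visual", "multimodal_semantic")


def expand_page(
    page_id: str, qa_pairs: List[Tuple[str, str]], language: str
) -> List[Dict[str, str]]:
    """Turn the four (query, answer) pairs of one page into four entries."""
    if len(qa_pairs) != len(QUERY_KINDS):
        raise ValueError(
            f"expected {len(QUERY_KINDS)} pairs, got {len(qa_pairs)}"
        )
    entries = []
    for kind, (query, answer) in zip(QUERY_KINDS, qa_pairs):
        entries.append({
            "id": f"{page_id}_{kind}",  # hypothetical id scheme
            "query": query,
            "answer": answer,
            "language": language,
        })
    return entries


def expected_entry_count(num_pages: int) -> int:
    """Four entries per page, as described above."""
    return len(QUERY_KINDS) * num_pages
```

Under this scheme a corpus of 250 curated pages would yield `expected_entry_count(250)`, that is 1000 entries, provided every page produces all four queries.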
## Supported Tasks
This dataset is designed to support:
- **Question Answering**: Training and evaluating models on culinary and recipe content
- **Visual Question Answering**: Multimodal understanding of recipe documents
- **Document Retrieval**: Developing search systems for culinary and recipe documents
- **Text Generation**: Automated question-answer generation from culinary sources
- **Domain-Specific Applications**: Recipe analysis, cooking assistance, and culinary knowledge understanding
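For the document retrieval task, one common way to score a system is recall@k: the share of queries whose source page appears among the top k retrieved pages. The sketch below assumes that the page a query was generated from is treated as its single relevant result, which is an evaluation choice rather than something the card specifies. The function name and the id strings in the usage note are hypothetical.

```python
from typing import Dict, List


def recall_at_k(
    rankings: Dict[str, List[str]], relevant: Dict[str, str], k: int
) -> float:
    """Fraction of queries whose relevant page is in the top-k results.

    rankings: query id -> page ids ordered from best to worst match.
    relevant: query id -> id of the page the query was generated from.
    """
    if not rankings:
        return 0.0
    hits = 0
    for query_id, ranked_pages in rankings.items():
        if relevant[query_id] in ranked_pages[:k]:
            hits += 1
    return hits / len(rankings)
```

With `rankings = {"q1": ["p1", "p2"], "q2": ["p3", "p2"]}` and `relevant = {"q1": "p1", "q2": "p2"}`, `recall_at_k(rankings, relevant, 1)` is 0.5 and `recall_at_k(rankings, relevant, 2)` is 1.0.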
## Dataset Use Cases
- Training and evaluating vision-language models on culinary and recipe content
- Developing multimodal search or retrieval systems for cooking and recipe documents
- Research in automated question-answer generation from culinary and recipe sources
- Enhancing tools for recipe analysis, cooking assistance, and culinary knowledge understanding
- Supporting culinary education and cooking assistant applications
## Dataset Curators
- **Yumeng Ye**
- **Léo Appourchaux**
Provider:
maas
Created:
2025-08-23



