Video-Reason/video-mcp

Name: Video-Reason/video-mcp
Creator: Video-Reason
Published: 2026-04-01 10:27:54
License: No description available

Hugging Face · Updated 2026-04-01 · Indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/Video-Reason/video-mcp

Resource overview:

---
license: other
license_name: derivative-mixed
license_link: LICENSE
task_categories:
- visual-question-answering
- video-classification
tags:
- video
- mcqa
- vqa
- video-generation
- wan2.2
- i2v
- vbvr
size_categories:
- 1K<n<10K
---

# Video-MCP

<a href="https://video-reason.com" target="_blank">
  <img alt="Project Page" src="https://img.shields.io/badge/Project%20-%20Homepage-4285F4" height="20" />
</a>
<a href="https://github.com/Video-Reason/VBVR-EvalKit" target="_blank">
  <img alt="Code" src="https://img.shields.io/badge/Evaluation_code-VBVR_Bench-100000?style=flat-square&logo=github&logoColor=white" height="20" />
</a>
<a href="https://github.com/Video-Reason/VBVR-Wan2.2" target="_blank">
  <img alt="Code" src="https://img.shields.io/badge/Training_code-VBVR_Wan2.2-100000?style=flat-square&logo=github&logoColor=white" height="20" />
</a>
<a href="https://github.com/Video-Reason/VBVR-DataFactory" target="_blank">
  <img alt="Code" src="https://img.shields.io/badge/Data_code-VBVR_DataFactory-100000?style=flat-square&logo=github&logoColor=white" height="20" />
</a>
<a href="https://huggingface.co/papers/2602.20159" target="_blank">
  <img alt="arXiv" src="https://img.shields.io/badge/arXiv-VBVR-red?logo=arxiv" height="20" />
</a>
<a href="https://huggingface.co/Video-Reason/VBVR-Wan2.2" target="_blank">
  <img alt="Leaderboard" src="https://img.shields.io/badge/%F0%9F%A4%97%20_VBVR_Wan2.2-Model-ffc107?color=ffc107&logoColor=white" height="20" />
</a>
<a href="https://huggingface.co/datasets/Video-Reason/VBVR-Dataset" target="_blank">
  <img alt="Dataset" src="https://img.shields.io/badge/%F0%9F%A4%97%20_VBVR_Dataset-Data-ffc107?color=ffc107&logoColor=white" height="20" />
</a>
<a href="https://huggingface.co/datasets/Video-Reason/VBVR-Bench-Data" target="_blank">
  <img alt="Bench Data" src="https://img.shields.io/badge/%F0%9F%A4%97%20_VBVR_Bench-Data-ffc107?color=ffc107&logoColor=white" height="20" />
</a>
<a href="https://huggingface.co/spaces/Video-Reason/VBVR-Bench-Leaderboard" target="_blank">
  <img alt="Leaderboard" src="https://img.shields.io/badge/%F0%9F%A4%97%20_VBVR_Bench-Leaderboard-ffc107?color=ffc107&logoColor=white" height="20" />
</a>

**Video-MCP** is a synthetic video dataset for training and evaluating video generation models on **multiple-choice question-answering (MCQA)** tasks. Each sample is a short video clip (~5 seconds) where a visual question-answering prompt is embedded directly into the video frames, and the correct answer is revealed by progressively highlighting one of four answer boxes (A/B/C/D) over the duration of the clip.

The dataset is designed for fine-tuning image-to-video models (specifically **Wan2.2-I2V-A14B**) to produce videos that "answer" visual questions by highlighting the correct option. Output follows the **[VBVR DataFactory](https://github.com/video-reason/VBVR-DataFactory)** directory convention.

## Examples

Each clip starts with no answer highlighted, then progressively reveals the correct choice over ~5 seconds:

### CoreCognition (M-1) - General Visual Reasoning

| Answer: B | Answer: B |
|---|---|
| ![corecognition 0](examples/corecognition_0.gif) | ![corecognition 1](examples/corecognition_1.gif) |

### ScienceQA (M-2) - Science Education

| Answer: A | Answer: A |
|---|---|
| ![scienceqa 0](examples/scienceqa_0.gif) | ![scienceqa 1](examples/scienceqa_1.gif) |

### MathVision (M-3) - Competition Math

| Answer: A | Answer: D |
|---|---|
| ![mathvision 0](examples/mathvision_0.gif) | ![mathvision 1](examples/mathvision_1.gif) |

### PhyX (M-4) - Physics Reasoning

| Answer: C | Answer: C |
|---|---|
| ![phyx 0](examples/phyx_0.gif) | ![phyx 1](examples/phyx_1.gif) |

## Dataset Details

| Property | Value |
|---|---|
| **Version** | 1.0 |
| **Total samples** | 6,912 |
| **Video resolution** | 832x480 |
| **Frame count** | 81 frames per clip |
| **Frame rate** | 16 FPS |
| **Duration** | ~5.06 seconds per clip |
| **Codec** | H.264, yuv420p, MP4 container |
| **Highlight style** | darken (default) |

## Source Datasets

Video-MCP draws from four publicly available MCQA-VQA datasets on Hugging Face:

| Generator ID | Name | Source | Samples | Domain |
|---|---|---|---|---|
| M-1 | corecognition | `williamium/CoreCognition` | 753 | General visual reasoning |
| M-2 | scienceqa | `derek-thomas/ScienceQA` | 3,905 | Science education (image-only subset) |
| M-3 | mathvision | `MathLLMs/MathVision` | 1,254 | Competition math with diagrams |
| M-4 | phyx | `Cloudriver/PhyX` | 1,000 | Physics reasoning |

All source datasets are filtered to include only samples that have an associated image and exactly four answer choices (A/B/C/D).

## Data Structure

Each sample follows the [VBVR DataFactory](https://github.com/video-reason/VBVR-DataFactory) directory convention:

```
{generator_id}_{name}_data-generator/
  clip_config.json
  {name}_task/
    {name}_{NNNN}/
      first_frame.png      # Frame 0: question visible, no highlight
      prompt.txt           # Plain-text question, choices, and answer
      final_frame.png      # Last frame: correct answer fully highlighted
      ground_truth.mp4     # Full clip with progressive answer reveal
      original/
        question.json      # Structured metadata (JSON)
        <source_image>     # Original image from source dataset
```

### File Descriptions

| File | Description |
|---|---|
| `first_frame.png` | The opening frame showing the question panel (image + question text + four choices) with A/B/C/D answer boxes in the corners. No answer is highlighted. |
| `final_frame.png` | The closing frame with the correct answer box fully highlighted. |
| `ground_truth.mp4` | The complete video clip. The correct answer gradually highlights from frame 1 to the final frame (linear fade-in). |
| `prompt.txt` | Human-readable text: question, choices (A/B/C/D), and the correct answer letter. |
| `original/question.json` | Structured JSON with fields: `dataset`, `source_id`, `question`, `choices`, `answer`, `original_image_filename`. |
| `original/<image>` | The raw source image preserved with its original filename. |
| `clip_config.json` | Generator-level config: `fps`, `seconds`, `num_frames`, `width`, `height`. |

### Frame Layout

Each frame uses a two-column layout:

- **Left column**: the source VQA image, scaled to fill.
- **Right column**: question text and the four answer options.
- **Corners**: A (top-left), B (top-right), C (bottom-left), D (bottom-right) answer boxes.

### prompt.txt Format

```
What color is the object in the image?
A: Red
B: Blue
C: Green
D: Yellow
Answer: A
```

## Video Specifications

These defaults align with **Wan2.2-I2V-A14B** fine-tuning constraints:

- **Resolution**: 832x480 (width and height divisible by 8 for VAE spatial compression)
- **Frames**: 81 (satisfies `1 + 4k` for VAE temporal grid)
- **FPS**: 16
- **Duration**: ~5.06 seconds
- **Codec**: H.264, yuv420p pixel format

## Intended Use

- Fine-tuning image-to-video generation models to produce MCQA-answering videos
- Evaluating video generation models on structured visual reasoning tasks
- Research on embedding structured UI interactions into generated video

## Limitations

- All source questions are filtered to exactly 4 choices (A/B/C/D); questions with fewer or more options are excluded.
- The answer highlight is a simple linear fade-in; no complex visual dynamics.
- Source images and questions inherit any biases or errors from the upstream HF datasets.
- The dataset uses a single fixed resolution (832x480) and frame count (81).

## Citation

If you use this dataset, please cite the source datasets:

- **CoreCognition**: `williamium/CoreCognition` on Hugging Face
- **ScienceQA**: Lu et al., "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering" (NeurIPS 2022)
- **MathVision**: Wang et al., "MathVision: Measuring Multimodal Mathematical Reasoning with Benchmarks" (2024)
- **PhyX**: `Cloudriver/PhyX` on Hugging Face

## License

This dataset is a derivative work. Each source dataset has its own license terms. Users should verify compliance with upstream licenses before redistribution.

## Generation Code

[https://github.com/video-reason/video-mcp](https://github.com/video-reason/video-mcp)
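
The figures quoted in the dataset card (resolution, frame count, duration, and per-source sample counts) can be cross-checked with a little arithmetic. The following is a minimal sketch rather than part of the dataset's tooling: the function name `check_clip_spec` and the dictionary layout are hypothetical, and the two constraints it tests are simply the ones the card states (dimensions divisible by 8, frame count of the form `1 + 4k`).

```python
def check_clip_spec(width: int, height: int, num_frames: int, fps: int) -> dict:
    """Check a clip configuration against the constraints stated in the card.

    Hypothetical helper for illustration; not taken from the video-mcp code.
    """
    return {
        # Width and height must both be divisible by 8 (VAE spatial compression).
        "spatial_ok": width % 8 == 0 and height % 8 == 0,
        # Frame count must have the form 1 + 4k (VAE temporal grid).
        "temporal_ok": (num_frames - 1) % 4 == 0,
        # Duration in seconds follows from frame count and frame rate.
        "duration_s": num_frames / fps,
    }


# Values taken from the "Dataset Details" table.
spec = check_clip_spec(width=832, height=480, num_frames=81, fps=16)
print(spec)  # {'spatial_ok': True, 'temporal_ok': True, 'duration_s': 5.0625}

# Per-source sample counts from the "Source Datasets" table.
source_counts = {"corecognition": 753, "scienceqa": 3905, "mathvision": 1254, "phyx": 1000}
total_samples = sum(source_counts.values())
print(total_samples)  # 6912
```

The computed duration of 5.0625 seconds is what the card rounds to "~5.06 seconds", and the four source counts add up to the stated total of 6,912 samples.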

Provided by:

Video-Reason

Collected Summary

Dataset Introduction

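Construction Method

At the intersection of video generation and visual question answering, the Video-MCP dataset is built with a systematic synthesis pipeline. It combines four public multiple-choice visual question answering datasets (CoreCognition, ScienceQA, MathVision, and PhyX) and keeps only samples that include an image and have exactly four answer choices. Each sample is rendered as a clip of about 5 seconds using a two-column layout with the image on the left and the text on the right. The question and choices are embedded in the video frames, and the correct answer box is highlighted with a linear fade-in. The resulting videos have 81 frames at a resolution of 832x480, matching the training requirements of the target video generation model.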
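
The linear fade-in described above can be illustrated with a per-frame highlight strength. This is a minimal sketch under stated assumptions: the dataset card only says that the highlight fades in linearly from no highlight on the first frame to a full highlight on the last frame, so the exact formula, the function names `highlight_strength` and `frame_timestamp`, and the 0-to-1 scale are all hypothetical. How the "darken" style applies this strength to pixels is not specified in the source and is not modelled here.

```python
def highlight_strength(frame_index: int, num_frames: int = 81) -> float:
    """Linear fade-in: 0.0 on the first frame, 1.0 on the last frame.

    Assumed formula for illustration; the real generator may differ.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames for a fade")
    if not 0 <= frame_index < num_frames:
        raise ValueError("frame_index out of range")
    return frame_index / (num_frames - 1)


def frame_timestamp(frame_index: int, fps: int = 16) -> float:
    """Start time of a frame in seconds at the given frame rate."""
    return frame_index / fps


# One strength value per frame of an 81-frame clip.
strengths = [highlight_strength(i) for i in range(81)]
print(strengths[0], strengths[40], strengths[80])  # 0.0 0.5 1.0
print(frame_timestamp(40))  # 2.5
```

Under this assumption the highlight reaches half strength at frame 40, which starts 2.5 seconds into the clip.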

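Usage

The dataset is mainly used to fine-tune and evaluate image-to-video generation models. Researchers can train a model to generate, from an input image and question, a video sequence that progressively reveals the answer. For evaluation, generated videos can be compared with the ground-truth videos in the dataset on the timing of the answer highlight and on content consistency, which quantifies the model's ability on visual reasoning tasks. The dataset follows the VBVR DataFactory directory structure. Each sample contains the first frame, the final frame, the full video, the text prompt, and the original metadata, so it can be loaded directly and integrated into existing training and evaluation pipelines for research on video generation in interactive visual question answering.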
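
Loading a sample from this directory layout could look like the sketch below. It is an illustration only: `parse_prompt` and `load_sample` are hypothetical helpers, not functions from the dataset's code, and the parser assumes `prompt.txt` has one line per choice in the `A: ...` form followed by an `Answer: X` line, as in the example shown in the dataset card. Questions whose own text starts with a pattern like `A:` would need more careful handling.

```python
import json
import re
from pathlib import Path

# Matches a choice line such as "B: Blue".
CHOICE_RE = re.compile(r"^([A-D]):\s*(.*)$")


def parse_prompt(text: str) -> dict:
    """Split prompt.txt content into question, choices, and answer letter."""
    question_lines, choices, answer = [], {}, None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # tolerate blank separator lines
        if line.startswith("Answer:"):
            answer = line.split(":", 1)[1].strip()
            continue
        match = CHOICE_RE.match(line)
        if match:
            choices[match.group(1)] = match.group(2)
        else:
            question_lines.append(line)
    return {"question": " ".join(question_lines), "choices": choices, "answer": answer}


def load_sample(sample_dir: Path) -> dict:
    """Collect the files of one {name}_{NNNN}/ sample directory."""
    sample = parse_prompt((sample_dir / "prompt.txt").read_text(encoding="utf-8"))
    with open(sample_dir / "original" / "question.json", encoding="utf-8") as fh:
        sample["metadata"] = json.load(fh)
    sample["first_frame"] = sample_dir / "first_frame.png"
    sample["final_frame"] = sample_dir / "final_frame.png"
    sample["video"] = sample_dir / "ground_truth.mp4"
    return sample


# The prompt.txt example from the dataset card.
example = """What color is the object in the image?
A: Red
B: Blue
C: Green
D: Yellow
Answer: A"""
parsed = parse_prompt(example)
print(parsed["answer"], parsed["choices"]["A"])  # A Red
```

To iterate over a downloaded copy, a glob such as `Path(root).glob("*_data-generator/*_task/*")` should yield the sample directories, assuming the directory convention is followed exactly.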

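Background and Challenges

Background Overview

At the intersection of video generation and multimodal reasoning, the Video-MCP dataset was built by the Video-Reason research team and published on Hugging Face in April 2026. It aims to advance the ability of image-to-video generation models on structured visual question answering tasks. The dataset draws on four public multiple-choice visual question answering datasets (CoreCognition, ScienceQA, MathVision, and PhyX) and synthesizes clips of about 5 seconds in which the visual question and its four choices are embedded in the frames and the correct answer is revealed by a progressive highlight. Its central research question is how to make a generative model not only produce coherent video but also carry out visual reasoning and output an explicit decision, which gives a benchmark for evaluating the controllability and logical consistency of video generation models.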

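Current Challenges

The domain challenge Video-MCP targets is improving the precision with which video generation models reason about and present answers in visual question answering. This requires the model to understand the image content, the question text, and the logic of the choices, and to visualize the answer selection accurately and consistently over time. In building the dataset, the main challenges come from aligning and standardizing data from multiple sources: ensuring that every sample includes an image and has exactly four choices, and designing a uniform video layout and highlight animation that fit the input constraints of a specific model (such as Wan2.2-I2V-A14B). In addition, while preserving the semantic integrity of the source data, the synthesized videos are limited in expressiveness by their simple visual dynamics and fixed resolution.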

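Common Scenarios

Typical Use Cases

At the intersection of visual question answering and video generation, the Video-MCP dataset provides a structured benchmark for fine-tuning image-to-video models. Its core use case is training models to generate short video sequences that reveal the answer to a multiple-choice question by progressively highlighting the correct answer box, simulating a visual reasoning process. The dataset turns static visual questions into a temporal, dynamic form and provides a data foundation for studying how video generation models perform on structured interactive tasks.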

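Research Problems Addressed

The dataset targets the lack of controllability and interpretability in video generation models on complex visual reasoning tasks. By embedding an explicit multiple-choice answering mechanism, it pushes models from pure content synthesis toward logical inference. It offers an evaluation framework that connects visual understanding with temporal generation and a quantitative measure of model ability in specialized domains such as science education and mathematical reasoning.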

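Practical Applications

In practice, techniques derived from Video-MCP could support intelligent education systems that generate animated worked-solution videos to help students understand science concepts and math problems. In interactive media creation, the framework supports automatic generation of teaching content with guiding visual feedback. Its highlight mechanism also suggests an approach for accessible visual aids, where timed visual cues make information clearer and easier to access.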

Recent Research on the Dataset