
sshaar/movierecapsqa

Hugging Face, updated 2026-03-24, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/sshaar/movierecapsqa
Description:
---
license: cc-by-4.0
task_categories:
- visual-question-answering
- video-text-to-text
- question-answering
language:
- en
pretty_name: MovieRecapsQA Benchmark
size_categories:
- 1K<n<10K
tags:
- video-qa
- long-video-understanding
- multimodal-qa
- video-and-text-qa
- video-question-answering
configs:
- config_name: default
  data_files:
  - split: questions
    path: "data/questions.json"
  - split: recaps
    path: "data/recaps.json"
  - split: segments
    path: "data/segments.json"
  - split: facts
    path: "data/facts.json"
---

# MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark

## Benchmark Description

MovieRecapsQA is a benchmark for evaluating **multimodal question answering** with **video and text as context**. The benchmark assesses long-form video understanding in vision-language models using 8,263 question-answer pairs about YouTube movie recap videos. Questions require reasoning over both visual content (video frames) and textual content (dialogue/subtitles), aligned with atomic facts and movie subtitles for temporal grounding. It is designed to evaluate multimodal comprehension across dialogue, visual scenes, and narrative understanding in extended video content.

**Copyright Notice**: This benchmark does NOT include:

- Full-length movie files
- Movie subtitle files
- YouTube recap video files or captions

URLs are provided to enable researchers to access movie subtitles and IMDb metadata through proper channels.

### Benchmark Structure

The benchmark is organized into 4 normalized tables to eliminate fact duplication:

#### 1. `recaps` (74 entries)

Each entry represents a unique YouTube recap video. Multiple recap videos may cover the same movie.

- `video_id`: Unique identifier for the YouTube recap video
- `movie_name`: Name of the movie that this recap video is about
- `subtitle_url`: Direct download link for the movie's subtitle file
- `imdb_url`: IMDb page URL for the movie

#### 2. `segments` (1,430 entries)

- `video_id`: Reference to recaps table
- `segment_id`: Segment number within the recap video
- `movie_start_time`: Start timestamp in the full-length movie subtitle file (seconds)
- `movie_end_time`: End timestamp in the full-length movie subtitle file (seconds)
- `recap_start_time`: Start timestamp in the YouTube recap video (seconds)
- `recap_end_time`: End timestamp in the YouTube recap video (seconds)
- `fact_ids`: List of fact IDs associated with this segment

#### 3. `questions` (8,263 entries)

- `video_id`: Reference to recaps table
- `segment_id`: Reference to segments table
- `question_id`: Question number within the segment
- `question`: The question text
- `answer`: The answer text
- `verbose_question`: More detailed version of the question
- `vague_answer`: Intentionally vague version of the answer
- `aligned_fact_ids`: List of fact IDs aligned with this question

#### 4. `facts` (unique facts across all segments)

- `fact_id`: Global unique identifier for the fact
- `video_id`: Reference to recaps table
- `segment_id`: Reference to segments table
- `fact`: The atomic fact text (cleaned, no numbering prefix)

### Loading the Benchmark

The benchmark data is stored as JSON files in the `data/` directory.
Load them as follows:

```python
from datasets import load_dataset
import json

# Load each table
recaps = json.loads(load_dataset("sshaar/movierecapsqa", data_files="data/recaps.json", split="train")[0]["text"])
segments = json.loads(load_dataset("sshaar/movierecapsqa", data_files="data/segments.json", split="train")[0]["text"])
questions = json.loads(load_dataset("sshaar/movierecapsqa", data_files="data/questions.json", split="train")[0]["text"])
facts = json.loads(load_dataset("sshaar/movierecapsqa", data_files="data/facts.json", split="train")[0]["text"])

# Or download directly from the repository
from huggingface_hub import hf_hub_download
import json

recaps_path = hf_hub_download(repo_id="sshaar/movierecapsqa", filename="data/recaps.json", repo_type="dataset")
with open(recaps_path) as f:
    recaps = json.load(f)

segments_path = hf_hub_download(repo_id="sshaar/movierecapsqa", filename="data/segments.json", repo_type="dataset")
with open(segments_path) as f:
    segments = json.load(f)

questions_path = hf_hub_download(repo_id="sshaar/movierecapsqa", filename="data/questions.json", repo_type="dataset")
with open(questions_path) as f:
    questions = json.load(f)

facts_path = hf_hub_download(repo_id="sshaar/movierecapsqa", filename="data/facts.json", repo_type="dataset")
with open(facts_path) as f:
    facts = json.load(f)
```

### Accessing External Resources

The benchmark provides URLs to external resources that are not included due to copyright:

- **Movie Subtitles**: Use `subtitle_url` from the `recaps` table to download subtitle files for the full-length movies
- **Movie Metadata**: Use `imdb_url` from the `recaps` table for information about the full-length movies
- **YouTube Recap Videos**: Use `video_id` to access the original recap videos on YouTube (not the full-length movies)
- **Temporal Alignment**: Use `movie_start_time`/`movie_end_time` to locate dialogue in movie subtitles, and `recap_start_time`/`recap_end_time` to locate segments in the recap videos

### Benchmark Statistics

- **Total Recap Videos**: 74
- **Total Segments**: 1,430
- **Total Questions**: 8,263

### Benchmark Results

Performance of state-of-the-art vision-language models and human annotators on MovieRecapsQA. Results are reported as mean scores (scale 1-5) across different question types and categories.

**Question Types:**

- **Dialogue**: Questions about spoken content in the recap video
- **Scene**: Questions about visual content only
- **Multimodal**: Questions requiring both visual and dialogue understanding

**Question Categories:**

- **CRD**: Character Reasoning & Dialogue
- **NPA**: Narrative Progression & Action
- **STA**: Story Theme & Analysis
- **TEMP**: Temporal Understanding
- **TH**: Theory of Mind

#### Relevance Scores

| Model | Overall | Dialogue | Scene | Multimodal | CRD | NPA | STA | TEMP | TH |
|-------|---------|----------|-------|------------|-----|-----|-----|------|-----|
| **Best Human*** | **4.59** | -- | -- | -- | -- | -- | -- | -- | -- |
| **Avg. Human*** | 4.01 | 4.27 | 3.97 | 4.00 | 4.05 | 3.98 | **4.41** | -- | 4.11 |
| **---** | **---** | **---** | **---** | **---** | **---** | **---** | **---** | **---** | **---** |
| **GPT-4o** | 3.97 | 3.71 | 3.55 | 3.84 | 3.78 | 3.73 | 3.32 | 3.59 | 3.76 |
| **Amazon Nova Lite** | 3.93 | **4.12** | **3.82** | **3.99** | **3.97** | **3.95** | 3.81 | **3.94** | **4.23** |
| **Claude 3.5 Sonnet** | 3.92 | 3.88 | 3.71 | 3.83 | 3.86 | 3.72 | 3.61 | **3.99** | 3.82 |
| **Qwen2.5-VL** | 3.83 | 3.93 | 3.69 | 3.72 | 3.78 | 3.75 | 3.80 | 3.90 | 3.91 |
| **Gemini-2.5-Flash** | 3.70 | 3.66 | 3.45 | 3.67 | 3.67 | 3.58 | 3.38 | 3.41 | 3.62 |
| **MiniCPM-o** | 3.61 | 3.54 | 3.55 | 3.52 | 3.52 | 3.50 | 3.56 | 3.66 | 3.74 |
| **LLaVA-NeXT-Video** | 3.35 | 3.36 | 3.35 | 3.33 | 3.30 | 3.31 | 3.37 | 3.54 | 3.52 |

#### Factuality Scores

| Model | Overall | Dialogue | Scene | Multimodal | CRD | NPA | STA | TEMP | TH |
|-------|---------|----------|-------|------------|-----|-----|-----|------|-----|
| **Best Human*** | **4.53** | -- | -- | -- | -- | -- | -- | -- | -- |
| **Avg. Human*** | 4.01 | 4.17 | 3.84 | 3.98 | 4.07 | 3.86 | **4.15** | -- | **4.14** |
| **---** | **---** | **---** | **---** | **---** | **---** | **---** | **---** | **---** | **---** |
| **GPT-4o** | 3.99 | **3.76** | 3.43 | **3.66** | **3.73** | **3.64** | 3.10 | **3.58** | 3.55 |
| **Claude 3.5 Sonnet** | 3.76 | 3.69 | 3.17 | 3.58 | 3.65 | 3.42 | 3.12 | 3.30 | 3.44 |
| **Amazon Nova Lite** | 3.53 | 3.73 | 3.35 | 3.58 | 3.59 | 3.60 | 3.15 | 3.51 | 3.37 |
| **Qwen2.5-VL** | 3.47 | 3.50 | 3.28 | 3.35 | 3.42 | 3.40 | 3.07 | 3.39 | 3.27 |
| **Gemini-2.5-Flash** | 3.26 | 3.34 | 2.65 | 3.03 | 3.15 | 3.00 | 2.57 | 2.53 | 3.16 |
| **MiniCPM-o** | 3.21 | 3.15 | 3.00 | 3.09 | 3.14 | 3.10 | 2.76 | 3.02 | 3.02 |
| **LLaVA-NeXT-Video** | 2.96 | 2.99 | 2.88 | 2.88 | 2.99 | 2.90 | 2.65 | 3.04 | 2.78 |

*Human performance evaluated on a sample of 118 questions. TEMP scores for humans were not available. **Bold** indicates the best model score in each column (excluding human benchmarks).

For complete results including ablation studies (frame-only and dialogue-only variants), see the paper.

### Citation

If you use this benchmark in your research, please cite our CVPR 2026 paper:

```bibtex
@inproceedings{shaar2026movierecapsqa,
  title={MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark},
  author={Shaar, Shaden and Thymes, Bradon and Chaixanien, Sirawut and Cardie, Claire and Hariharan, Bharath},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026},
  url={https://arxiv.org/abs/2601.02536}
}
```

### License

This benchmark is released under CC-BY-4.0 license.

**External Resources**: Full-length movie files, movie subtitle files, and YouTube recap video files/captions are NOT included in this benchmark to respect copyright.
URLs are provided to access movie subtitles and metadata through proper channels, subject to their respective terms of service.
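
### Example: Joining the Tables

Because the four tables under Benchmark Structure are normalized, a question carries only `aligned_fact_ids` and a `(video_id, segment_id)` reference rather than the fact text or timestamps themselves. The sketch below shows one way to resolve those references. It assumes each table has been loaded as a list of dictionaries with the field names listed above, and that `(video_id, segment_id)` identifies a segment uniquely; neither assumption has been checked against the released files. The helper names and the toy records are invented for illustration and are not part of the benchmark.

```python
def index_facts(facts):
    """Map each fact_id to its fact text."""
    return {f["fact_id"]: f["fact"] for f in facts}


def index_segments(segments):
    """Map (video_id, segment_id) to the segment record (assumed unique)."""
    return {(s["video_id"], s["segment_id"]): s for s in segments}


def resolve_question(question, fact_by_id, segment_by_key):
    """Attach aligned fact texts and segment time spans to one question."""
    seg = segment_by_key[(question["video_id"], question["segment_id"])]
    return {
        "question": question["question"],
        "answer": question["answer"],
        "aligned_facts": [fact_by_id[i] for i in question["aligned_fact_ids"]],
        "movie_span": (seg["movie_start_time"], seg["movie_end_time"]),
        "recap_span": (seg["recap_start_time"], seg["recap_end_time"]),
    }


# Toy records with invented values, shaped like the documented schema.
facts = [
    {"fact_id": 0, "video_id": "vid_a", "segment_id": 0,
     "fact": "A detective arrives at the station."},
    {"fact_id": 1, "video_id": "vid_a", "segment_id": 0,
     "fact": "The detective opens a sealed envelope."},
]
segments = [
    {"video_id": "vid_a", "segment_id": 0,
     "movie_start_time": 120.0, "movie_end_time": 185.5,
     "recap_start_time": 10.0, "recap_end_time": 24.0,
     "fact_ids": [0, 1]},
]
questions = [
    {"video_id": "vid_a", "segment_id": 0, "question_id": 0,
     "question": "What does the detective open?",
     "answer": "A sealed envelope.",
     "verbose_question": "In this segment, what does the detective open?",
     "vague_answer": "Something sealed.",
     "aligned_fact_ids": [1]},
]

resolved = resolve_question(questions[0], index_facts(facts), index_segments(segments))
print(resolved["aligned_facts"], resolved["movie_span"])
```

Building the two dictionaries once and reusing them keeps each lookup constant-time, which matters little for the toy data but avoids rescanning the fact list for every one of the 8,263 questions.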
Provider:
sshaar