omkarthawakar/CoVR-R

Name: omkarthawakar/CoVR-R
Creator: omkarthawakar
Published: 2026-03-21 11:39:06
License: No description available

Source: Hugging Face. Updated 2026-03-21; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/omkarthawakar/CoVR-R


Description:

---
pretty_name: CoVR-R
language:
- en
tags:
- video-retrieval
- multimodal
- computer-vision
- video-text-retrieval
- benchmark
- reasoning
task_categories:
- text-retrieval
size_categories:
- 1K<n<10K
---

# CoVR-R: Reason-Aware Composed Video Retrieval

CoVR-R is a reasoning-aware benchmark for composed video retrieval. Given a reference video and a textual modification, the goal is to retrieve the correct target video that reflects the requested change and its implied visual consequences.

This dataset is designed for settings where simple keyword overlap is not enough. Many edits require reasoning about state transitions, temporal progression, camera changes, and cause-effect relationships. For example, an edit such as "change typing to frustration" may imply visible behaviors like tense motion, stopping work, or closing a laptop, even if those effects are not stated explicitly.

The dataset accompanies the paper:

**CoVR-R: Reason-Aware Composed Video Retrieval**

CVPR 2026 (Findings)

Omkar Thawakar, Dmitry Demidov, Vaishnav Potlapalli, Sai Prasanna Teja Reddy Bogireddy, Viswanatha Reddy Gajjala, Alaa Mostafa Lasheen, Rao Muhammad Anwer, Fahad Khan

## Dataset Summary

CoVR-R contains curated triplets for composed video retrieval:

- A reference video
- A textual modification instruction
- A target video
- A reasoning-aware target description for the edited outcome

This release contains:

- `2634` total examples
- `1365` examples from `webvid`
- `1269` examples from `ss2`
- `4425` video files in the accompanying `videos/` folder

Each example is intended to test retrieval under implicit edits, including:

- Object and scene state transitions
- Temporal phase progression
- Action changes and downstream effects
- Cinematographic edits such as framing or camera behavior
- Changes in pacing or visual emphasis

## Why This Dataset Matters

Most prior composed retrieval benchmarks can often be solved with literal text matching.
CoVR-R is built to stress reasoning beyond surface overlap. The benchmark focuses on edits whose consequences are visually important but not fully spelled out in the text.

This makes it useful for evaluating systems that must reason from:

- "before" video evidence
- edit instructions
- likely "after" visual outcomes

In the accompanying paper, this benchmark is used to evaluate a reason-then-retrieve pipeline where a multimodal model first infers the implied after-effects of an edit and then retrieves the best matching target video.

## Supported Tasks and Use Cases

CoVR-R is suitable for:

- Composed video retrieval
- Edit-conditioned retrieval
- Retrieval with implicit reasoning
- Video-language reasoning benchmarks
- Evaluation of multimodal models on causal and temporal understanding
- Studying retrieval under cinematographic and state-change edits

Example research use cases:

- Compare keyword-based retrieval against reasoning-aware retrieval
- Evaluate zero-shot multimodal retrieval systems
- Train or assess reranking models for edit-conditioned retrieval
- Benchmark models on temporal, causal, and camera-aware reasoning
- Analyze failure modes on hard distractors

## Data Structure

The released JSON file is a list with two top-level groups:

- `webvid`
- `ss2`

Each group contains a list of examples. Each example has the following fields:

- `id`: example id within the split
- `video_source`: source/reference video id
- `video_target`: target video id
- `description_source`: source video caption or description
- `description_target`: target video caption or description
- `modification_text`: the edit instruction applied to the source
- `reasoned_target_video_description__main`: a reasoning-aware target description for the edited outcome
- `id_original`: original example identifier

## Release Note

This Hugging Face release excludes the internal field `reasoned_target_video_description__thinking`.
The public dataset keeps only `reasoned_target_video_description__main`, which is the final release-ready reasoning-aware description intended for benchmarking and research use.

## Video Files

The accompanying `videos/` directory stores the underlying video files as flat filenames such as:

- `1016223889.mp4`
- `74225.webm`

In the JSON, some ids may appear as path-like values such as `112/1016223889`. In those cases, the actual file in `videos/` is matched by the final path segment, for example:

- `112/1016223889` -> `videos/1016223889.mp4`

All `video_source` and `video_target` entries in the current release were verified to have matching files in `videos/`.

## Example Instance

```json
{
  "id": 0,
  "video_source": "112/1016223889",
  "video_target": "112/1016223877",
  "description_source": "...",
  "description_target": "...",
  "modification_text": "...",
  "reasoned_target_video_description__main": "...",
  "id_original": "..."
}
```

## Loading the Dataset

The release JSON stores two top-level groups, `webvid` and `ss2`. A simple way to load it with Hugging Face `datasets` is:

```python
import json

from datasets import Dataset, DatasetDict

# Read the release JSON from the local working directory.
with open("merged_webvid_ss2.json", "r") as f:
    raw = json.load(f)

# raw is indexed as a two-element list: raw[0] holds the webvid group,
# raw[1] holds the ss2 group.
webvid = Dataset.from_list(raw[0]["webvid"])
ss2 = Dataset.from_list(raw[1]["ss2"])

# Expose the two groups as named splits.
dataset = DatasetDict({
    "webvid": webvid,
    "ss2": ss2,
})

print(dataset["webvid"][0])
```

If you prefer, you can also flatten both groups into a single evaluation set.

## Intended Use

This dataset is intended for research and evaluation on:

- Reasoning-aware composed video retrieval
- Multimodal retrieval with implicit edit understanding
- Video-language evaluation focused on temporal and causal effects

It is especially useful when studying whether a system can infer what should happen after an edit, rather than only matching literal words in the edit text.

## Limitations

- The dataset is intended primarily as a benchmark, not a comprehensive real-world distribution of edited video requests.
- Reasoning-aware descriptions are curated artifacts and may reflect annotation choices made for evaluation.
- Performance on CoVR-R should not be interpreted as broad real-world competence on all video reasoning tasks.
- Models may still exploit superficial cues unless evaluation protocols are designed carefully.

## Citation

If you use this dataset, please cite:

```bibtex
@inproceedings{thawakar2026covrr,
  title     = {CoVR-R: Reason-Aware Composed Video Retrieval},
  author    = {Thawakar, Omkar and Demidov, Dmitry and Potlapalli, Vaishnav and Bogireddy, Sai Prasanna Teja Reddy and Gajjala, Viswanatha Reddy and Lasheen, Alaa Mostafa and Anwer, Rao Muhammad and Khan, Fahad Shahbaz},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
  year      = {2026}
}
```

## Acknowledgments

CoVR-R is introduced by researchers from Mohamed bin Zayed University of Artificial Intelligence, University of Chicago, University of Wisconsin-Madison, and Linkoping University.
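
## Usage Sketch: Resolving Video Files

The file-matching rule described under Video Files (an id such as `112/1016223889` maps to the file in `videos/` whose name matches the final path segment) and the optional flattening of both groups can be sketched as follows. This is a minimal illustration, not part of the official release: the helper names `index_videos`, `resolve_video`, and `flatten_groups`, and the added `group` field, are hypothetical, and the sketch assumes each file stem in `videos/` is unique.

```python
from pathlib import Path
from typing import Dict, List, Optional


def index_videos(videos_dir: str) -> Dict[str, Path]:
    """Map each file stem in videos_dir (e.g. "1016223889") to its path.

    Hypothetical helper; assumes stems are unique across extensions.
    """
    return {p.stem: p for p in Path(videos_dir).iterdir() if p.is_file()}


def resolve_video(video_id, index: Dict[str, Path]) -> Optional[Path]:
    """Return the file for a JSON id, matching on the final path segment.

    Ids may be plain ("74225") or path-like ("112/1016223889").
    Returns None when no file with that stem is indexed.
    """
    key = str(video_id).rsplit("/", 1)[-1]
    return index.get(key)


def flatten_groups(groups: Dict[str, List[dict]]) -> List[dict]:
    """Merge per-group example lists into one list.

    Each row is copied and tagged with a "group" field (an assumption of
    this sketch, not a field in the released JSON) so its origin is kept.
    """
    return [
        dict(example, group=name)
        for name, examples in groups.items()
        for example in examples
    ]
```

With an index built once via `index_videos("videos")`, each example's `video_source` and `video_target` can be passed to `resolve_video` to obtain a local path. Looking files up by stem avoids guessing the extension, since the release mixes `.mp4` and `.webm` files.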

Provided by:

omkarthawakar
