jankin123/OmniVideo-R1

Name: jankin123/OmniVideo-R1
Creator: jankin123
Published: 2026-03-11 08:21:42
License: apache-2.0

Source: Hugging Face. Updated 2026-03-11; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/jankin123/OmniVideo-R1

Description:

---
license: apache-2.0
task_categories:
- video-text-to-text
- question-answering
- visual-question-answering
language:
- en
tags:
- audio-visual
- video-understanding
- multimodal
- reinforcement-learning
- reasoning
pretty_name: OmniVideo-R1 Training Data
size_categories:
- 10K<n<100K
---

# OmniVideo-R1 Training Data

Training data for **OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention**.

[![arXiv](https://img.shields.io/badge/arXiv-PDF-red)](https://arxiv.org/abs/2602.05847)
[![GitHub](https://img.shields.io/badge/GitHub-Code-blue)](https://github.com/zhangquanchen/OmniVideo-R1)

## Dataset Description

This dataset contains the preprocessed training data used in the OmniVideo-R1 framework, which improves mixed-modality (audio + video) reasoning through two training stages:

- **Query-Intensive (QI) Grounding Stage**: Large-scale audio-visual QA data for building strong query-grounded understanding.
- **Modality-Attentive (MA) Fusion Stage**: Curated subset for learning fine-grained cross-modal fusion.

## Dataset Files

| File | Samples | Description |
|------|---------|-------------|
| `merged_train_all_qi.jsonl` | 88,173 | QI stage training data |
| `merged_train_fusion_ma.jsonl` | 12,887 | MA stage training data |

## Data Format

Each line is a JSON object with the following fields:

| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Unique sample identifier |
| `Type` | string | Task/question type |
| `messages` | list | Conversation in `[{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]` format |
| `problem` | string | The question/prompt text |
| `solution` | string | The ground-truth answer |
| `videos` | list[string] | Relative path(s) to video file(s) |
| `audios` | list[string] | Relative path(s) to audio file(s) |
| `data_source` | string | Source dataset identifier |

### Example

```json
{
  "id": "209-Vyrq2i3R_Sk-split_4",
  "Type": "0_30_s_academic_mc_v0_1_qa_processed",
  "messages": [
    {
      "role": "user",
      "content": "<video><audio>\nWhere does the video take place?\nA. A busy street\nB. A quiet library\nC. A bustling kitchen\nD. A cozy living room\nPlease respond with only the letter of the correct answer."
    },
    {
      "role": "assistant",
      "content": "C."
    }
  ],
  "problem": "<video><audio>\nWhere does the video take place?\nA. A busy street\nB. A quiet library\nC. A bustling kitchen\nD. A cozy living room\nPlease respond with only the letter of the correct answer.",
  "solution": "C.",
  "videos": ["./data/LLaVA-Video-178K/0_30_s_academic_v0_1/academic_source/youcook2/209/Vyrq2i3R_Sk/split_4.mp4"],
  "audios": ["./data/LLaVA-Video-178K-audios/0_30_s_academic_v0_1/academic_source/youcook2/209/Vyrq2i3R_Sk/split_4.mp3"],
  "data_source": "0_30_s_academic_v0_1"
}
```

## Data Sources

The training annotations are derived from the following video datasets:

| Source | QI Samples | MA Samples |
|--------|-----------|-----------|
| LLaVA-Video-178K (YouTube) | 51,324 | 5,699 |
| VideoVista | 23,767 | 5,375 |
| LLaVA-Video-178K (Academic) | 9,783 | 609 |
| PerceptionTest | 1,394 | 95 |
| ActivityNetQA | 1,414 | 40 |
| NextQA | 491 | 63 |
| **Total** | **88,173** | **12,887** |

## Task Types

The dataset covers diverse audio-visual understanding tasks:

- **Multiple Choice QA** (`mc`): Select the correct answer from given options.
- **Open-Ended QA** (`oe`): Generate free-form answers to questions.
- **Video Description**: Brief description, detailed description, event description.
- **Temporal Understanding**: Object temporal location, event sequences, action sequence.
- **Spatial Understanding**: Object spatial location, object spatial relation, spatial tracking.
- **Action Understanding**: Action recognition, action prediction, action location.
- **Object Understanding**: Object existence, object counting.

## Video & Audio Files

> **Note**: This dataset contains only the **annotation files** (JSONL). The raw video and audio files should be downloaded separately from:
>
> - 🎬 [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K)
> - 🎬 [VideoVista_Train](https://huggingface.co/datasets/Uni-MoE/VideoVista_Train)
>
> Audio tracks (`.mp3`) should be extracted from the downloaded videos. The `videos` and `audios` fields in each sample contain the relative paths to these files.
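
Since the audio tracks are not shipped and have to be extracted from the downloaded videos, the following is a minimal sketch of how the extraction could be planned for one sample. It is not part of the OmniVideo-R1 tooling. The helper names (`media_paths`, `ffmpeg_extract_cmd`, `plan_audio_extraction`) are hypothetical, the pairing of `videos[i]` with `audios[i]` is an assumption based on the example record, and the command assumes an `ffmpeg` build with the `libmp3lame` encoder. The code only builds the commands; it does not run them.

```python
from pathlib import Path


def media_paths(sample: dict, root: Path) -> tuple[list[Path], list[Path]]:
    """Resolve the relative `videos` and `audios` entries against a local root."""
    videos = [root / p for p in sample.get("videos", [])]
    audios = [root / p for p in sample.get("audios", [])]
    return videos, audios


def ffmpeg_extract_cmd(video: Path, audio: Path) -> list[str]:
    """Build (but do not run) an ffmpeg command that writes the audio track as MP3."""
    return [
        "ffmpeg",
        "-i", str(video),       # input video file
        "-vn",                  # ignore the video stream
        "-c:a", "libmp3lame",   # encode the audio stream as MP3
        str(audio),             # output path taken from the `audios` field
    ]


def plan_audio_extraction(sample: dict, root: Path) -> list[list[str]]:
    """Return commands for audio files that are missing while their video exists.

    Assumption: videos[i] and audios[i] refer to the same clip.
    """
    videos, audios = media_paths(sample, root)
    return [
        ffmpeg_extract_cmd(video, audio)
        for video, audio in zip(videos, audios)
        if video.exists() and not audio.exists()
    ]
```

Each returned command can be passed to `subprocess.run(cmd, check=True)` after creating the parent directory of the output path, for example with `Path(cmd[-1]).parent.mkdir(parents=True, exist_ok=True)`.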

## Usage

```python
import json

# Load the QI stage data
qi_data = []
with open("merged_train_all_qi.jsonl", "r") as f:
    for line in f:
        qi_data.append(json.loads(line))

print(f"QI samples: {len(qi_data)}")
print(f"First sample question: {qi_data[0]['problem'][:100]}...")
```

## Citation

```bibtex
@article{chen2026omnivideo,
  title={OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention},
  author={Chen, Zhangquan and Tao, Jiale and Li, Ruihuang and Hu, Yihao and Chen, Ruitao and Yang, Zhantao and Yu, Xinlei and Jing, Haodong and Zhang, Manyuan and Shao, Shuai and others},
  journal={arXiv preprint arXiv:2602.05847},
  year={2026}
}
```

## License

This dataset is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
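
As a complement to the loading snippet in the Usage section, a quick sanity check after download is to tally samples per `data_source` and flag records that lack any field listed under Data Format. The sketch below is an illustration, not part of the released code. `summarize_jsonl` and `EXPECTED_FIELDS` are hypothetical names, and the field set is copied from the Data Format table.

```python
import json
from collections import Counter

# Field names as documented in the Data Format table.
EXPECTED_FIELDS = {
    "id", "Type", "messages", "problem",
    "solution", "videos", "audios", "data_source",
}


def summarize_jsonl(lines):
    """Count samples per `data_source` and collect ids of incomplete records.

    `lines` is any iterable of JSONL strings, such as an open file handle.
    """
    per_source = Counter()
    incomplete = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        per_source[record.get("data_source", "<missing>")] += 1
        if not EXPECTED_FIELDS.issubset(record):
            incomplete.append(record.get("id"))
    return per_source, incomplete
```

To use it, open one of the annotation files (for example `merged_train_fusion_ma.jsonl`) and pass the file handle to `summarize_jsonl`. The resulting counts can then be compared with the Dataset Files and Data Sources tables.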

Provider:

jankin123
