JackMuX3Y/OmniRAG-Agent
Hugging Face · Updated 2026-03-21 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/JackMuX3Y/OmniRAG-Agent
Description:
---
language:
- en
license: apache-2.0
task_categories:
- visual-question-answering
- video-classification
tags:
- video-qa
- multimodal
- audio-visual
- rag
- reinforcement-learning
- grpo
- omnimodal
size_categories:
- 1K<n<10K
configs:
- config_name: VideoOmniBench
  data_files:
  - split: train
    path: VideoOmniBench/train.parquet
- config_name: WorldSense
  data_files:
  - split: train
    path: WorldSense/train.parquet
- config_name: Daily-Omni
  data_files:
  - split: train
    path: Daily-Omni/train.parquet
---
# OmniRAG-Agent Dataset
Training data for **OmniRAG-Agent**: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering \[[paper](https://arxiv.org/abs/2602.03707)\]
---
## Dataset Preview
| Split | VideoOmniBench | WorldSense | Daily-Omni | Total |
|-------|---------------|------------|------------|-------|
| train | 504 | 504 | 504 | **1,512** |
---
## Dataset Structure
```
OmniRAG-Agent/
├── VideoOmniBench/
│   └── train.parquet     # 504 samples: visual perception & reasoning
├── WorldSense/
│   └── train.parquet     # 504 samples: audio-visual understanding
└── Daily-Omni/
    └── train.parquet     # 504 samples: omnimodal event understanding
```
---
## Dataset Overview
| Sub-dataset | Samples | Domain | Ability Focus |
|-----------------|---------|---------------------------------|---------------|
| VideoOmniBench | 504 | Video QA | Counting, Temporal, Causal, Spatial, Ego Reasoning, etc. |
| WorldSense | 504 | Audio-Visual QA | Audio Recognition, Emotion, Spatial Relation, Event Sorting, etc. |
| Daily-Omni | 504 | Omnimodal Event QA | Event Sequence, AV Alignment, Inference, Reasoning, etc. |
All samples are **4-choice multiple-choice questions (A/B/C/D)** over short video clips (downsampled at 5-second intervals).
---
## Data Format
Each parquet file contains the following fields:
```json
{
  "id": "video_109__q0000",
  "data_source": "OmniVideoBench",
  "prompt": [
    {
      "role": "user",
      "content": "<audio> <video>How many athletes participating in the race were shown in the video?"
    }
  ],
  "images": [],
  "audios": [],
  "videos": [
    {
      "type": "video",
      "video": "data/videoomnibench/videos_downsampled/video_109_t5s.mp4"
    }
  ],
  "fps": 0,
  "max_frames": 0,
  "min_frames": 0,
  "ability": "counting",
  "target": "D",
  "style": "",
  "answer": "5",
  "options": ["A.2", "B.3", "C.4", "D.5"],
  "correct_option": "D",
  "index": 0,
  "original_id": "video_109__q0000",
  "question": "How many athletes participating in the race were shown in the video?",
  "split": "train"
}
```
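The example record keeps the question and its options in separate fields, and the `prompt` content shown contains only the modality placeholders plus the question. How the options are presented to the model is not specified here, so the sketch below shows just one possible way to assemble a multiple-choice prompt. The helper name `build_mcq_prompt` and the closing instruction text are illustrative assumptions, not part of the dataset or of the OmniRAG-Agent code.

```python
def build_mcq_prompt(sample: dict) -> str:
    """Join the prompt content and the options into one text block.

    Hypothetical helper: the layout and the closing instruction are
    assumptions, not the format used by the original training code.
    """
    # prompt[0]["content"] already carries the <audio>/<video> placeholders.
    header = sample["prompt"][0]["content"]
    # Options are stored with their letter prefix, e.g. "A.2".
    choices = "\n".join(sample["options"])
    return f"{header}\n{choices}\nAnswer with a single letter (A, B, C, or D)."


sample = {
    "prompt": [
        {
            "role": "user",
            "content": "<audio> <video>How many athletes participating "
                       "in the race were shown in the video?",
        }
    ],
    "options": ["A.2", "B.3", "C.4", "D.5"],
}
print(build_mcq_prompt(sample))
```

Keeping the letter prefixes from `options` untouched means the model's reply can be compared directly against `correct_option`.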
---
## Field Descriptions
| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Unique sample identifier (`<video_id>__<question_id>`) |
| `data_source` | string | Source dataset name (always `OmniVideoBench`) |
| `prompt` | list[dict] | Conversation turns with `role` and `content`; content includes `<audio>` and `<video>` placeholders |
| `images` | list | Reserved for image inputs (empty in current split) |
| `audios` | list | Reserved for standalone audio inputs (audio is embedded in video) |
| `videos` | list[dict] | Video input: `{"type": "video", "video": "<relative_path>"}` |
| `fps` | int | Frame sampling rate (`0` = use model default) |
| `max_frames` | int | Maximum frames to sample (`0` = use model default) |
| `min_frames` | int | Minimum frames to sample (`0` = use model default) |
| `ability` | string | Fine-grained ability category (see below) |
| `target` | string | Correct answer letter: `A`, `B`, `C`, or `D` |
| `style` | string | Question style tag (reserved) |
| `answer` | string | Ground-truth answer text |
| `options` | list[str] | Four answer choices, e.g. `["A.xxx", "B.xxx", "C.xxx", "D.xxx"]` |
| `correct_option` | string | Correct option letter (`A`/`B`/`C`/`D`) |
| `index` | int | Sample index within sub-dataset |
| `original_id` | string | Original ID from the source dataset |
| `question` | string | Plain-text question (without modality tokens) |
| `split` | string | Dataset split (`train`) |
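Several fields in the table are redundant with one another (`target` and `correct_option`, or `answer` and the text of the correct entry in `options`), which makes a cheap sanity check possible. The following is a minimal sketch of such a check. The function name `check_sample` is hypothetical, and the rule that the option text after the `<letter>.` prefix equals `answer` is inferred from the single example record above, so treat it as an assumption.

```python
LETTERS = ["A", "B", "C", "D"]


def check_sample(sample: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    options = sample["options"]
    letter = sample["correct_option"]
    if len(options) != 4:
        problems.append(f"expected 4 options, got {len(options)}")
    if letter not in LETTERS:
        problems.append(f"unexpected correct_option: {letter!r}")
    if sample["target"] != letter:
        problems.append("target and correct_option disagree")
    if not problems:
        # Options look like "D.5": drop the "<letter>." prefix before comparing.
        chosen = options[LETTERS.index(letter)]
        text = chosen.split(".", 1)[-1].strip()
        if text != sample["answer"].strip():
            problems.append("answer text does not match the correct option")
    return problems


example = {
    "target": "D",
    "answer": "5",
    "options": ["A.2", "B.3", "C.4", "D.5"],
    "correct_option": "D",
}
print(check_sample(example))  # []
```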
---
## Ability Categories
### VideoOmniBench
| Ability | Description |
|---------|-------------|
| `counting` | Count entities or events in the video |
| `fine-grained perception` | Identify subtle visual details |
| `ego reasoning` | First-person perspective understanding |
| `sentiment analysis` | Detect sentiment or emotion from video |
| `temporal understanding` | Reason about time order and duration |
| `causal reasoning` | Infer cause-and-effect relationships |
| `reference reasoning` | Resolve co-references across frames |
| `summarization` | Summarize the overall video content |
| `background & music understanding` | Understand scene background and music |
| `spatial understanding` | Reason about spatial relationships |
### WorldSense
| Ability | Description |
|---------|-------------|
| `Audio Recognition` | Identify sounds, instruments, or speech |
| `Event Recognition` | Recognise what event is happening |
| `Object Counting` | Count objects visible or audible |
| `Spatial Relation` | Determine spatial positions |
| `Audio Source Localization` | Locate the source of a sound |
| `Video Emotions` | Infer overall emotional tone |
| `Emotion Change` | Detect change in emotion over time |
| `Action Counting` | Count repeated actions |
| `Event Sorting` | Order events chronologically |
| `Audio Change` | Detect changes in audio over time |
### Daily-Omni
| Ability | Description |
|---------|-------------|
| `Event Sequence` | Order audio-visual events correctly |
| `AV Event Alignment` | Match audio events to visual moments |
| `Inference` | Infer implicit information |
| `Reasoning` | Multi-step logical reasoning |
| `Context understanding` | Understand overall context |
| `Comparative` | Compare attributes across clips |
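To see how the records are distributed over these categories, the `ability` field can be tallied directly. This is a small sketch that works on any iterable of dict-like rows; the helper name `ability_counts` and the toy rows are illustrative. Labels are matched verbatim, so the lowercase VideoOmniBench labels and the title-case WorldSense and Daily-Omni labels stay distinct.

```python
from collections import Counter


def ability_counts(samples) -> Counter:
    """Count records per value of the `ability` field."""
    return Counter(row["ability"] for row in samples)


# Toy rows standing in for real records.
rows = [
    {"ability": "counting"},
    {"ability": "Event Sorting"},
    {"ability": "counting"},
]
for ability, n in ability_counts(rows).most_common():
    print(f"{ability}: {n}")
```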
---
## Video File Structure
Videos are stored as clips downsampled at 5-second intervals (suffix `_t5s.mp4`) and are referenced by **relative paths**:
```
data/
├── videoomnibench/
│   └── videos_downsampled/
│       └── video_<id>_t5s.mp4
├── WorldSense/
│   └── videos_downsampled/
│       └── <video_id>_t5s.mp4
└── Daily-Omni/
    └── videos_downsampled/
        └── <video_id>_video_t5s.mp4
```
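Because the `videos` entries hold relative paths, they have to be joined with the directory that contains `data/` before a file can be opened. A minimal sketch follows, assuming the videos have been placed under some local root. The helper names `resolve_video_paths` and `missing_videos`, and the placeholder root `/path/to/root`, are hypothetical.

```python
from pathlib import Path


def resolve_video_paths(sample: dict, root: str) -> list[Path]:
    """Prefix every relative video path in a record with `root`."""
    return [Path(root) / entry["video"] for entry in sample["videos"]]


def missing_videos(samples, root: str) -> list[Path]:
    """List resolved video paths that do not exist on disk."""
    return [
        path
        for sample in samples
        for path in resolve_video_paths(sample, root)
        if not path.is_file()
    ]


sample = {
    "videos": [
        {
            "type": "video",
            "video": "data/videoomnibench/videos_downsampled/video_109_t5s.mp4",
        }
    ]
}
# "/path/to/root" is a placeholder for wherever the video tree lives locally.
print(resolve_video_paths(sample, "/path/to/root")[0])
```

Running `missing_videos` over a whole sub-dataset before training is a quick way to catch an incomplete download.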
---
## Usage
```python
from datasets import concatenate_datasets, load_dataset

# Paths below are relative to a local copy of the dataset repository.

# Load a single sub-dataset
vob = load_dataset(
    "parquet",
    data_files="VideoOmniBench/train.parquet",
    split="train",
)

# Load all three sub-datasets and concatenate them
subsets = ["VideoOmniBench", "WorldSense", "Daily-Omni"]
parts = [
    load_dataset("parquet", data_files=f"{s}/train.parquet", split="train")
    for s in subsets
]
full_dataset = concatenate_datasets(parts)
print(f"Total samples: {len(full_dataset)}")  # 1512
```
### Accessing a sample
```python
sample = vob[0]
print(sample["question"]) # plain-text question
print(sample["options"]) # ['A.xxx', 'B.xxx', 'C.xxx', 'D.xxx']
print(sample["correct_option"]) # 'D'
print(sample["videos"][0]["video"]) # relative video path
```
---
## Evaluation Metric
All sub-datasets use **multiple-choice accuracy** (exact match of the predicted option letter against `correct_option`) as the primary evaluation metric.
| Metric | Description |
|--------|-------------|
| Accuracy | `correct_option` exact match |
| Per-ability Accuracy | Accuracy broken down by `ability` field |
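Both metrics reduce to counting exact matches, overall and grouped by `ability`. The sketch below is one way to compute them and is not the project's evaluation script. The function name `score`, the `id`-to-letter prediction mapping, the toy ids, and the choice to count a missing prediction as wrong are all assumptions.

```python
from collections import defaultdict


def score(samples, predictions: dict) -> tuple[float, dict]:
    """Return (overall accuracy, per-ability accuracy).

    `predictions` maps a sample `id` to a predicted letter; a missing
    prediction counts as wrong. Both conventions are assumptions.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in samples:
        totals[row["ability"]] += 1
        if predictions.get(row["id"]) == row["correct_option"]:
            hits[row["ability"]] += 1
    n = sum(totals.values())
    overall = sum(hits.values()) / n if n else 0.0
    per_ability = {a: hits[a] / totals[a] for a in totals}
    return overall, per_ability


# Toy rows with illustrative ids, standing in for real records.
rows = [
    {"id": "q1", "ability": "counting", "correct_option": "D"},
    {"id": "q2", "ability": "counting", "correct_option": "A"},
    {"id": "q3", "ability": "causal reasoning", "correct_option": "B"},
    {"id": "q4", "ability": "causal reasoning", "correct_option": "C"},
]
preds = {"q1": "D", "q2": "C", "q3": "B", "q4": "C"}
overall, per_ability = score(rows, preds)
print(overall)      # 0.75
print(per_ability)  # {'counting': 0.5, 'causal reasoning': 1.0}
```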
---
## Related Resources
- **Paper**: [OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video QA](https://arxiv.org/abs/2602.03707)
- **RAG Server**: FastAPI retrieval service (`retrival_api/retriever.py`) — exposes `/query` (image) and `/query_audio` (audio) endpoints
- **Training Script**: `examples/grpo_trainer/run_omni_searchqa.sh` — GRPO training with Qwen2.5-Omni-3B
- **Evaluation Script**: `omni_batch_eval.py` — agent-loop batch evaluation
---
## Citation
```bibtex
@article{omnirag2025,
title = {OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering},
year = {2026},
url = {https://arxiv.org/abs/2602.03707}
}
```
Provider:
JackMuX3Y



