mr3haque/OmniAgent-Data
Source: Hugging Face | Updated: 2026-04-09 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/mr3haque/OmniAgent-Data
Resource description:
---
license: apache-2.0
task_categories:
- visual-question-answering
- text-generation
- image-to-text
- text-to-image
- text-to-audio
- text-to-video
language:
- en
tags:
- multimodal
- instruction-tuning
- agentic
- any-to-any
- preference-optimization
- tool-use
- reinforcement-learning
- mm-dpo
- grpo
- simpo
- cvpr
size_categories:
- 100K<n<1M
pretty_name: OmniAgent Complete Training Data
---
# OmniAgent-Data: Complete Training Data
**Md Rezwan Haque** | CPAMI Lab, University of Waterloo
[Model](https://huggingface.co/mr3haque/OmniAgent)
[Code](https://github.com/rezwanh001/OmniAgent)
[Inference notebook](https://huggingface.co/mr3haque/OmniAgent/blob/main/notebooks/OmniAgent_Inference.ipynb)
This is the **canonical and complete** dataset repository for OmniAgent. It contains **all data** used across the 4-stage training pipeline, including our novel **MAgenIT** dataset and preference data for 6 RL alignment methods.
---
## All Datasets at a Glance
| Dataset | Folder | Samples | Training Stage | Description |
|---|---|---:|:---:|---|
| **MAgenIT** (original) | `magenit/train.jsonl` | 5,000 | 3 (SFT) | Cross-modal agentic instructions (our contribution) |
| **MAgenIT** (augmented) | `magenit/train_augmented.jsonl` | 50,000 | 3 (SFT) | Augmented with cross-modal variations |
| **Preferences** (original) | `preferences/train.jsonl` | 14,444 | 4 (RL) | Human-curated chosen/rejected pairs |
| **Preferences** (augmented) | `preferences/train_augmented.jsonl` | 50,000 | 4 (RL) | Augmented preference pairs |
| Understanding SFT | `understanding_sft/train.jsonl` | 10,000 | 3 (SFT) | Multimodal understanding instructions |
| ToolBench | `toolbench/train.jsonl` | 4,000 | 3 (SFT) | Tool-use instruction data |
| CC3M | `cc3m/train.jsonl` | 100,000 | 1 (Encode) | Image-caption pairs |
| CC3M (real captions) | `cc3m/train_real_captions.jsonl` | 20,000 | 1 (Encode) | Original captions |
| AudioCaps | `audiocaps/train.jsonl` | 49,838 | 1 (Encode) | Audio-caption pairs |
| LLaVA-Instruct | `llava_instruct/train.jsonl` | 394,276 | 1, 3 | Visual instruction tuning |
| LLaVA-Instruct (50K) | `llava_instruct/train_50k.jsonl` | 50,000 | 1 (Encode) | Subset for Stage 1 |
| Decoder embeddings | `decoder_embeddings/train.jsonl` | 71,000 | 2 (Decode) | Precomputed target embeddings |
| WebVid | `webvid/train.jsonl` | -- | 1 (Encode) | Video-caption pairs |
| Benchmarks | `benchmarks/` | 600 | Eval | Held-out evaluation data |
**Total:** ~820K samples across all splits.
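As a quick sanity check on the total, the per-file counts from the table can be summed directly. The sketch below copies the numbers from the table and leaves WebVid out, since its count is listed as `--`. The dictionary name `SAMPLE_COUNTS` is only illustrative and is not part of the repository.

```python
# Sample counts copied from the table above.
# WebVid is omitted because the table does not list a count for it.
SAMPLE_COUNTS = {
    "magenit/train.jsonl": 5_000,
    "magenit/train_augmented.jsonl": 50_000,
    "preferences/train.jsonl": 14_444,
    "preferences/train_augmented.jsonl": 50_000,
    "understanding_sft/train.jsonl": 10_000,
    "toolbench/train.jsonl": 4_000,
    "cc3m/train.jsonl": 100_000,
    "cc3m/train_real_captions.jsonl": 20_000,
    "audiocaps/train.jsonl": 49_838,
    "llava_instruct/train.jsonl": 394_276,
    "llava_instruct/train_50k.jsonl": 50_000,
    "decoder_embeddings/train.jsonl": 71_000,
    "benchmarks/": 600,
}

total = sum(SAMPLE_COUNTS.values())
print(f"{total:,} samples")  # 819,158 samples
```

The listed files add up to 819,158 samples, which is consistent with the rounded figure of ~820K.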
---
## MAgenIT: Our Novel Dataset Contribution
**MAgenIT** (Multimodal Agentic Instruction Tuning) is the first dataset designed specifically for training multimodal agents that **reason, plan, generate across modalities, and use tools** within a unified framework.
### 6 Task Categories
| Category | Original | Augmented | What the Model Must Do |
|---|---:|---:|---|
| `text_to_multimodal` | 677 | 10,655 | Generate images + audio from text instructions |
| `multi_step_creation` | 1,500 | 7,907 | Plan multi-step workflows + generate outputs |
| `search_and_generate` | 759 | 7,741 | Use search tools + create visual content |
| `code_and_explain` | 741 | 6,336 | Write code + produce visual explanations |
| `audio_to_image` | 682 | 8,841 | Understand audio + generate matching images |
| `image_to_audio` | 641 | 8,520 | Understand images + generate matching audio |
| **Total** | **5,000** | **50,000** | |
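To see how augmentation changes the category balance, the counts in the table can be turned into shares and per-category growth factors. This is a minimal sketch using only the numbers above. `CATEGORY_COUNTS` is an illustrative name, not something defined in the repository.

```python
# (original, augmented) counts copied from the table above.
CATEGORY_COUNTS = {
    "text_to_multimodal": (677, 10_655),
    "multi_step_creation": (1_500, 7_907),
    "search_and_generate": (759, 7_741),
    "code_and_explain": (741, 6_336),
    "audio_to_image": (682, 8_841),
    "image_to_audio": (641, 8_520),
}

orig_total = sum(orig for orig, _ in CATEGORY_COUNTS.values())
aug_total = sum(aug for _, aug in CATEGORY_COUNTS.values())

for name, (orig, aug) in CATEGORY_COUNTS.items():
    # Share of each split taken by this category, plus the growth factor.
    print(
        f"{name:22s} orig {orig / orig_total:6.1%}  "
        f"aug {aug / aug_total:6.1%}  x{aug / orig:.1f}"
    )
```

For example, `multi_step_creation` makes up 30.0% of the original split but about 15.8% of the augmented one. The numbers suggest that augmentation evens out the category distribution rather than scaling each category by the same factor.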
### Key Statistics
- **Avg modalities per sample:** 2.7
- **Avg agentic steps per sample:** 3.2
- **Special tokens used:** `<THINK>`, `<PLAN>`, `<STEP>`, `<IMG0>`-`<IMG4>`, `<AUD0>`-`<AUD8>`, `<VID0>`-`<VID3>`, `<TOOL_CALL>`, `<ACT>`
- **Every sample requires reasoning** (begins with `<THINK>`)
### Sample
```json
{
"id": "magenit_246316",
"conversations": [
{
"role": "user",
"content": "Create both an image and audio illustrating: ancient wisdom and modern chaos"
},
{
"role": "assistant",
"content": "<THINK>I need to create two complementary outputs.</THINK>\n\n<PLAN>\nGoal: Multimodal illustration\n<STEP> Generate descriptive image\n<STEP> Create matching audio\n</PLAN>\n\nVisual:\n<IMG0> <IMG1> <IMG2> <IMG3> <IMG4>\n\nAudio:\n<AUD0> <AUD1> <AUD2> <AUD3>"
}
],
"category": "text_to_multimodal"
}
```
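The special-token layout of an assistant turn can be inspected with a few regular expressions. The sketch below is not the project's own parser. It assumes that turns follow the layout of the sample above: closing `</THINK>` and `</PLAN>` tags, and `<STEP>` markers inside the plan. The function name `parse_assistant_turn` is hypothetical.

```python
import re


def parse_assistant_turn(content: str) -> dict:
    """Split an assistant turn into reasoning, plan steps, and modality tokens.

    Assumes the tag layout seen in the sample above; other layouts may need
    different patterns.
    """
    think = re.search(r"<THINK>(.*?)</THINK>", content, re.DOTALL)
    plan = re.search(r"<PLAN>(.*?)</PLAN>", content, re.DOTALL)
    steps = []
    if plan:
        # Text before the first <STEP> marker (e.g. "Goal: ...") is skipped.
        steps = [s.strip() for s in plan.group(1).split("<STEP>")[1:]]
    return {
        "starts_with_think": content.lstrip().startswith("<THINK>"),
        "think": think.group(1).strip() if think else None,
        "steps": steps,
        "image_tokens": re.findall(r"<IMG\d+>", content),
        "audio_tokens": re.findall(r"<AUD\d+>", content),
        "video_tokens": re.findall(r"<VID\d+>", content),
    }


# Assistant content copied from the sample above.
content = (
    "<THINK>I need to create two complementary outputs.</THINK>\n\n"
    "<PLAN>\nGoal: Multimodal illustration\n"
    "<STEP> Generate descriptive image\n<STEP> Create matching audio\n</PLAN>\n\n"
    "Visual:\n<IMG0> <IMG1> <IMG2> <IMG3> <IMG4>\n\n"
    "Audio:\n<AUD0> <AUD1> <AUD2> <AUD3>"
)

parsed = parse_assistant_turn(content)
print(parsed["steps"])  # ['Generate descriptive image', 'Create matching audio']
print(len(parsed["image_tokens"]), len(parsed["audio_tokens"]))  # 5 4
```

The `starts_with_think` flag gives a simple way to check the claim that every sample begins with `<THINK>` when iterating over the full file.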
---
## Preference Data for RL Alignment
The preference data is shared by all **6 RL methods** (DPO, SimPO, GRPO, GRPO++, Online GRPO, MM-DPO).
### Sample
```json
{
"prompt": "Create a multimedia story about a bird migrating...",
"chosen": "<THINK>Complex creative task...</THINK>\n<PLAN>...\n<IMG0>...<AUD0>...",
"rejected": "I can help with that. Here is some information.",
"category": "multi_step_creation",
"degradation": "generic",
"reward_chosen": 0.755,
"reward_rejected": 0.383
}
```
**Degradation types:** `generic` (vague response), `incomplete` (missing modalities), `wrong_modality`, `no_planning` (skips `<THINK>`/`<PLAN>`), `wrong_tool`.
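A preference record can be reduced to a plain `prompt`/`chosen`/`rejected` triple, the layout commonly expected by DPO-style trainers, and filtered by reward margin or degradation type. The helpers below are a sketch, not part of the OmniAgent code. The function names and the `min_margin` threshold of 0.1 are assumptions made for illustration.

```python
# Record copied from the sample above.
record = {
    "prompt": "Create a multimedia story about a bird migrating...",
    "chosen": "<THINK>Complex creative task...</THINK>\n<PLAN>...\n<IMG0>...<AUD0>...",
    "rejected": "I can help with that. Here is some information.",
    "category": "multi_step_creation",
    "degradation": "generic",
    "reward_chosen": 0.755,
    "reward_rejected": 0.383,
}


def reward_margin(rec: dict) -> float:
    """Gap between the chosen and rejected reward scores."""
    return rec["reward_chosen"] - rec["reward_rejected"]


def to_pair(rec: dict) -> dict:
    """Keep only the three text fields of a preference pair."""
    return {key: rec[key] for key in ("prompt", "chosen", "rejected")}


def keep(rec: dict, min_margin: float = 0.1, degradations=None) -> bool:
    """Hypothetical filter: optional degradation whitelist plus a minimum margin."""
    if degradations is not None and rec["degradation"] not in degradations:
        return False
    return reward_margin(rec) >= min_margin


margin = reward_margin(record)
print(f"margin = {margin:.3f}")  # margin = 0.372
print(keep(record, degradations={"generic", "no_planning"}))  # True
```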
---
## Training Pipeline
```
Stage 1 (Encode) → cc3m/ + audiocaps/ + llava_instruct/train_50k.jsonl
Stage 2 (Decode) → decoder_embeddings/
Stage 3 (SFT) → magenit/train_augmented.jsonl + understanding_sft/ + toolbench/
Stage 4 (RL) × 6 → preferences/train_augmented.jsonl
Evaluation → benchmarks/
```
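The stage-to-data mapping above can be written as a small lookup table for use in training scripts. This sketch makes two assumptions that are not confirmed by the dataset card. First, the stage keys (`stage1_encode`, etc.) are illustrative labels. Second, a folder entry such as `cc3m/` is resolved to `<folder>/train.jsonl`, based on the file names in the dataset table. The layout of `benchmarks/` is not documented here, so it is left out.

```python
# Entries copied from the pipeline diagram above; keys are illustrative labels.
STAGE_FILES = {
    "stage1_encode": ["cc3m/", "audiocaps/", "llava_instruct/train_50k.jsonl"],
    "stage2_decode": ["decoder_embeddings/"],
    "stage3_sft": ["magenit/train_augmented.jsonl", "understanding_sft/", "toolbench/"],
    "stage4_rl": ["preferences/train_augmented.jsonl"],
}


def data_files_for(stage: str) -> list[str]:
    """Resolve a stage's entries to file paths.

    Assumption: a folder entry maps to its train.jsonl file.
    """
    return [
        entry if entry.endswith(".jsonl") else entry + "train.jsonl"
        for entry in STAGE_FILES[stage]
    ]


print(data_files_for("stage3_sft"))
# ['magenit/train_augmented.jsonl', 'understanding_sft/train.jsonl', 'toolbench/train.jsonl']
```

The resulting list can be passed as `data_files` to `load_dataset` if the files in a stage share a schema. If they do not, load each file separately.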
---
## Usage
```python
from datasets import load_dataset

# MAgenIT (our novel dataset, 50K)
magenit = load_dataset("mr3haque/OmniAgent-Data", data_files="magenit/train_augmented.jsonl", split="train")
print(f"Samples: {len(magenit)}, Categories: {set(magenit['category'])}")

# Preferences for RL (50K)
prefs = load_dataset("mr3haque/OmniAgent-Data", data_files="preferences/train_augmented.jsonl", split="train")

# Stage 1 encoding data
cc3m = load_dataset("mr3haque/OmniAgent-Data", data_files="cc3m/train.jsonl", split="train")
```
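Building on the loading calls above, a common next step is to tally or select samples by their `category` field. The helpers below are a sketch with hypothetical names. `category_counts` works on any iterable of records, while `load_magenit_category` needs network access and the `datasets` package, so it is defined but not called here.

```python
from collections import Counter


def category_counts(records) -> Counter:
    """Tally the `category` field over any iterable of dict-like samples."""
    return Counter(rec["category"] for rec in records)


def load_magenit_category(category: str, repo_id: str = "mr3haque/OmniAgent-Data"):
    """Load augmented MAgenIT and keep one category (downloads the file)."""
    # Imported lazily so category_counts has no third-party dependency.
    from datasets import load_dataset

    ds = load_dataset(repo_id, data_files="magenit/train_augmented.jsonl", split="train")
    return ds.filter(lambda example: example["category"] == category)


# Offline demonstration with a few in-memory records.
demo = [
    {"category": "audio_to_image"},
    {"category": "audio_to_image"},
    {"category": "image_to_audio"},
]
print(category_counts(demo))  # Counter({'audio_to_image': 2, 'image_to_audio': 1})
```

With the dataset downloaded, `category_counts(magenit)` should reproduce the Augmented column of the category table if the file matches the card.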
---
## Results Achieved with This Data
| Method | PPL ↓ | CMTS ↑ | ACI ↑ | Novel Avg ↑ |
|---|:---:|:---:|:---:|:---:|
| SFT (Stage 3) | 1.92 | 0.931 | 0.817 | 0.747 |
| + SimPO | **1.75** | **0.939** | 0.817 | **0.781** |
| + **MM-DPO (Ours)** | 2.30 | 0.920 | **0.917** | 0.714 |
| + GRPO++ (Ours) | 130.5 | 0.538 | 0.833 | 0.686 |
Full results with all 6 methods: [model card](https://huggingface.co/mr3haque/OmniAgent).
---
## Links
| Resource | URL |
|---|---|
| **Model (all checkpoints)** | [mr3haque/OmniAgent](https://huggingface.co/mr3haque/OmniAgent) |
| **Code** | [github.com/rezwanh001/OmniAgent](https://github.com/rezwanh001/OmniAgent) |
| **Inference notebook** | [OmniAgent_Inference.ipynb](https://huggingface.co/mr3haque/OmniAgent/blob/main/notebooks/OmniAgent_Inference.ipynb) |
---
## Citation
```bibtex
@inproceedings{haque2026omniagent,
title={OmniAgent: Unified Multimodal Agent with RL Alignment for Any-to-Any Generation},
author={Haque, Md Rezwan},
booktitle={CVPR},
year={2026}
}
```
## License
Apache 2.0. CC3M, AudioCaps, LLaVA-Instruct, and WebVid retain their original licenses.
**CPAMI Lab, University of Waterloo** | 2x NVIDIA RTX A6000
Provided by:
mr3haque



