mr3haque/OmniAgent-Data

Name: mr3haque/OmniAgent-Data
Creator: mr3haque
Published: 2026-04-09 04:09:45
License: Apache 2.0 (per the dataset card)

Source: Hugging Face | Updated: 2026-04-09 | Indexed: 2026-04-12

Download link:

https://hf-mirror.com/datasets/mr3haque/OmniAgent-Data


Resource description:

---
license: apache-2.0
task_categories:
- visual-question-answering
- text-generation
- image-to-text
- text-to-image
- text-to-audio
- text-to-video
language:
- en
tags:
- multimodal
- instruction-tuning
- agentic
- any-to-any
- preference-optimization
- tool-use
- reinforcement-learning
- mm-dpo
- grpo
- simpo
- cvpr
size_categories:
- 100K<n<1M
pretty_name: OmniAgent Complete Training Data
---

# OmniAgent-Data: Complete Training Data

**Md Rezwan Haque** | CPAMI Lab, University of Waterloo

[![Model](https://img.shields.io/badge/Model-HuggingFace-blue)](https://huggingface.co/mr3haque/OmniAgent) [![GitHub](https://img.shields.io/badge/Code-GitHub-black?logo=github)](https://github.com/rezwanh001/OmniAgent) [![Notebook](https://img.shields.io/badge/Demo-Inference-orange)](https://huggingface.co/mr3haque/OmniAgent/blob/main/notebooks/OmniAgent_Inference.ipynb)

This is the **canonical and complete** dataset repository for OmniAgent. It contains **all data** used across the 4-stage training pipeline, including our novel **MAgenIT** dataset and preference data for 6 RL alignment methods.


---

## All Datasets at a Glance

| Dataset | Folder | Samples | Training Stage | Description |
|---|---|---:|:---:|---|
| **MAgenIT** (original) | `magenit/train.jsonl` | 5,000 | 3 (SFT) | Cross-modal agentic instructions (our contribution) |
| **MAgenIT** (augmented) | `magenit/train_augmented.jsonl` | 50,000 | 3 (SFT) | Augmented with cross-modal variations |
| **Preferences** (original) | `preferences/train.jsonl` | 14,444 | 4 (RL) | Human-curated chosen/rejected pairs |
| **Preferences** (augmented) | `preferences/train_augmented.jsonl` | 50,000 | 4 (RL) | Augmented preference pairs |
| Understanding SFT | `understanding_sft/train.jsonl` | 10,000 | 3 (SFT) | Multimodal understanding instructions |
| ToolBench | `toolbench/train.jsonl` | 4,000 | 3 (SFT) | Tool-use instruction data |
| CC3M | `cc3m/train.jsonl` | 100,000 | 1 (Encode) | Image-caption pairs |
| CC3M (real captions) | `cc3m/train_real_captions.jsonl` | 20,000 | 1 (Encode) | Original captions |
| AudioCaps | `audiocaps/train.jsonl` | 49,838 | 1 (Encode) | Audio-caption pairs |
| LLaVA-Instruct | `llava_instruct/train.jsonl` | 394,276 | 1, 3 | Visual instruction tuning |
| LLaVA-Instruct (50K) | `llava_instruct/train_50k.jsonl` | 50,000 | 1 (Encode) | Subset for Stage 1 |
| Decoder embeddings | `decoder_embeddings/train.jsonl` | 71,000 | 2 (Decode) | Precomputed target embeddings |
| WebVid | `webvid/train.jsonl` | -- | 1 (Encode) | Video-caption pairs |
| Benchmarks | `benchmarks/` | 600 | Eval | Held-out evaluation data |

**Total:** ~820K samples across all splits.

---

## MAgenIT: Our Novel Dataset Contribution

**MAgenIT** (Multimodal Agentic Instruction Tuning) is the first dataset designed specifically for training multimodal agents that **reason, plan, generate across modalities, and use tools** within a unified framework.

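
To make the four capabilities named above concrete, here is a minimal sketch that reports which of them a given assistant response exercises. It assumes the special-token conventions listed under Key Statistics in this card (`<THINK>`, `<PLAN>`, `<IMG0>`..., `<AUD0>`..., `<VID0>`..., `<TOOL_CALL>`). The function name `capabilities` and the `PATTERNS` mapping are hypothetical and are not part of the dataset's own tooling.

```python
import re

# Hypothetical mapping from a capability label to the special token that
# signals it. The token spellings come from this card; the labels do not.
PATTERNS = {
    "reason": re.compile(r"<THINK>"),
    "plan": re.compile(r"<PLAN>"),
    "image": re.compile(r"<IMG\d+>"),
    "audio": re.compile(r"<AUD\d+>"),
    "video": re.compile(r"<VID\d+>"),
    "tool": re.compile(r"<TOOL_CALL>"),
}


def capabilities(response: str) -> set:
    """Return the labels whose token pattern occurs in the response text."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(response)}


# Illustrative response, shortened from the style of the sample in this card.
example = (
    "<THINK>I need two outputs.</THINK>\n"
    "<PLAN>\n<STEP> image\n<STEP> audio\n</PLAN>\n"
    "<IMG0> <IMG1>\n<AUD0>"
)
print(sorted(capabilities(example)))  # ['audio', 'image', 'plan', 'reason']
```

A check like this only looks at token presence; it says nothing about whether the plan or the generated content is any good.
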
### 6 Task Categories

| Category | Original | Augmented | What the Model Must Do |
|---|---:|---:|---|
| `text_to_multimodal` | 677 | 10,655 | Generate images + audio from text instructions |
| `multi_step_creation` | 1,500 | 7,907 | Plan multi-step workflows + generate outputs |
| `search_and_generate` | 759 | 7,741 | Use search tools + create visual content |
| `code_and_explain` | 741 | 6,336 | Write code + produce visual explanations |
| `audio_to_image` | 682 | 8,841 | Understand audio + generate matching images |
| `image_to_audio` | 641 | 8,520 | Understand images + generate matching audio |
| **Total** | **5,000** | **50,000** | |

### Key Statistics

- **Avg modalities per sample:** 2.7
- **Avg agentic steps per sample:** 3.2
- **Special tokens used:** `<THINK>`, `<PLAN>`, `<STEP>`, `<IMG0>`-`<IMG4>`, `<AUD0>`-`<AUD8>`, `<VID0>`-`<VID3>`, `<TOOL_CALL>`, `<ACT>`
- **Every sample requires reasoning** (begins with `<THINK>`)

### Sample

```json
{
  "id": "magenit_246316",
  "conversations": [
    {
      "role": "user",
      "content": "Create both an image and audio illustrating: ancient wisdom and modern chaos"
    },
    {
      "role": "assistant",
      "content": "<THINK>I need to create two complementary outputs.</THINK>\n\n<PLAN>\nGoal: Multimodal illustration\n<STEP> Generate descriptive image\n<STEP> Create matching audio\n</PLAN>\n\nVisual:\n<IMG0> <IMG1> <IMG2> <IMG3> <IMG4>\n\nAudio:\n<AUD0> <AUD1> <AUD2> <AUD3>"
    }
  ],
  "category": "text_to_multimodal"
}
```

---

## Preference Data for RL Alignment

Used across all **6 RL methods** (DPO, SimPO, GRPO, GRPO++, Online GRPO, MM-DPO).

### Sample

```json
{
  "prompt": "Create a multimedia story about a bird migrating...",
  "chosen": "<THINK>Complex creative task...</THINK>\n<PLAN>...\n<IMG0>...<AUD0>...",
  "rejected": "I can help with that. Here is some information.",
  "category": "multi_step_creation",
  "degradation": "generic",
  "reward_chosen": 0.755,
  "reward_rejected": 0.383
}
```

**Degradation types:** `generic` (vague response), `incomplete` (missing modalities), `wrong_modality`, `no_planning` (skips `<THINK>`/`<PLAN>`), `wrong_tool`.

---

## Training Pipeline

```
Stage 1 (Encode)  → cc3m/ + audiocaps/ + llava_instruct/train_50k.jsonl
Stage 2 (Decode)  → decoder_embeddings/
Stage 3 (SFT)     → magenit/train_augmented.jsonl + understanding_sft/ + toolbench/
Stage 4 (RL) × 6  → preferences/train_augmented.jsonl
Evaluation        → benchmarks/
```

---

## Usage

```python
from datasets import load_dataset

# MAgenIT (our novel dataset, 50K)
magenit = load_dataset(
    "mr3haque/OmniAgent-Data",
    data_files="magenit/train_augmented.jsonl",
    split="train",
)
print(f"Samples: {len(magenit)}, Categories: {set(magenit['category'])}")

# Preferences for RL (50K)
prefs = load_dataset(
    "mr3haque/OmniAgent-Data",
    data_files="preferences/train_augmented.jsonl",
    split="train",
)

# Stage 1 encoding data
cc3m = load_dataset(
    "mr3haque/OmniAgent-Data",
    data_files="cc3m/train.jsonl",
    split="train",
)
```

---

## Results Achieved with This Data

| Method | PPL ↓ | CMTS ↑ | ACI ↑ | Novel Avg ↑ |
|---|:---:|:---:|:---:|:---:|
| SFT (Stage 3) | 1.92 | 0.931 | 0.817 | 0.747 |
| + SimPO | **1.75** | **0.939** | 0.817 | **0.781** |
| + **MM-DPO (Ours)** | 2.30 | 0.920 | **0.917** | 0.714 |
| + GRPO++ (Ours) | 130.5 | 0.538 | 0.833 | 0.686 |

Full results with all 6 methods: [model card](https://huggingface.co/mr3haque/OmniAgent).

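
For readers who want to inspect the preference data before running Stage 4, the following is a minimal sketch of working with the `reward_chosen`, `reward_rejected`, and `degradation` fields shown in the preference sample. The helper names (`margin`, `mean_margin_by_degradation`, `filter_by_margin`) and the `min_margin` threshold are hypothetical, and the second record below is made up for illustration; only the first uses the values from the card's sample. Nothing here is confirmed to be part of the authors' training code.

```python
from collections import defaultdict


def margin(record: dict) -> float:
    """Reward gap between the chosen and the rejected response."""
    return record["reward_chosen"] - record["reward_rejected"]


def mean_margin_by_degradation(records: list) -> dict:
    """Average reward gap for each degradation type present in records."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        key = record["degradation"]
        sums[key] += margin(record)
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def filter_by_margin(records: list, min_margin: float) -> list:
    """Keep pairs whose reward gap is at least min_margin (hypothetical cutoff)."""
    return [record for record in records if margin(record) >= min_margin]


records = [
    # Reward values taken from the preference sample in this card.
    {"degradation": "generic", "reward_chosen": 0.755, "reward_rejected": 0.383},
    # Made-up record, for illustration only.
    {"degradation": "no_planning", "reward_chosen": 0.80, "reward_rejected": 0.75},
]

print(mean_margin_by_degradation(records))
print(len(filter_by_margin(records, min_margin=0.1)))
```

With a dataset loaded through `load_dataset`, the same functions should work on a list of row dictionaries, for example `list(prefs)`, assuming the loaded rows expose these field names.
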

---

## Links

| Resource | URL |
|---|---|
| **Model (all checkpoints)** | [mr3haque/OmniAgent](https://huggingface.co/mr3haque/OmniAgent) |
| **Code** | [github.com/rezwanh001/OmniAgent](https://github.com/rezwanh001/OmniAgent) |
| **Inference notebook** | [OmniAgent_Inference.ipynb](https://huggingface.co/mr3haque/OmniAgent/blob/main/notebooks/OmniAgent_Inference.ipynb) |

---

## Citation

```bibtex
@inproceedings{haque2026omniagent,
  title={OmniAgent: Unified Multimodal Agent with RL Alignment for Any-to-Any Generation},
  author={Haque, Md Rezwan},
  booktitle={CVPR},
  year={2026}
}
```

## License

Apache 2.0. CC3M, AudioCaps, LLaVA-Instruct, and WebVid retain their original licenses.

**CPAMI Lab, University of Waterloo** | 2x NVIDIA RTX A6000

Provider:

mr3haque
