OmniAction-LIBERO

Name: OmniAction-LIBERO
Creator: maas
Published: 2026-05-19 18:56:44
License: No description available

ModelScope Community · Updated 2026-05-19 · Indexed 2025-11-03

Download link:

https://modelscope.cn/datasets/openmoss/OmniAction-LIBERO

Resource description:

<div align="center">
<h1> RoboOmni: Proactive Robot Manipulation in Omni-modal Context </h1>
</div>

<p align="center">
📖 <a href="https://arxiv.org/pdf/2510.23763"><strong>arXiv Paper</strong></a> (Accepted to ICLR 2026 🎉) |
🌐 <a href="https://OpenMOSS.github.io/RoboOmni"><strong>Website</strong></a> |
🤗 <a href="https://huggingface.co/fnlp/RoboOmni"><strong>Model</strong></a> |
🤗 <a href="https://huggingface.co/datasets/fnlp/OmniAction"><strong>Dataset</strong></a> |
🛠️ <a href="https://github.com/OpenMOSS/RoboOmni"><strong>Github</strong></a> |
</p>

![logo](https://cdn-uploads.huggingface.co/production/uploads/64c3c631e77ea9f28111172a/Lb55aSaitdpNl1iSC8xrm.png)

---

Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid progress in Vision–Language–Action (VLA) models for robotic manipulation. Although effective in many scenarios, current approaches largely rely on explicit instructions, whereas in real-world interactions, humans rarely issue instructions directly. Effective collaboration requires robots to infer user intentions proactively. In this work, we introduce *cross-modal contextual instructions, a new setting where intent is derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands.* To address this new setting, we present **RoboOmni**, a *Perceiver-Thinker-Talker-Executor* framework based on end-to-end omni-modal LLMs that unifies intention recognition, interaction confirmation, and action execution. RoboOmni fuses auditory and visual signals spatiotemporally for robust intention recognition, while supporting direct speech interaction. To address the absence of training data for proactive intention recognition in robotic manipulation, we build **OmniAction** comprising 140k episodes, 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types. Experiments in simulation and real-world settings show that RoboOmni surpasses text- and ASR-based baselines in success rate, inference speed, intention recognition, and proactive assistance.

---

## 📦 OmniAction Dataset

![data](https://cdn-uploads.huggingface.co/production/uploads/64c3c631e77ea9f28111172a/sIzK3U5RqonwjQCPEglfi.jpeg)

We introduce OmniAction, a large-scale multimodal dataset for contextual instruction following. It comprises 141,162 episodes covering 112 skills and 748 objects, enriched with 5,096 distinct speaker timbres, 2,482 non-verbal sound events, and 640 environmental backgrounds. The dataset spans six categories of contextual instructions (sentiment cues, overlapping voices, non-verbal cues, identity cues, dyadic dialogue, and triadic dialogue), capturing both subtle affective signals and complex multi-party interactions in everyday settings.

- **Format**: RLDS (Reinforcement Learning Datasets standard).
- **Audio**: Sorted according to filename.

---

## ⭐️ Architecture

At the heart of RoboOmni lies the Perceiver-Thinker-Talker-Executor architecture, which unifies multiple modalities (vision, speech, environmental sounds) into a single, seamless framework for robot action execution.

![WechatIMG2567](https://cdn-uploads.huggingface.co/production/uploads/64c3c631e77ea9f28111172a/z5hqTgAPU0BiFtdrKwq8A.jpeg)

## 👋 Citation

If you find our paper and code useful in your research, please cite our paper.

```bibtex
@article{wang2025roboomni,
  title={RoboOmni: Proactive Robot Manipulation in Omni-modal Context},
  author={Siyin Wang and Jinlan Fu and Feihong Liu and Xinzhe He and Huangxuan Wu and Junhao Shi and Kexin Huang and Zhaoye Fei and Jingjing Gong and Zuxuan Wu and Yugang Jiang and See-Kiong Ng and Tat-Seng Chua and Xipeng Qiu},
  journal={arXiv preprint arXiv:2510.23763},
  year={2025},
  url={https://arxiv.org/abs/2510.23763},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
}
```
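The dataset card lists RLDS as the storage format. As a rough illustration of what that implies, the sketch below builds a placeholder episode in the RLDS layout (an episode holds a sequence of `steps`, each with the standard per-step fields) and checks it for missing fields. The observation keys (`image`, `audio`), the 7-dimensional action, and the helper names `make_toy_episode` and `validate_episode` are assumptions made for illustration; the card does not document the actual feature names or shapes of OmniAction-LIBERO.

```python
from typing import Any, Dict, List

# Per-step fields defined by the RLDS standard.
RLDS_STEP_FIELDS = (
    "observation", "action", "reward", "discount",
    "is_first", "is_last", "is_terminal",
)


def make_toy_episode(num_steps: int) -> Dict[str, Any]:
    """Build a placeholder episode; observation keys and action size are hypothetical."""
    steps = []
    for t in range(num_steps):
        last = t == num_steps - 1
        steps.append({
            "observation": {"image": None, "audio": None},  # hypothetical keys
            "action": [0.0] * 7,                            # hypothetical action size
            "reward": 1.0 if last else 0.0,
            "discount": 1.0,
            "is_first": t == 0,
            "is_last": last,
            "is_terminal": last,
        })
    return {"steps": steps}


def validate_episode(episode: Dict[str, Any]) -> List[str]:
    """Return a list of structural problems found in one episode (empty if none)."""
    steps = list(episode.get("steps", []))
    if not steps:
        return ["episode has no steps"]
    problems = []
    for i, step in enumerate(steps):
        missing = [name for name in RLDS_STEP_FIELDS if name not in step]
        if missing:
            problems.append(f"step {i} is missing fields: {missing}")
    if not steps[0].get("is_first", False):
        problems.append("first step is not flagged is_first")
    if not steps[-1].get("is_last", False):
        problems.append("last step is not flagged is_last")
    return problems


if __name__ == "__main__":
    episode = make_toy_episode(3)
    print(validate_episode(episode))
```

For the real data, the same check could be applied to episodes read with TensorFlow Datasets (for example via `tfds.builder_from_directory`), assuming the downloaded files follow the usual TFDS on-disk layout; that layout is not confirmed by the card.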

⚠️ Important notice on upload progress: because this dataset is large, it is being uploaded in batches. Please check back and refresh the page for updates; the complete dataset will be available once all batches have been uploaded. Thank you for your patience and understanding!

Provider: maas

Created: 2025-10-29
