
OmniAction

ModelScope Community. Updated: 2026-05-21. Indexed: 2025-11-08.
Download link:
https://modelscope.cn/datasets/openmoss/OmniAction
Resource description:
> **🚨 Note: Dataset Upload Progress**
> Due to the large size of the dataset, we are uploading it in batches. Please stay tuned and check back for updates. Once the upload is complete, the full dataset will be accessible. Thank you for your patience and understanding!

<div align="center">
<h1> RoboOmni: Proactive Robot Manipulation in Omni-modal Context </h1>
</div>

<p align="center">
📖 <a href="https://arxiv.org/pdf/2510.23763"><strong>arXiv Paper</strong></a> |
🌐 <a href="https://OpenMOSS.github.io/RoboOmni"><strong>Website</strong></a> |
🤗 <a href="https://huggingface.co/fnlp/RoboOmni"><strong>Model</strong></a> |
🤗 <a href="https://huggingface.co/datasets/fnlp/OmniAction"><strong>Dataset</strong></a> |
🛠️ <a href="https://github.com/OpenMOSS/RoboOmni"><strong>GitHub</strong></a> |
</p>

![logo](https://cdn-uploads.huggingface.co/production/uploads/64c3c631e77ea9f28111172a/Lb55aSaitdpNl1iSC8xrm.png)

---

Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid progress in Vision–Language–Action (VLA) models for robotic manipulation. Although effective in many scenarios, current approaches largely rely on explicit instructions, whereas in real-world interactions, humans rarely issue instructions directly. Effective collaboration requires robots to infer user intentions proactively.

In this work, we introduce *cross-modal contextual instructions, a new setting where intent is derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands.* To address this new setting, we present **RoboOmni**, a *Perceiver-Thinker-Talker-Executor* framework based on end-to-end omni-modal LLMs that unifies intention recognition, interaction confirmation, and action execution. RoboOmni fuses auditory and visual signals spatiotemporally for robust intention recognition, while supporting direct speech interaction.

To address the absence of training data for proactive intention recognition in robotic manipulation, we build **OmniAction** comprising 140k episodes, 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types. Experiments in simulation and real-world settings show that RoboOmni surpasses text- and ASR-based baselines in success rate, inference speed, intention recognition, and proactive assistance.

---

## 📦 OmniAction Dataset

![data](https://cdn-uploads.huggingface.co/production/uploads/64c3c631e77ea9f28111172a/sIzK3U5RqonwjQCPEglfi.jpeg)

We introduce OmniAction, a large-scale multimodal dataset for contextual instruction following. It comprises 141,162 episodes covering 112 skills and 748 objects, enriched with 5,096 distinct speaker timbres, 2,482 non-verbal sound events, and 640 environmental backgrounds. The dataset spans six categories of contextual instructions (sentiment cues, overlapping voices, non-verbal cues, identity cues, dyadic dialogue, and triadic dialogue), capturing both subtle affective signals and complex multi-party interactions in everyday settings.

- **Format**: RLDS (Reinforcement Learning Datasets standard).
- **Audio**: Sorted according to filename.

## ⭐️ Architecture

At the heart of RoboOmni lies the Perceiver-Thinker-Talker-Executor architecture, which unifies multiple modalities (vision, speech, environmental sounds) into a single, seamless framework for robot action execution.

![WechatIMG2567](https://cdn-uploads.huggingface.co/production/uploads/64c3c631e77ea9f28111172a/z5hqTgAPU0BiFtdrKwq8A.jpeg)

## 👋 Citation

If you find our paper and code useful in your research, please cite our paper.
```bibtex
@article{wang2025roboomni,
  title={RoboOmni: Proactive Robot Manipulation in Omni-modal Context},
  author={Siyin Wang and Jinlan Fu and Feihong Liu and Xinzhe He and Huangxuan Wu and Junhao Shi and Kexin Huang and Zhaoye Fei and Jingjing Gong and Zuxuan Wu and Yugang Jiang and See-Kiong Ng and Tat-Seng Chua and Xipeng Qiu},
  journal={arXiv preprint arXiv:2510.23763},
  year={2025},
  url={https://arxiv.org/abs/2510.23763},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
}
```
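The two dataset notes above (RLDS format, audio sorted by filename) can be illustrated with a small, dependency-free sketch. It is not the official loader. The step flags `is_first`, `is_last`, and `is_terminal` come from the RLDS step schema, but the toy episode, the `.wav` filenames, the helper names, and the choice of plain lexicographic ordering are all assumptions made for illustration; the dataset card does not specify them.

```python
from pathlib import PurePosixPath


def sorted_audio_names(names):
    """Order audio files by base filename (plain lexicographic order).

    Assumption: "sorted according to filename" means lexicographic order.
    """
    return sorted(names, key=lambda n: PurePosixPath(n).name)


def episode_summary(episode):
    """Summarize an RLDS-style episode given as a dict with a 'steps' list."""
    steps = episode["steps"]
    if not steps:
        raise ValueError("episode has no steps")
    # RLDS marks episode boundaries with per-step boolean flags.
    if not steps[0]["is_first"] or not steps[-1]["is_last"]:
        raise ValueError("episode boundaries are not flagged as expected")
    return {
        "num_steps": len(steps),
        "terminated": bool(steps[-1]["is_terminal"]),
    }


# Hypothetical toy episode; real observations and actions are omitted.
toy_episode = {
    "steps": [
        {"is_first": True, "is_last": False, "is_terminal": False},
        {"is_first": False, "is_last": False, "is_terminal": False},
        {"is_first": False, "is_last": True, "is_terminal": True},
    ]
}

print(episode_summary(toy_episode))
print(sorted_audio_names(["b/ep_10.wav", "a/ep_2.wav"]))
```

Note that lexicographic order places `ep_10.wav` before `ep_2.wav`; if the real filenames carry unpadded numeric indices, a natural-sort key may be needed instead.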

Provider: maas
Created: 2025-10-29