cagataydev/vlm-voice-commands
Source: Hugging Face · Updated 2026-03-22 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/cagataydev/vlm-voice-commands
Description:
---
license: cc-by-4.0
task_categories:
- text-to-speech
- robotics
language:
- en
tags:
- voice-commands
- robotics
- VLM
- embodied-AI
- manipulation
- navigation
- pick-and-place
- human-robot-interaction
size_categories:
- 10K<n<100K
---
# 🎙️ VLM Robotics Voice Commands (Text)
**50,000 curated natural language commands for Vision-Language-Model robot control.**
This dataset contains diverse text commands that humans would speak to a robot,
covering 10 categories of embodied interaction.
## 🎯 Purpose
- Training VLMs to understand spoken robot instructions
- TTS source for generating audio training data
- Benchmark for robot language understanding
- Broad coverage of human→robot speech, from single-word safety commands to multi-step instructions
## 📊 Statistics
| Metric | Value |
|--------|-------|
| **Total Commands** | 50,000 |
| **Categories** | 10 |
| **Avg Word Count** | 8.8 words |
| **Language** | English |
| **Unique** | 100% (deduplicated) |
### Category Distribution
| Category | Count | % | Examples |
|----------|------:|---:|----------|
| pick_place | 18,570 | 37.1% | "Pick up the red cube", "Bring me the bottle" |
| multistep | 8,464 | 16.9% | "Open the drawer, take the cup, close it" |
| manipulation | 5,557 | 11.1% | "Pour water into the glass", "Fold the towel" |
| navigation | 5,178 | 10.4% | "Go to the kitchen", "Turn left" |
| observation | 5,008 | 10.0% | "What do you see?", "Count the objects" |
| spatial | 3,506 | 7.0% | "Move arm left 5cm", "Lower the gripper" |
| household | 1,937 | 3.9% | "Clean the table", "Set table for dinner" |
| safety | 839 | 1.7% | "Stop!", "Be careful" |
| conversational | 519 | 1.0% | "Good job", "Try again" |
| context_rich | 422 | 0.8% | "Grab that", "The one I pointed at" |
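The counts and percentages in the table can be cross-checked with a few lines of Python. This is a minimal sketch that copies the counts from the table above; the variable names are illustrative only and are not part of the dataset.

```python
# Category counts copied from the distribution table above.
category_counts = {
    "pick_place": 18570,
    "multistep": 8464,
    "manipulation": 5557,
    "navigation": 5178,
    "observation": 5008,
    "spatial": 3506,
    "household": 1937,
    "safety": 839,
    "conversational": 519,
    "context_rich": 422,
}

# Total number of commands across all categories.
total = sum(category_counts.values())

# Share of each category, rounded to one decimal place as in the table.
category_percent = {
    name: round(100 * count / total, 1)
    for name, count in category_counts.items()
}

for name, count in category_counts.items():
    print(f"{name:<15} {count:>6,} {category_percent[name]:>5.1f}%")
print(f"{'total':<15} {total:>6,}")
```

The distribution is heavily skewed toward `pick_place`, so a stratified split or per-category sampling may be worth considering when the smaller categories matter.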
## 🏗️ Schema
| Column | Type | Description |
|--------|------|-------------|
| `id` | string | Unique ID (vlm_000000 format) |
| | string | The voice command text |
| | string | Command category |
| | string | easy / medium / hard |
| | string | Suggested TTS voice (NATM0-3, NATF0-3) |
| | int | Number of words |
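A row can be sanity-checked against the descriptions in the schema table. The sketch below is an illustration under assumptions: apart from `id`, the column names (`text`, `category`, `difficulty`, `voice`, `word_count`) are hypothetical placeholders, the ID pattern is inferred from the `vlm_000000` example, and `validate_row` is not part of the dataset or any library.

```python
import re

# Assumption: IDs are "vlm_" followed by six digits, as in vlm_000000.
ID_PATTERN = re.compile(r"vlm_\d{6}")
# The ten categories listed in the distribution table.
CATEGORIES = {
    "pick_place", "multistep", "manipulation", "navigation", "observation",
    "spatial", "household", "safety", "conversational", "context_rich",
}
DIFFICULTIES = {"easy", "medium", "hard"}
# NATM0-3 and NATF0-3 expanded into eight voice labels.
VOICES = {f"NAT{gender}{index}" for gender in "MF" for index in range(4)}


def validate_row(row: dict) -> list[str]:
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    # Column names other than "id" are hypothetical placeholders.
    if not ID_PATTERN.fullmatch(str(row.get("id", ""))):
        problems.append("id does not match the vlm_000000 format")
    text = row.get("text")
    if not isinstance(text, str) or not text.strip():
        problems.append("text is empty or not a string")
    if row.get("category") not in CATEGORIES:
        problems.append("unknown category")
    if row.get("difficulty") not in DIFFICULTIES:
        problems.append("difficulty is not easy/medium/hard")
    if row.get("voice") not in VOICES:
        problems.append("voice is not one of NATM0-3 / NATF0-3")
    word_count = row.get("word_count")
    if not isinstance(word_count, int) or word_count < 1:
        problems.append("word_count is not a positive integer")
    return problems


# A made-up row for illustration; it is not taken from the dataset.
example_row = {
    "id": "vlm_000000",
    "text": "Pick up the red cube",
    "category": "pick_place",
    "difficulty": "medium",
    "voice": "NATM0",
    "word_count": 5,
}
print(validate_row(example_row))
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a row at once.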
## 🗣️ Natural Variations
Commands include natural speech patterns:
- **Polite**: "Could you pick up the cup?"
- **Casual**: "Hey, grab that"
- **Urgent**: "Stop right now!"
- **Questioning**: "Can you reach that?"
- **Compound**: "Pick it up and bring it to me, okay?"
## 📝 Difficulty Levels
| Level | Description | Example |
|-------|-------------|---------|
| **Easy** | Single simple action, safety | "Stop", "Good job" |
| **Medium** | Standard pick/place/nav | "Pick up the red cube" |
| **Hard** | Multi-step, context-dependent | "Find the cup, show me, put it away" |
## 🔧 Usage
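A minimal loading sketch, assuming the dataset can be read by the Hugging Face `datasets` library. The `train` split name and the `category` column name are assumptions that the card does not confirm, and `load_commands` and `filter_by_category` are hypothetical helpers written for this example.

```python
def load_commands(split: str = "train"):
    """Load the dataset from the Hugging Face Hub.

    Requires `pip install datasets` and network access.
    The split name "train" is an assumption; check the dataset page.
    """
    # Imported lazily so the helper below can be used without the library.
    from datasets import load_dataset

    return load_dataset("cagataydev/vlm-voice-commands", split=split)


def filter_by_category(rows, category: str, column: str = "category"):
    """Return the rows whose `column` value equals `category`.

    Accepts any iterable of dict-like rows, including a loaded Dataset.
    The default column name "category" is an assumption.
    """
    return [row for row in rows if row[column] == category]


# Example (needs network access):
# commands = load_commands()
# safety = filter_by_category(commands, "safety")
# print(len(safety), "safety commands")
```

The helper works on plain dictionaries as well, so it can be tried on a few hand-written rows before downloading anything.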
## 📦 Related
| Dataset | Description |
|---------|-------------|
| [cagataydev/vlm-voice-audio](https://huggingface.co/datasets/cagataydev/vlm-voice-audio) | Audio version with TTS |
| [cagataydev/q-omni-data-soup](https://huggingface.co/datasets/cagataydev/q-omni-data-soup) | Multi-modal training soup |
Built with [DevDuck](https://github.com/cagataycali/devduck) 🦆
---
*Generated on 2026-03-22*
Provider:
cagataydev



