cagataydev/vlm-voice-audio
Source: Hugging Face | Updated: 2026-03-23 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/cagataydev/vlm-voice-audio
Resource overview:
---
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
- text-to-speech
- robotics
language:
- en
tags:
- voice-commands
- robotics
- VLM
- embodied-AI
- speech
- manipulation
- navigation
- pick-and-place
- human-robot-interaction
size_categories:
- 10K<n<100K
---
# 🎙️ VLM Robotics Voice Commands (Audio)
**Natural speech commands for Vision-Language-Model robot control.**
This dataset contains **9,999** audio clips of natural-sounding voice commands
for controlling robots, covering pick & place, navigation, manipulation, observation,
multi-step tasks, spatial commands, safety, household chores, conversational feedback,
and context-rich requests. The speech is synthesized with TTS (see Generation).
## 🎯 Purpose
The dataset is intended for training omni-modal VLMs that understand **spoken** robot commands.
The audio captures natural speech patterns, including:
- Polite forms: *"Could you pick up the red cup?"*
- Casual: *"Hey, grab that"*
- Urgent: *"Stop immediately!"*
- Multi-step: *"Open the drawer, take out the spoon, and close it"*
- Context-dependent: *"The one I pointed at"* (requires vision)
## 📊 Statistics
| Metric | Value |
|--------|-------|
| **Total Examples** | 9,999 |
| **Total Audio** | 6.8 hours |
| **Avg Duration** | 2.5s |
| **Sample Rate** | 44.1 kHz |
| **Format** | WAV (16-bit) |
| **Language** | English |
### Category Distribution
| Category | Count | % |
|----------|------:|---:|
| pick_place | 2,729 | 27.3% |
| manipulation | 1,446 | 14.5% |
| multistep | 1,330 | 13.3% |
| navigation | 1,179 | 11.8% |
| observation | 1,025 | 10.3% |
| spatial | 794 | 7.9% |
| household | 584 | 5.8% |
| safety | 356 | 3.6% |
| conversational | 295 | 3.0% |
| context_rich | 261 | 2.6% |
### Voice Distribution
| Voice | Count | % |
|-------|------:|---:|
| NATM1 | 5,021 | 50.2% |
| NATF0 | 1,072 | 10.7% |
| NATM0 | 963 | 9.6% |
| NATF1 | 772 | 7.7% |
| NATM2 | 636 | 6.4% |
| NATF2 | 576 | 5.8% |
| NATM3 | 481 | 4.8% |
| NATF3 | 478 | 4.8% |
### Difficulty Distribution
| Difficulty | Count |
|-----------|------:|
| Easy | 651 |
| Medium | 7,757 |
| Hard | 1,591 |
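
The count and percentage columns in the tables above can be recomputed from the corresponding metadata column. The following is a minimal sketch, assuming the column values have been read into a plain Python list. The helper name `distribution` is illustrative and not part of the dataset; the counts used are the ones reported in the Difficulty table.

```python
from collections import Counter


def distribution(labels):
    """Return {label: (count, percent)} ordered by descending count."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        label: (count, round(100 * count / total, 1))
        for label, count in counts.most_common()
    }


# Rebuild the difficulty column from the counts reported in this card.
difficulty = ["easy"] * 651 + ["medium"] * 7757 + ["hard"] * 1591
dist = distribution(difficulty)
for label, (count, percent) in dist.items():
    print(f"{label:<8}{count:>6}{percent:>7}%")
```

With the dataset loaded as shown in the Usage section, passing `ds["category"]` or `ds["voice"]` to the same helper should reproduce the other two tables.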
## 🏗️ Schema
| Column | Type | Description |
|--------|------|-------------|
| `audio` | Audio | WAV audio at 44.1 kHz |
| `text` | string | Transcript of the spoken command |
| `voice` | string | Speaker voice ID (NATM0-3, NATF0-3) |
| `category` | string | Command category (pick_place, navigation, etc.) |
| `difficulty` | string | easy / medium / hard |
| `duration_seconds` | float | Audio duration in seconds |
| `id` | string | Unique example ID |
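
Because every row carries `category`, `difficulty`, and `duration_seconds`, subsets can be selected from metadata alone, without decoding audio. The sketch below works on plain dictionaries that follow the schema above. The function name `select_commands`, its parameters, and the sample rows are illustrative assumptions, not part of the dataset. Difficulty is compared case-insensitively because this card shows both `easy` and `Easy`.

```python
from typing import Iterable, Optional, TypedDict


class CommandRow(TypedDict):
    """Metadata columns from the schema (the `audio` column is left out)."""
    id: str
    text: str
    voice: str
    category: str
    difficulty: str
    duration_seconds: float


def select_commands(
    rows: Iterable[CommandRow],
    category: Optional[str] = None,
    difficulty: Optional[str] = None,
    max_seconds: Optional[float] = None,
) -> list[CommandRow]:
    """Keep rows that match every filter that was given."""
    selected = []
    for row in rows:
        if category is not None and row["category"] != category:
            continue
        if difficulty is not None and row["difficulty"].lower() != difficulty.lower():
            continue
        if max_seconds is not None and row["duration_seconds"] > max_seconds:
            continue
        selected.append(row)
    return selected


# Made-up rows for illustration only; real IDs and durations will differ.
rows: list[CommandRow] = [
    {"id": "a", "text": "Stop immediately!", "voice": "NATM1",
     "category": "safety", "difficulty": "easy", "duration_seconds": 1.2},
    {"id": "b", "text": "Go to the kitchen", "voice": "NATF0",
     "category": "navigation", "difficulty": "easy", "duration_seconds": 1.6},
    {"id": "c", "text": "Slow down", "voice": "NATF2",
     "category": "safety", "difficulty": "medium", "duration_seconds": 0.9},
]

short_safety = select_commands(rows, category="safety", max_seconds=1.0)
print([row["text"] for row in short_safety])
```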
## 🗣️ Voice Profiles
| Voice | Description |
|-------|-------------|
| NATM1 | Natural male — calm, clear (primary) |
| NATM0 | Natural male — low-pitched, measured |
| NATM2 | Natural male — warm, friendly |
| NATM3 | Natural male — deep, authoritative |
| NATF0 | Natural female — clear, professional |
| NATF1 | Natural female — energetic, upbeat |
| NATF2 | Natural female — soft, thoughtful |
| NATF3 | Natural female — bright, enthusiastic |
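
The voice IDs in the table above follow a regular pattern: `NAT`, then `M` or `F`, then an index from 0 to 3. The sketch below splits an ID into those parts, for example to group evaluation results by voice gender. `parse_voice_id` is a hypothetical helper, and the pattern is inferred from the eight IDs listed here rather than from a documented naming scheme.

```python
import re

# Pattern inferred from the eight voice IDs in this card (NATM0-3, NATF0-3).
_VOICE_ID = re.compile(r"NAT([MF])([0-3])")


def parse_voice_id(voice: str) -> tuple[str, int]:
    """Split a voice ID such as 'NATF2' into ('female', 2)."""
    match = _VOICE_ID.fullmatch(voice)
    if match is None:
        raise ValueError(f"unrecognized voice ID: {voice!r}")
    gender = "male" if match.group(1) == "M" else "female"
    return gender, int(match.group(2))


print(parse_voice_id("NATM1"))
print(parse_voice_id("NATF2"))
```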
## 📝 Example Commands
**Pick & Place:**
- *"Pick up the red cube on the table"*
- *"Grab the bottle from the shelf and bring it here"*
- *"Could you move the blue box to the left side?"*
**Navigation:**
- *"Go to the kitchen"*
- *"Move forward two meters"*
- *"Turn left and face the door"*
**Manipulation:**
- *"Open the drawer slowly"*
- *"Pour water into the cup"*
- *"Fold the towel and put it away"*
**Multi-step:**
- *"First go to the counter, pick up the mug, then bring it to me"*
- *"Clear the table and organize everything in the bin"*
**Safety:**
- *"Stop immediately!"*
- *"Be careful, that's fragile"*
- *"Slow down"*
## 🔧 Usage
```python
from datasets import load_dataset
ds = load_dataset("cagataydev/vlm-voice-audio", split="train")
# Inspect the transcript and category of one sample
example = ds[0]
print(f"Text: {example['text']}")
print(f"Category: {example['category']}")
# example["audio"] contains the waveform array + sampling_rate
```
## 📦 Related Datasets
| Dataset | Description |
|---------|-------------|
| [cagataydev/vlm-voice-commands](https://huggingface.co/datasets/cagataydev/vlm-voice-commands) | Text-only version (50K commands) |
| [cagataydev/omni-voice-training](https://huggingface.co/datasets/cagataydev/omni-voice-training) | General Q voice training data |
| [cagataydev/q-omni-data-soup](https://huggingface.co/datasets/cagataydev/q-omni-data-soup) | Multi-modal training dataset |
## 🏭 Generation
Generated using **Parler-TTS Mini v1.1** on NVIDIA L40S (46GB VRAM).
Speaker characteristics controlled via natural language descriptions.
Source text from curated template-based generation covering 10 robotics command categories.
Built with [DevDuck](https://github.com/cagataycali/devduck) 🦆
---
*Generated on 2026-03-23*
Provider:
cagataydev



