cagataydev/vlm-voice-audio

Name: cagataydev/vlm-voice-audio
Creator: cagataydev
Published: 2026-03-23 20:34:42
License: no description available

Hugging Face | Updated 2026-03-23 | Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/cagataydev/vlm-voice-audio

Resource description:

---
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
- text-to-speech
- robotics
language:
- en
tags:
- voice-commands
- robotics
- VLM
- embodied-AI
- speech
- manipulation
- navigation
- pick-and-place
- human-robot-interaction
size_categories:
- 10K<n<100K
---

# 🎙️ VLM Robotics Voice Commands (Audio)

**Natural speech commands for Vision-Language-Model robot control.**

This dataset contains **9,999** audio recordings of human voice commands for controlling robots, covering pick & place, navigation, manipulation, observation, multi-step tasks, spatial commands, safety, household chores, and conversational feedback.

## 🎯 Purpose

Training omni-modal VLMs that understand **spoken** robot commands. The audio captures natural speech patterns including:

- Polite forms: *"Could you pick up the red cup?"*
- Casual: *"Hey, grab that"*
- Urgent: *"Stop immediately!"*
- Multi-step: *"Open the drawer, take out the spoon, and close it"*
- Context-dependent: *"The one I pointed at"* (requires vision)

## 📊 Statistics

| Metric | Value |
|--------|-------|
| **Total Examples** | 9,999 |
| **Total Audio** | 6.8 hours |
| **Avg Duration** | 2.5s |
| **Sample Rate** | 44.1 kHz |
| **Format** | WAV (16-bit) |
| **Language** | English |

### Category Distribution

| Category | Count | % |
|----------|------:|---:|
| pick_place | 2,729 | 27.3% |
| manipulation | 1,446 | 14.5% |
| multistep | 1,330 | 13.3% |
| navigation | 1,179 | 11.8% |
| observation | 1,025 | 10.3% |
| spatial | 794 | 7.9% |
| household | 584 | 5.8% |
| safety | 356 | 3.6% |
| conversational | 295 | 3.0% |
| context_rich | 261 | 2.6% |

### Voice Distribution

| Voice | Count | % |
|-------|------:|---:|
| NATM1 | 5,021 | 50.2% |
| NATF0 | 1,072 | 10.7% |
| NATM0 | 963 | 9.6% |
| NATF1 | 772 | 7.7% |
| NATM2 | 636 | 6.4% |
| NATF2 | 576 | 5.8% |
| NATM3 | 481 | 4.8% |
| NATF3 | 478 | 4.8% |

### Difficulty Distribution

| Difficulty | Count |
|-----------|------:|
| Easy | 651 |
| Medium | 7,757 |
| Hard | 1,591 |

## 🏗️ Schema

| Column | Type | Description |
|--------|------|-------------|
| `audio` | Audio | WAV audio at 44.1 kHz |
| `text` | string | Transcript of the spoken command |
| `voice` | string | Speaker voice ID (NATM0-3, NATF0-3) |
| `category` | string | Command category (pick_place, navigation, etc.) |
| `difficulty` | string | easy / medium / hard |
| `duration_seconds` | float | Audio duration in seconds |
| `id` | string | Unique example ID |

## 🗣️ Voice Profiles

| Voice | Description |
|-------|-------------|
| NATM1 | Natural male: calm, clear (primary) |
| NATM0 | Natural male: low-pitched, measured |
| NATM2 | Natural male: warm, friendly |
| NATM3 | Natural male: deep, authoritative |
| NATF0 | Natural female: clear, professional |
| NATF1 | Natural female: energetic, upbeat |
| NATF2 | Natural female: soft, thoughtful |
| NATF3 | Natural female: bright, enthusiastic |

## 📝 Example Commands

**Pick & Place:**
- *"Pick up the red cube on the table"*
- *"Grab the bottle from the shelf and bring it here"*
- *"Could you move the blue box to the left side?"*

**Navigation:**
- *"Go to the kitchen"*
- *"Move forward two meters"*
- *"Turn left and face the door"*

**Manipulation:**
- *"Open the drawer slowly"*
- *"Pour water into the cup"*
- *"Fold the towel and put it away"*

**Multi-step:**
- *"First go to the counter, pick up the mug, then bring it to me"*
- *"Clear the table and organize everything in the bin"*

**Safety:**
- *"Stop immediately!"*
- *"Be careful, that's fragile"*
- *"Slow down"*

## 🔧 Usage

```python
from datasets import load_dataset

ds = load_dataset("cagataydev/vlm-voice-audio", split="train")

# Inspect a sample
example = ds[0]
print(f"Text: {example['text']}")
print(f"Category: {example['category']}")
# example["audio"] contains the waveform array + sampling_rate
```

## 📦 Related Datasets

| Dataset | Description |
|---------|-------------|
| [cagataydev/vlm-voice-commands](https://huggingface.co/datasets/cagataydev/vlm-voice-commands) | Text-only version (50K commands) |
| [cagataydev/omni-voice-training](https://huggingface.co/datasets/cagataydev/omni-voice-training) | General Q voice training data |
| [cagataydev/q-omni-data-soup](https://huggingface.co/datasets/cagataydev/q-omni-data-soup) | Multi-modal training dataset |

## 🏭 Generation

Generated using **Parler-TTS Mini v1.1** on NVIDIA L40S (46GB VRAM). Speaker characteristics controlled via natural language descriptions. Source text from curated template-based generation covering 10 robotics command categories.

Built with [DevDuck](https://github.com/cagataycali/devduck) 🦆

---

*Generated on 2026-03-23*
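
The percentage columns and the average duration in the statistics can be recomputed from the reported counts. The following is a minimal sketch of that arithmetic, not part of the dataset tooling: the counts are copied from the tables in this card, while the constant and function names (`CATEGORY_COUNTS`, `percentages`, and so on) are made up for illustration. It uses only the standard library and does not download any audio.

```python
# Counts copied from the tables in the dataset card.
TOTAL_EXAMPLES = 9_999
TOTAL_AUDIO_HOURS = 6.8  # rounded value as reported in the card

CATEGORY_COUNTS = {
    "pick_place": 2729, "manipulation": 1446, "multistep": 1330,
    "navigation": 1179, "observation": 1025, "spatial": 794,
    "household": 584, "safety": 356, "conversational": 295,
    "context_rich": 261,
}
VOICE_COUNTS = {
    "NATM1": 5021, "NATF0": 1072, "NATM0": 963, "NATF1": 772,
    "NATM2": 636, "NATF2": 576, "NATM3": 481, "NATF3": 478,
}
DIFFICULTY_COUNTS = {"easy": 651, "medium": 7757, "hard": 1591}


def percentages(counts):
    """Return each count as a percentage of the table total, rounded to 0.1."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


def totals_match(*tables):
    """Check that every distribution table sums to the reported example count."""
    return all(sum(t.values()) == TOTAL_EXAMPLES for t in tables)


category_pct = percentages(CATEGORY_COUNTS)
voice_pct = percentages(VOICE_COUNTS)
difficulty_pct = percentages(DIFFICULTY_COUNTS)

# Average clip length implied by the rounded total duration.
implied_avg_seconds = TOTAL_AUDIO_HOURS * 3600 / TOTAL_EXAMPLES

if __name__ == "__main__":
    print("Tables consistent:", totals_match(
        CATEGORY_COUNTS, VOICE_COUNTS, DIFFICULTY_COUNTS))
    print("Category %:", category_pct)
    print("Voice %:", voice_pct)
    print("Difficulty %:", difficulty_pct)
    print(f"Implied average duration: {implied_avg_seconds:.2f} s")
```

Because the 6.8 hour total is itself rounded, the implied average lands slightly below the 2.5 s listed in the card; the two figures agree within rounding. The difficulty percentages are derived here rather than reported in the card, and they show that roughly three quarters of the examples are labeled medium.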

Provided by:

cagataydev
