
Voila-million-voice | Speech Recognition Dataset | Language Model Dataset

ModelScope Community · Updated 2025-06-13 · Indexed 2025-05-10
Speech Recognition
Language Model
Download link:
https://modelscope.cn/datasets/maitrix-org/Voila-million-voice
Resource description:
Voila: Voice-Language Foundation Models

💜 Project Page | 🖥️ GitHub | 🤗 Hugging Face | 📑 Paper | 🌐 [Online Demo](https://huggingface.co/spaces/maitrix-org/Voila-demo) | 🏠 Maitrix.org

Voila is a new family of large voice-language foundation models aiming to lift human-AI interaction experiences to the next level. Breaking away from the constraints of traditional voice AI systems—high latency, loss of vocal nuances, and mechanical responses—Voila employs an innovative end-to-end model design and a novel hierarchical Transformer architecture. This approach enables real-time, autonomous, and rich voice interactions, with latency as low as 195 ms, surpassing average human response times. Combining advanced voice and language modeling, Voila offers customizable, persona-driven engagements and excels in a range of audio tasks from ASR and TTS to speech translation across six languages. With the online [web demo](https://huggingface.co/spaces/maitrix-org/Voila-demo), Voila invites you to explore a transformative, natural dialogue experience between human and AI.

# ✨ Highlights

- ⭐ High-fidelity, low-latency, real-time streaming audio processing
- ⭐ Effective integration of voice and language modeling capabilities
- ⭐ Millions of pre-built and custom voices, fast voice switching during conversation
- ⭐ Unified model for various audio tasks

# 🎥 Video Demo

[![Voila Demo](https://img.youtube.com/vi/J27M9-g5KL0/0.jpg)](https://www.youtube.com/watch?v=J27M9-g5KL0)

# 🔥 Latest News!!

* April 28, 2025: 👋 We've released the inference code and model weights of Voila.
# ⚙️ Foundation Models

| Model | Description | Download Link |
|-------|-------------|---------------|
| Voila-base | Voila base model | https://huggingface.co/maitrix-org/Voila-base |
| Voila-Chat | End-to-end audio chat model | https://huggingface.co/maitrix-org/Voila-chat |
| Voila-Autonomous (preview) | Full-duplex audio chat model | https://huggingface.co/maitrix-org/Voila-autonomous-preview |
| Voila-Audio-alpha | Empowering LLM with raw audio input | https://huggingface.co/maitrix-org/Voila-audio-alpha |
| Voila-Tokenizer | Audio tokenizer | https://huggingface.co/maitrix-org/Voila-Tokenizer |

## Usage

### CLI demo

```shell
for model_name in "maitrix-org/Voila-audio-alpha" "maitrix-org/Voila-base" "maitrix-org/Voila-chat"; do
    # Text chat
    python infer.py \
        --model-name ${model_name} \
        --instruction "" \
        --input-text "Hello" \
        --task-type chat_tito
    # Voice chat
    python infer.py \
        --model-name ${model_name} \
        --instruction "" \
        --input-audio "examples/test1.mp3" \
        --task-type chat_aiao
done

# Autonomous mode
python infer.py \
    --model-name "maitrix-org/Voila-autonomous-preview" \
    --instruction "" \
    --input-audio "examples/test_autonomous1.mp3" \
    --task-type chat_aiao_auto
```

### Gradio demo

```shell
python gradio_demo.py
```

For more information, please refer to the [code repository](https://github.com/maitrix-org/Voila).

# 📁 Datasets

We publish the following two datasets: Voila Benchmark and Voila Voice Library. Voila Benchmark is a novel speech evaluation benchmark, while Voila Voice Library provides millions of pre-built and customizable voices.

| Dataset | Description | Download Link |
|---------|-------------|---------------|
| Voila Benchmark | Evaluation of Voila Benchmark | https://huggingface.co/datasets/maitrix-org/Voila-Benchmark |
| Voila Voice Library | Millions of pre-built voices | https://huggingface.co/datasets/maitrix-org/Voila-million-voice |

# 📊 Benchmark

## 1. Voila Benchmark

We introduce a novel speech evaluation benchmark called the Voila Benchmark.
The Voila Benchmark is constructed by sampling from five widely used language model evaluation datasets: MMLU, MATH, OpenAI HumanEval, NQ-Open, and GSM8k. We compare our results with SpeechGPT and Moshi.

| Model | Voila Benchmark |
|-------|-----------------|
| SpeechGPT | 13.29 |
| Moshi | 11.45 |
| **Voila** | **30.56** |

_(higher is better)_

For detailed scores of the Voila Benchmark on each specific domain, please refer to our paper (Section 5.1, "Evaluation of Voila Benchmark").

## 2. Evaluation of ASR

As Voila supports multiple tasks, including Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and spoken question answering, we also evaluate the performance of ASR and TTS.

For ASR, we assess performance on the LibriSpeech test-clean dataset, using Word Error Rate (WER) as our metric. Voila attains a WER of 4.8%, outperforming the 5.7% reported by Moshi. In scenarios where both models utilize LibriSpeech training data, Voila achieves an impressive WER of 2.7%.

| Model | LibriSpeech test-clean (WER) |
|-------|------------------------------|
| Whisper large v2 | 2.7 |
| Whisper large v3 | 2.2 |
| FastConformer | 3.6 |
| VoxtLM | 2.7 |
| Moshi | 5.7 |
| **Voila (w/o LibriSpeech train split)** | **4.8** |
| **Voila (with LibriSpeech train split)** | **2.7** |

_(lower is better)_

## 3. Evaluation of TTS

For TTS, we follow the evaluation metrics proposed in Vall-E, which involve transcribing the generated audio using HuBERT-Large. Voila once again leads with a WER of 3.2% (2.8% when using LibriSpeech training data).

| Model | LibriSpeech test-clean (WER) |
|-------|------------------------------|
| YourTTS | 7.7 |
| Vall-E | 5.9 |
| Moshi | 4.7 |
| **Voila (w/o LibriSpeech train split)** | **3.2** |
| **Voila (with LibriSpeech train split)** | **2.8** |

_(lower is better)_

# 📝 Citation

If you find our work helpful, please cite us.
```
@article{voila2025,
  author        = {Yemin Shi and Yu Shu and Siwei Dong and Guangyi Liu and Jaward Sesay and Jingwen Li and Zhiting Hu},
  title         = {Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Roleplay},
  eprint        = {2505.02707},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  year          = {2025}
}
```
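The ASR and TTS evaluations above report Word Error Rate: the word-level edit distance (substitutions + insertions + deletions) between a reference transcript and a hypothesis, divided by the number of reference words. As a quick illustration of the metric (a minimal sketch, not the evaluation code used by Voila), WER can be computed with standard dynamic programming:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance between reference
    and hypothesis, divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)

# One substituted word out of five reference words -> WER = 0.2 (i.e. 20%)
print(wer("hello world how are you", "hello word how are you"))
```

A WER of 4.8% thus means roughly 4.8 word errors per 100 reference words; note that insertions can push WER above 100%.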
Provided by:
maas
Created:
2025-05-08