Voila-million-voice
Source: ModelScope community (魔搭社区) | Updated: 2025-11-27 | Indexed: 2025-05-10
Download link: https://modelscope.cn/datasets/maitrix-org/Voila-million-voice
Resource overview:
<p align="center">
<img src="https://voila.maitrix.org/static/images/logo.png" width="400"/><br/>
<b>Voila: <span style="color:#ca00f9">Voi</span>ce-<span style="color:#ca00f9">La</span>nguage Foundation Models</b><br/><br/>
💜 <a href="https://voila.maitrix.org"><b>Project Page</b></a>    |    🖥️ <a href="https://github.com/maitrix-org/Voila">GitHub</a>    |   🤗 <a href="https://huggingface.co/collections/maitrix-org/voila-67e0d96962c19f221fc73fa5">Hugging Face</a>   |    📑 <a href="http://arxiv.org/abs/2505.02707">Paper</a>    |    🌐 <a href="https://huggingface.co/spaces/maitrix-org/Voila-demo">Online Demo</a>   |    🏠<a href="https://maitrix.org">Maitrix.org</a>
</p>
Voila is a new family of large voice-language foundation models aiming to lift human-AI interaction experiences to the next level. Breaking away from the constraints of traditional voice AI systems—high latency, loss of vocal nuances, and mechanical responses—Voila employs an innovative end-to-end model design and a novel hierarchical Transformer architecture. This approach enables real-time, autonomous, and rich voice interactions, with latency as low as 195 ms, surpassing average human response times. Combining advanced voice and language modeling, Voila offers customizable, persona-driven engagements and excels in a range of audio tasks from ASR and TTS to speech translation across six languages. With the online [web demo](https://huggingface.co/spaces/maitrix-org/Voila-demo), Voila invites you to explore a transformative, natural dialogue experience between human and AI.
# ✨ Highlights
- ⭐ High-fidelity, low-latency, real-time streaming audio processing
- ⭐ Effective integration of voice and language modeling capabilities
- ⭐ Millions of pre-built and custom voices, fast voice switching during conversation
- ⭐ Unified model for various audio tasks
# 🎥 Video Demo
[Watch the video demo on YouTube](https://www.youtube.com/watch?v=J27M9-g5KL0)
# 🔥 Latest News!!
* April 28, 2025: 👋 We've released the inference code and model weights of Voila.
# ⚙️ Foundation Models
| Model | Description | Download Link |
|--------|-----------|-----------------|
|Voila-base|Voila base model|https://huggingface.co/maitrix-org/Voila-base|
|Voila-Chat|End-to-end audio chat model|https://huggingface.co/maitrix-org/Voila-chat|
|Voila-Autonomous (preview)|Full-duplex audio chat model|https://huggingface.co/maitrix-org/Voila-autonomous-preview|
|Voila-Audio-alpha|Empowering LLM with raw audio input|https://huggingface.co/maitrix-org/Voila-audio-alpha|
|Voila-Tokenizer|Audio tokenizer|https://huggingface.co/maitrix-org/Voila-Tokenizer|
## Usage
### CLI demo
```shell
for model_name in "maitrix-org/Voila-audio-alpha" "maitrix-org/Voila-base" "maitrix-org/Voila-chat"; do
    # Text chat
    python infer.py \
        --model-name "${model_name}" \
        --instruction "" \
        --input-text "Hello" \
        --task-type chat_tito
    # Voice chat
    python infer.py \
        --model-name "${model_name}" \
        --instruction "" \
        --input-audio "examples/test1.mp3" \
        --task-type chat_aiao
done
# Autonomous mode
python infer.py \
    --model-name "maitrix-org/Voila-autonomous-preview" \
    --instruction "" \
    --input-audio "examples/test_autonomous1.mp3" \
    --task-type chat_aiao_auto
```
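The same invocations can be scripted from Python. The following is a minimal sketch, not part of the Voila repository: it only assembles the flags shown in the CLI demo above and assumes it is run from the repository root, where `infer.py` lives. The helper names `build_command` and `run` are hypothetical.

```python
import shlex
import subprocess
import sys

# Task types exactly as they appear in the CLI demo.
TEXT_CHAT = "chat_tito"
VOICE_CHAT = "chat_aiao"
AUTONOMOUS = "chat_aiao_auto"


def build_command(model_name, task_type, input_text=None, input_audio=None, instruction=""):
    """Assemble an infer.py command line from the flags used in the CLI demo."""
    if (input_text is None) == (input_audio is None):
        raise ValueError("pass exactly one of input_text or input_audio")
    cmd = [sys.executable, "infer.py",
           "--model-name", model_name,
           "--instruction", instruction]
    if input_text is not None:
        cmd += ["--input-text", input_text]
    else:
        cmd += ["--input-audio", input_audio]
    cmd += ["--task-type", task_type]
    return cmd


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default: the commands are printed, not executed.
    run(build_command("maitrix-org/Voila-chat", TEXT_CHAT, input_text="Hello"))
    run(build_command("maitrix-org/Voila-chat", VOICE_CHAT,
                      input_audio="examples/test1.mp3"))
    run(build_command("maitrix-org/Voila-autonomous-preview", AUTONOMOUS,
                      input_audio="examples/test_autonomous1.mp3"))
```

Passing `dry_run=False` to `run` executes the command, which requires the model weights and the repository's dependencies to be available.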
### Gradio demo
```shell
python gradio_demo.py
```
For more information, please refer to the [code repository](https://github.com/maitrix-org/Voila).
# 📁 Datasets
We publish two datasets: Voila Benchmark and Voila Voice Library. Voila Benchmark is a novel speech evaluation benchmark, while Voila Voice Library provides millions of pre-built and customizable voices.
| Dataset | Description | Download Link |
|--------|-----------|-----------------|
|Voila Benchmark| Speech evaluation benchmark | https://huggingface.co/datasets/maitrix-org/Voila-Benchmark |
|Voila Voice Library| Millions of pre-built voices | https://huggingface.co/datasets/maitrix-org/Voila-million-voice |
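
Both datasets are hosted as Hugging Face dataset repositories, so one way to fetch them is `snapshot_download` from the `huggingface_hub` package. The sketch below is an illustration, not an official loader: the repository IDs come from the table above, while the helper names, the `data` directory, and the local layout are assumptions. The file structure inside each repository is not described here, so inspect the downloaded folder before relying on it.

```python
from pathlib import Path

# Repository IDs taken from the download links in the table above.
DATASET_REPOS = {
    "Voila Benchmark": "maitrix-org/Voila-Benchmark",
    "Voila Voice Library": "maitrix-org/Voila-million-voice",
}


def target_dir(name, root="data"):
    """Return the local folder a dataset would be stored in (hypothetical layout)."""
    return Path(root) / DATASET_REPOS[name].split("/")[-1]


def download_dataset(name, root="data"):
    """Download every file of one dataset repository and return the local path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=DATASET_REPOS[name],
        repo_type="dataset",
        local_dir=str(target_dir(name, root)),
    )
```

Calling `download_dataset("Voila Voice Library")` needs network access and `pip install huggingface_hub`. The download may be large, so check the available disk space first.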
# 📊 Benchmark
## 1. Voila Benchmark
We introduce a novel speech evaluation benchmark called the Voila Benchmark. It is constructed by sampling from five widely used language model evaluation datasets: MMLU, MATH, OpenAI HumanEval, NQ-Open, and GSM8K. We compare our results with SpeechGPT and Moshi.
| Model | Voila Benchmark |
|-------|----------------|
|SpeechGPT| 13.29|
|Moshi | 11.45 |
|**Voila** | **30.56** |
_(higher is better)_
For detailed scores of Voila Benchmark on each specific domain, please refer to our paper (Section 5.1 "Evaluation of Voila Benchmark").
## 2. Evaluation of ASR
As Voila supports multiple tasks, including Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and spoken question answering, we also evaluate its ASR and TTS performance.
For ASR, we evaluate on the LibriSpeech test-clean set, using word error rate (WER) as the metric. Without the LibriSpeech train split, Voila attains a WER of 4.8%, outperforming the 5.7% reported by Moshi. When the LibriSpeech train split is used for training, Voila reaches a WER of 2.7%.
| Model | LibriSpeech test-clean (WER) |
|-------|-----------------------|
|Whisper large v2|2.7|
|Whisper large v3|2.2|
|FastConformer|3.6|
|VoxtLM |2.7|
|Moshi |5.7|
|**Voila (w/o LibriSpeech train split)** |**4.8**|
|**Voila (with LibriSpeech train split)**|**2.7**|
_(lower is better)_
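
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The following is a minimal sketch of that definition, not the evaluation script behind the numbers above. The text normalization (lowercasing and whitespace splitting) and the corpus-level aggregation are assumptions; the exact protocol is not given here.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance, keeping one DP row at a time."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def normalize(text):
    """Assumed normalization: lowercase and split on whitespace."""
    return text.lower().split()


def word_error_rate(reference, hypothesis):
    """WER of a single utterance."""
    ref_words = normalize(reference)
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, normalize(hypothesis)) / len(ref_words)


def corpus_wer(pairs):
    """WER over (reference, hypothesis) pairs: total edits / total reference words."""
    total_edits = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref_words = normalize(reference)
        total_edits += edit_distance(ref_words, normalize(hypothesis))
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references must contain at least one word")
    return total_edits / total_words
```

For example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words. The TTS evaluation in the next section uses the same metric, applied to transcripts of the generated audio.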
## 3. Evaluation of TTS
For TTS, we follow the evaluation protocol proposed in Vall-E: the generated audio is transcribed with HuBERT-Large, and WER is computed on the resulting transcript.
Voila again achieves the lowest WER among the compared models, at 3.2% (2.8% when using LibriSpeech training data).
| Model | LibriSpeech test-clean (WER) |
|-------|-----------------------|
|YourTTS |7.7|
|Vall-E|5.9|
|Moshi|4.7|
|**Voila (w/o LibriSpeech train split)** |**3.2**|
|**Voila (with LibriSpeech train split)** |**2.8**|
_(lower is better)_
# 📝 Citation
If you find our work helpful, please cite us.
```
@article{voila2025,
author = {Yemin Shi and Yu Shu and Siwei Dong and Guangyi Liu and Jaward Sesay and Jingwen Li and Zhiting Hu},
title = {Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Roleplay},
eprint={2505.02707},
archivePrefix={arXiv},
primaryClass={cs.CL},
year = {2025}
}
```
Provider: maas
Created: 2025-05-08



