
Voila-Benchmark

ModelScope community | Updated 2025-11-27 | Indexed 2025-05-10
Download link:
https://modelscope.cn/datasets/maitrix-org/Voila-Benchmark
Description:
<p align="center">
    <img src="https://voila.maitrix.org/static/images/logo.png" width="400"/><br/>
    <b>Voila: <span style="color:#ca00f9">Voi</span>ce-<span style="color:#ca00f9">La</span>nguage Foundation Models</b><br/><br/>
    💜 <a href="https://voila.maitrix.org"><b>Project Page</b></a> &nbsp;&nbsp; | &nbsp;&nbsp; 🖥️ <a href="https://github.com/maitrix-org/Voila">GitHub</a> &nbsp;&nbsp; | &nbsp;&nbsp; 🤗 <a href="https://huggingface.co/collections/maitrix-org/voila-67e0d96962c19f221fc73fa5">Hugging Face</a> &nbsp;&nbsp; | &nbsp;&nbsp; 📑 <a href="http://arxiv.org/abs/2505.02707">Paper</a> &nbsp;&nbsp; | &nbsp;&nbsp; 🌐 <a href="https://huggingface.co/spaces/maitrix-org/Voila-demo">Online Demo</a> &nbsp;&nbsp; | &nbsp;&nbsp; 🏠 <a href="https://maitrix.org">Maitrix.org</a>
</p>

Voila is a new family of large voice-language foundation models aiming to lift human-AI interaction experiences to the next level. Breaking away from the constraints of traditional voice AI systems (high latency, loss of vocal nuances, and mechanical responses), Voila employs an innovative end-to-end model design and a novel hierarchical Transformer architecture. This approach enables real-time, autonomous, and rich voice interactions, with latency as low as 195 ms, surpassing average human response times. Combining advanced voice and language modeling, Voila offers customizable, persona-driven engagements and excels in a range of audio tasks from ASR and TTS to speech translation across six languages. With the online [web demo](https://huggingface.co/spaces/maitrix-org/Voila-demo), Voila invites you to explore a transformative, natural dialogue experience between human and AI.

# ✨ Highlights
- ⭐ High-fidelity, low-latency, real-time streaming audio processing
- ⭐ Effective integration of voice and language modeling capabilities
- ⭐ Millions of pre-built and custom voices, fast voice switching during conversation
- ⭐ Unified model for various audio tasks

# 🎥 Video Demo
[![Voila Demo](https://img.youtube.com/vi/J27M9-g5KL0/0.jpg)](https://www.youtube.com/watch?v=J27M9-g5KL0)

# 🔥 Latest News!!
* April 28, 2025: 👋 We've released the inference code and model weights of Voila.

# ⚙️ Foundation Models
| Model | Description | Download Link |
|--------|-----------|-----------------|
|Voila-base|Voila base model|https://huggingface.co/maitrix-org/Voila-base|
|Voila-Chat|End-to-end audio chat model|https://huggingface.co/maitrix-org/Voila-chat|
|Voila-Autonomous (preview)|Full-duplex audio chat model|https://huggingface.co/maitrix-org/Voila-autonomous-preview|
|Voila-Audio-alpha|Empowering LLM with raw audio input|https://huggingface.co/maitrix-org/Voila-audio-alpha|
|Voila-Tokenizer|Audio tokenizer|https://huggingface.co/maitrix-org/Voila-Tokenizer|

## Usage
### CLI demo
```shell
for model_name in "maitrix-org/Voila-audio-alpha" "maitrix-org/Voila-base" "maitrix-org/Voila-chat"; do
    # Text chat
    python infer.py \
        --model-name "${model_name}" \
        --instruction "" \
        --input-text "Hello" \
        --task-type chat_tito
    # Voice chat
    python infer.py \
        --model-name "${model_name}" \
        --instruction "" \
        --input-audio "examples/test1.mp3" \
        --task-type chat_aiao
done

# Autonomous mode
python infer.py \
    --model-name "maitrix-org/Voila-autonomous-preview" \
    --instruction "" \
    --input-audio "examples/test_autonomous1.mp3" \
    --task-type chat_aiao_auto
```

### Gradio demo
```shell
python gradio_demo.py
```

For more information, please refer to the [code repository](https://github.com/maitrix-org/Voila).

# 📁 Datasets
We publish the following two datasets: Voila Benchmark and Voila Voice Library.
Voila Benchmark is a novel speech evaluation benchmark, while Voila Voice Library provides millions of pre-built and customizable voices.

| Dataset | Description | Download Link |
|--------|-----------|-----------------|
|Voila Benchmark| Evaluation of Voila Benchmark | https://huggingface.co/datasets/maitrix-org/Voila-Benchmark |
|Voila Voice Library| Millions of pre-built voices | https://huggingface.co/datasets/maitrix-org/Voila-million-voice |

# 📊 Benchmark
## 1. Voila Benchmark
We introduce a novel speech evaluation benchmark called the Voila Benchmark. The Voila Benchmark is constructed by sampling from five widely used language model evaluation datasets: MMLU, MATH, OpenAI HumanEval, NQ-Open, and GSM8k. We compare our results with SpeechGPT and Moshi.

| Model | Voila Benchmark |
|-------|----------------|
|SpeechGPT| 13.29 |
|Moshi | 11.45 |
|**Voila** | **30.56** |

_(higher is better)_

For detailed scores of Voila Benchmark on each specific domain, please refer to our paper (Section 5.1 "Evaluation of Voila Benchmark").

## 2. Evaluation of ASR
As Voila supports multiple tasks, including Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and spoken question answering, we also evaluate the performance of ASR and TTS.

For ASR, we assess performance on the LibriSpeech test-clean dataset, using Word Error Rate (WER) as our metric. Voila attains a WER of 4.8%, outperforming the 5.7% reported by Moshi. In scenarios where both models utilize LibriSpeech training data, Voila achieves a WER of 2.7%.

| Model | LibriSpeech test-clean WER (%) |
|-------|-----------------------|
|Whisper large v2|2.7|
|Whisper large v3|2.2|
|FastConformer|3.6|
|VoxtLM |2.7|
|Moshi |5.7|
|**Voila (w/o LibriSpeech train split)** |**4.8**|
|**Voila (with LibriSpeech train split)**|**2.7**|

_(lower is better)_

## 3. Evaluation of TTS
For TTS, we follow the evaluation metrics proposed in Vall-E, which involve transcribing the generated audio using HuBERT-Large.
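
Both the ASR and TTS evaluations report Word Error Rate. As a rough illustration only (this is not the evaluation script behind the numbers in these tables), WER can be computed as the word-level edit distance between reference and hypothesis transcripts, summed over all utterances and divided by the total number of reference words. The helper names `normalize`, `edit_distance`, and `corpus_wer`, and the lowercase-and-split normalization, are assumptions made for this sketch; the exact text normalization used in the paper is not described here.

```python
from typing import Iterable, List, Tuple


def normalize(text: str) -> List[str]:
    # Assumed normalization for this sketch: lowercase and split on whitespace.
    return text.lower().split()


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    # prev[j] holds the distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Total word errors divided by total reference words, over (reference, hypothesis) pairs."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref, hyp = normalize(reference), normalize(hypothesis)
        errors += edit_distance(ref, hyp)
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("at least one reference word is required")
    return errors / ref_words


if __name__ == "__main__":
    # Made-up transcripts, used only to exercise the functions.
    examples = [
        ("the cat sat on the mat", "the cat sat on mat"),
        ("hello world", "hello word"),
    ]
    print(f"WER: {corpus_wer(examples):.2%}")
```

For the TTS setting described above, the hypothesis would be the ASR transcript of the synthesized audio and the reference would be the text given to the TTS model.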
Voila once again leads with a WER of 3.2% (and 2.8% when using LibriSpeech training data).

| Model | LibriSpeech test-clean WER (%) |
|-------|-----------------------|
|YourTTS |7.7|
|Vall-E|5.9|
|Moshi|4.7|
|**Voila (w/o LibriSpeech train split)** |**3.2**|
|**Voila (with LibriSpeech train split)** |**2.8**|

_(lower is better)_

# 📝 Citation
If you find our work helpful, please cite us.

```
@article{voila2025,
  author = {Yemin Shi and Yu Shu and Siwei Dong and Guangyi Liu and Jaward Sesay and Jingwen Li and Zhiting Hu},
  title = {Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Roleplay},
  eprint = {2505.02707},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  year = {2025}
}
```

Provider:
maas
Created:
2025-05-08