
echodict/NeMo

Hugging Face | Updated 2026-04-10 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/echodict/NeMo
Resource description:
[![Project Status: Active -- The project has reached a stable, usable state and is being actively developed.](http://www.repostatus.org/badges/latest/active.svg)](http://www.repostatus.org/#active)
[![Documentation](https://readthedocs.com/projects/nvidia-nemo/badge/?version=main)](https://docs.nvidia.com/nemo/speech/nightly/starthere/intro.html)
[![CodeQL](https://github.com/nvidia/nemo/actions/workflows/codeql.yml/badge.svg?branch=main&event=push)](https://github.com/nvidia/nemo/actions/workflows/codeql.yml)
[![NeMo core license and license for collections in this repo](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://github.com/NVIDIA/NeMo/blob/master/LICENSE)
[![Release version](https://badge.fury.io/py/nemo-toolkit.svg)](https://badge.fury.io/py/nemo-toolkit)
[![Python version](https://img.shields.io/pypi/pyversions/nemo-toolkit.svg)](https://badge.fury.io/py/nemo-toolkit)
[![PyPi total downloads](https://static.pepy.tech/personalized-badge/nemo-toolkit?period=total&units=international_system&left_color=grey&right_color=brightgreen&left_text=downloads)](https://pepy.tech/project/nemo-toolkit)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

# **NVIDIA NeMo Speech**

Check out our [HuggingFace🤗 collection](https://huggingface.co/collections/nvidia/nemotron-speech) for the latest open weight checkpoints and demos!

## Updates

- 2026-03: [Nemotron 3 VoiceChat](https://build.nvidia.com/nvidia/nemotron-voicechat/modelcard) is now released in Early Access. Built on the Nemotron Nano v2 LLM backbone with Nemotron speech and TTS decoder, VoiceChat delivers full-duplex, natural, interruptible conversations with low latency. Try out [the demo](https://build.nvidia.com/nvidia/nemotron-voicechat) and apply for [early access](https://developer.nvidia.com/nemotron-voicechat-early-access).
- 2026-03: [Nemotron-Speech-Streaming v2603](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b) has been updated. It has been trained on a larger and more diverse corpus, resulting in lower WER across all latency modes. Try out [the demo](https://huggingface.co/spaces/nvidia/nemotron-speech-streaming-en-0.6b) and check out [the NIM](https://build.nvidia.com/nvidia/nemotron-asr-streaming).
- 2026-03: [MagpieTTS v2602](https://huggingface.co/nvidia/magpie_tts_multilingual_357m) has been released with support for 9 languages (En, Es, De, Fr, Vi, It, Zh, Hi, Ja). Try out [the demo](https://huggingface.co/nvidia/magpie_tts_multilingual_357m) and check out [the NIM](https://build.nvidia.com/nvidia/magpie-tts-multilingual).
- 2026-01: Nemotron-Speech-Streaming was released: one checkpoint that lets users pick their optimal point on the latency-accuracy Pareto curve!
- 2026-01: MagpieTTS was released.
- 2026: This repo has pivoted to focus on audio, speech, and multimodal LLM. For the last NeMo release with support for more modalities, see [v2.7.0](https://github.com/NVIDIA-NeMo/NeMo/releases/tag/v2.7.0).
- 2025-08: [Parakeet V3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) and [Canary V2](https://huggingface.co/nvidia/canary-1b-v2) have been released with speech recognition and translation support for 25 European languages.
- 2025-06: [Canary-Qwen-2.5B](https://huggingface.co/nvidia/canary-qwen-2.5b) has been released with record-setting 5.63% WER on English Open ASR Leaderboard.

## Introduction

NVIDIA NeMo Speech is built for researchers and PyTorch developers working on speech models, including Automatic Speech Recognition (ASR), Text to Speech (TTS), and Speech LLMs. It is designed to help you efficiently create, customize, and deploy new AI models by leveraging existing code and pre-trained model checkpoints.

For technical documentation, please see the [NeMo Framework User Guide](https://docs.nvidia.com/nemo/speech/nightly/starthere/intro.html).

## Requirements

- Python 3.12 or above
- PyTorch 2.6 or above
- NVIDIA GPU (if you intend to do model training)

As of [PyTorch 2.6](https://docs.pytorch.org/docs/stable/notes/serialization.html#torch-load-with-weights-only-true), `torch.load` defaults to using `weights_only=True`. Some model checkpoints may require using `weights_only=False`. In this case, you can set the env var `TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1` before running code that uses `torch.load`. However, this should only be done with trusted files. Loading files from untrusted sources with `weights_only=False` carries a risk of arbitrary code execution.

## Developer Documentation

| Version | Status | Description |
| ------- | ------ | ----------- |
| Latest | [![Documentation Status](https://readthedocs.com/projects/nvidia-nemo/badge/?version=main)](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/) | [Documentation of the latest (i.e. main) branch.](https://docs.nvidia.com/nemo/speech/nightly/starthere/intro.html) |
| Stable | [![Documentation Status](https://readthedocs.com/projects/nvidia-nemo/badge/?version=stable)](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/) | Documentation of the stable (i.e. most recent release) - To be added |

## Install NeMo Speech

NeMo Speech is installable via pip: `pip install 'nemo-toolkit[all]'`

To install with extra dependencies for CUDA 12.x or 13.x, use `pip install 'nemo-toolkit[all,cu12]'` or `pip install 'nemo-toolkit[all,cu13]'` respectively.

## Contribute to NeMo

We welcome community contributions! Please refer to [CONTRIBUTING.md](https://github.com/NVIDIA-NeMo/NeMo/blob/main/CONTRIBUTING.md) for the process.

## Licenses

NeMo is licensed under the [Apache License 2.0](https://github.com/NVIDIA/NeMo?tab=Apache-2.0-1-ov-file).
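
The README's note on `torch.load` suggests setting `TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1` for trusted checkpoints that need `weights_only=False`. The sketch below shows one way to keep that override narrow by setting the variable only around a single load and restoring the previous value afterwards. The helper name `allow_full_unpickle` is hypothetical and not part of NeMo or PyTorch. It also assumes the variable is read when `torch.load` is called; if that does not hold for your PyTorch version, set the variable before starting the process, as the README describes.

```python
import os
from contextlib import contextmanager

ENV_NAME = "TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD"  # variable named in the README


@contextmanager
def allow_full_unpickle():
    """Temporarily set the override, then restore whatever was there before.

    Hypothetical helper for illustration. Use it only around loads of
    checkpoints you trust, since full unpickling can execute arbitrary code.
    """
    previous = os.environ.get(ENV_NAME)
    os.environ[ENV_NAME] = "1"
    try:
        yield
    finally:
        if previous is None:
            # The variable was not set before, so remove it again.
            os.environ.pop(ENV_NAME, None)
        else:
            os.environ[ENV_NAME] = previous
```

A call site would wrap only the trusted load, for example `with allow_full_unpickle(): ...`, so that any other `torch.load` call in the same process keeps the safer `weights_only=True` default.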
Provided by:
echodict
Collected summary
Dataset introduction
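Construction
NVIDIA NeMo Speech is an open-source toolkit for speech model research and development, built and maintained by NVIDIA. It is built on the PyTorch deep learning framework and bundles many pre-trained model checkpoints and reusable code modules, with the aim of simplifying development for automatic speech recognition (ASR), text-to-speech (TTS), and speech large language models. Its modular architecture lets components be combined flexibly, so researchers and developers can customize and deploy new models efficiently from existing resources. NeMo Speech is also updated continuously with recent work such as Nemotron streaming speech recognition and multilingual TTS models, reflecting its ongoing evolution and ecosystem integration in speech AI.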
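Usage
To use NeMo Speech, the environment needs Python 3.12 or above and PyTorch 2.6 or above, plus an NVIDIA GPU if you intend to train models. Installation is a single command, `pip install 'nemo-toolkit[all]'`, which installs the full feature set; for a specific CUDA version, add the `cu12` or `cu13` extra. The toolkit comes with developer documentation and a community contribution guide, and users can follow the official user guide for model customization, training, and inference. Pre-trained checkpoints can be downloaded from the HuggingFace collection and used directly. When loading files, follow the security advice and avoid checkpoints from untrusted sources, which can carry a risk of arbitrary code execution.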
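
The version requirements above can be checked before installing. The following is a minimal sketch using only the Python standard library; the function names `parse_release`, `meets_minimum`, and `check_environment` are hypothetical helpers written for this illustration, and the version parsing is deliberately simple rather than a full implementation of Python's version specification.

```python
import sys
from importlib import metadata


def parse_release(version: str) -> tuple[int, ...]:
    """Return the leading numeric components, e.g. '2.6.0+cu124' -> (2, 6, 0)."""
    parts = []
    # Drop any local build tag such as '+cu124' before splitting on dots.
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version: str, minimum: tuple[int, ...]) -> bool:
    """Compare a version string against a minimum such as (2, 6)."""
    return parse_release(version) >= minimum


def check_environment() -> dict[str, bool]:
    """Report whether Python >= 3.12 and an installed torch >= 2.6 are present."""
    report = {"python": sys.version_info[:2] >= (3, 12)}
    try:
        report["torch"] = meets_minimum(metadata.version("torch"), (2, 6))
    except metadata.PackageNotFoundError:
        # torch is not installed in this environment.
        report["torch"] = False
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'does not meet the stated minimum'}")
```

This check does not cover the GPU, which the README lists as needed only for model training.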
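Background and challenges
Background overview
NVIDIA NeMo Speech is an open-source toolkit from NVIDIA for researchers and PyTorch developers working on automatic speech recognition (ASR), text-to-speech (TTS), and speech large language models (Speech LLMs). Its core research problem is building efficient, customizable speech AI models and advancing full-duplex, low-latency voice interaction, as in recent models such as Nemotron 3 VoiceChat and Nemotron-Speech-Streaming. NeMo supports multiple languages (MagpieTTS covers 9, for example), and Nemotron-Speech-Streaming lets users choose their operating point on the latency-accuracy Pareto curve, which gives real-time speech applications a flexible deployment option. Backed by NVIDIA's compute resources, it has set a benchmark for open-source tooling in speech and has accelerated speech AI research and commercialization.
Current challenges
Domain challenges that NeMo addresses: 1) Traditional ASR systems are not robust enough in cross-lingual and low-resource settings; extending coverage to 25 European languages and adding multilingual TTS improves generalization. 2) Latency and accuracy are hard to balance in voice interaction; streaming models such as Nemotron-Speech-Streaming enable real-time conversation while keeping word error rate (WER) low. Challenges in building it: 1) cleaning and annotating large multimodal datasets so that the training corpus is diverse and high quality; 2) model security, for example guarding against malicious code execution through `torch.load`, which requires developers to handle untrusted weight files with care; 3) maintaining compatibility with different CUDA versions (such as 12.x and 13.x) to fit varied hardware environments.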
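
Word error rate (WER) is the metric this page quotes for ASR accuracy. As a point of reference, the sketch below computes WER in its standard textbook form: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is an illustration only, not NeMo's own implementation; the function name `word_error_rate` is hypothetical, and it splits on whitespace without the text normalization that leaderboards usually apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = cur[j - 1] + 1  # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.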
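Common scenarios
Classic use cases
The classic use of NeMo is as a complete research and development framework for automatic speech recognition (ASR), text-to-speech (TTS), and speech large language models (Speech LLMs). Researchers use its pre-trained model checkpoints and modular codebase to build and customize speech AI systems efficiently, in particular for training and fine-tuning on multilingual, multi-scenario speech data.
Academic problems addressed
NeMo addresses the long-standing problems of fragmented resources and difficult reproduction in speech AI. By packaging implementations of core tasks such as ASR, TTS, and speech translation in one place, it lowers the barrier to research and lets researchers focus on model architecture and cross-lingual generalization. Its out-of-the-box usability supports academic work on challenges such as low-resource language modeling and noise robustness.
Practical applications
In industry, NeMo supports scenarios such as full-duplex dialogue for intelligent customer service, real-time speech transcription, and multilingual audio content generation. Companies use its models for accurate speech-to-text conversion and natural-sounding speech synthesis in virtual assistants, accessible communication, education platforms, and entertainment, which improves the fluency and inclusiveness of human-computer interaction.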
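Recent research
Latest research directions
The latest research directions for NeMo center on combining full-duplex voice interaction with multimodal large language models. Nemotron 3 VoiceChat provides low-latency, interruptible, natural conversation; Nemotron-Speech-Streaming lets users pick an operating point on the latency-accuracy Pareto curve; and MagpieTTS supports speech synthesis in nine languages. Together they move the platform from standalone recognition and synthesis toward end-to-end, multilingual, real-time interaction. In addition, Canary-Qwen-2.5B set a record 5.63% word error rate on the English Open ASR Leaderboard, which demonstrates the model's accuracy on speech recognition and reinforces NVIDIA's position in speech AI research and industrial deployment.
The content above was collected and summarized by 遇见数据集.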