LJSpeech-1.1-48kHz High-Definition Speech Synthesis Dataset

Name: LJSpeech-1.1-48kHz High-Definition Speech Synthesis Dataset
Creator: maas
Published: 2026-05-21 19:45:20
License: No description available

ModelScope community. Updated: 2026-05-21. Indexed: 2025-01-18.

Download link:

https://modelscope.cn/datasets/iic/LJSpeech-1.1-48kHz

Resource overview:

The LJSpeech-1.1 dataset is widely used in text-to-speech (TTS) synthesis and other speech processing tasks. This release enhances it with a speech super-resolution algorithm: the original audio, sampled at 22,050 Hz, has been upsampled to 48,000 Hz with the ClearerVoice-Studio tool, providing a high-fidelity version for advanced audio processing tasks.

**Download Methods**

- SDK download

```python
# Validate the SDK token
from modelscope.hub.api import HubApi
api = HubApi()
api.login('<YOUR_SDK_TOKEN>')

# Download the dataset
from modelscope.msdatasets import MsDataset
ds = MsDataset.load('iic/LJSpeech-1.1-48kHz')
# Configure subset_name and split as needed; see the "Quick Usage" example code
```

- Git clone (make sure Git LFS is installed)

```sh
git lfs install
git clone https://oauth2:<YOUR_GIT_TOKEN>@www.modelscope.cn/datasets/iic/LJSpeech-1.1-48kHz.git
```

- Hugging Face download link: https://huggingface.co/datasets/alibabasglab/LJSpeech-1.1-48kHz

**Key Features**

- High-Resolution Audio: The dataset provides audio files at a sampling rate of 48,000 Hz, improving perceptual quality with richer high-frequency detail.
- Original Content Integrity: Retains the original linguistic content and annotation structure, ensuring compatibility with existing workflows.
- Broader Application Scope: Suitable for professional-grade audio synthesis, TTS systems, and other high-quality audio applications.
- Open Source: Freely available for academic and research purposes, facilitating innovation in the speech and audio domains.

**Original Dataset**

- Source: The original LJSpeech-1.1 dataset contains 13,100 audio clips of a single female speaker reading excerpts from public-domain books.
- Duration: Approximately 24 hours of speech data.
- Annotations: Each audio clip is paired with a corresponding text transcription.
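The figures quoted above imply a few derived quantities that are handy when sizing buffers or choosing a resampler. The following is a minimal sketch using only the numbers stated in this description (the 24-hour total is approximate, so the average clip length is too); the variable names are illustrative.

```python
from fractions import Fraction

ORIG_SR = 22_050     # Hz, sampling rate of the original LJSpeech-1.1
TARGET_SR = 48_000   # Hz, sampling rate of this release
NUM_CLIPS = 13_100   # clips in the original dataset
TOTAL_HOURS = 24     # approximate total duration

# Exact upsampling ratio, reduced to lowest terms.
ratio = Fraction(TARGET_SR, ORIG_SR)

# Highest frequency each sampling rate can represent (Nyquist limit).
orig_nyquist = ORIG_SR / 2
target_nyquist = TARGET_SR / 2

# Average clip length implied by the stated totals.
avg_clip_seconds = TOTAL_HOURS * 3600 / NUM_CLIPS

print(ratio)                      # 320/147
print(orig_nyquist)               # 11025.0
print(target_nyquist)             # 24000.0
print(f"{avg_clip_seconds:.1f}")  # 6.6
```

The original audio carries no content above 11,025 Hz, so everything between that limit and 24,000 Hz in the 48 kHz files is reconstructed by the super-resolution model rather than recorded.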
**Super-Resolution Processing**

The original 22,050 Hz audio is processed using a state-of-the-art MossFormer2-based speech super-resolution model. This model incorporates the following technologies:

- Advanced Neural Architecture: Combines Transformer-based sequence modeling and HiFi-GAN convolutional generative networks.
- Perceptual Optimization: Adopts a loss function specifically designed to preserve the naturalness and clarity of speech.
- High-Frequency Reconstruction: The algorithm is specifically optimized to recover lost high-frequency components, ensuring smooth and artifact-free enhanced results.

**Output Format**

- Sampling Rate: 48,000 Hz
- Audio Format: WAV
- Bit Depth: 16-bit
- Channel Configuration: Mono

**Application Scenarios**

- Text-to-Speech (TTS) Synthesis
  - Training High-Fidelity TTS Systems: Generates more realistic speech outputs.
  - Supporting Emotional Expression: Enables TTS synthesis with greater emotion and expressiveness.
- Speech Super-Resolution Benchmarking
  - Serving as a Reference Dataset for Super-Resolution Algorithms: Used to evaluate the performance of speech super-resolution models.
  - Providing a Standardized Benchmark for Perceptual Quality: Helping advance speech processing technologies.
- Audio Enhancement and Restoration
  - Restoring Low-Resolution or Degraded Speech Signals: Meeting the requirements of professional applications.
  - Creating High-Quality Dubbing and Narration: Used for multimedia project production.

**File Structure**

The dataset retains the directory structure of the original LJSpeech-1.1 for ease of use:

    LJSpeech-1.1-48kHz/
    ├── metadata.csv   # Text transcription and audio file mapping
    ├── wavs/          # Directory containing 48,000 Hz WAV files
    └── LICENSE.txt    # License information

**License Agreement**

The LJSpeech-1.1 high-resolution dataset is released under the open-source license of the original LJSpeech-1.1 dataset. Users may freely use, modify, and share the dataset for academic and non-commercial purposes, provided that appropriate attribution is given.
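After downloading, it can be worth confirming that the files match the stated output format. The sketch below walks `metadata.csv` and checks each WAV header with the standard-library `wave` module. It assumes the metadata keeps the original LJSpeech-1.1 layout (pipe-delimited rows with the clip ID first and the raw transcription second) and that clips are stored as `wavs/<clip_id>.wav`; neither detail is confirmed by this page, and the helper names `read_metadata`, `wav_format`, and `find_format_mismatches` are hypothetical.

```python
import csv
import wave
from pathlib import Path

# Stated output format: 48,000 Hz, 16-bit (2 bytes per sample), mono.
EXPECTED = {"framerate": 48_000, "sampwidth": 2, "nchannels": 1}


def read_metadata(root):
    """Yield (clip_id, transcription) pairs from <root>/metadata.csv.

    Assumption: pipe-delimited rows, clip ID first, raw transcription second,
    as in the original LJSpeech-1.1.
    """
    with open(Path(root) / "metadata.csv", encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside transcriptions untouched.
        for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            if row:
                yield row[0], (row[1] if len(row) > 1 else "")


def wav_format(path):
    """Read sampling rate, sample width, and channel count from a WAV header."""
    with wave.open(str(path), "rb") as w:
        return {
            "framerate": w.getframerate(),
            "sampwidth": w.getsampwidth(),
            "nchannels": w.getnchannels(),
        }


def find_format_mismatches(root, limit=None):
    """Return [(clip_id, actual_format)] for clips that differ from EXPECTED.

    `limit` restricts the check to the first N metadata rows (None = all).
    """
    mismatches = []
    for i, (clip_id, _text) in enumerate(read_metadata(root)):
        if limit is not None and i >= limit:
            break
        fmt = wav_format(Path(root) / "wavs" / f"{clip_id}.wav")
        if fmt != EXPECTED:
            mismatches.append((clip_id, fmt))
    return mismatches
```

For a quick spot check, `find_format_mismatches("LJSpeech-1.1-48kHz", limit=100)` inspects only the first 100 clips; an empty list means every checked header matches the stated format. Only headers are read, so this does not validate audio quality.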

Provider:

maas

Created:

2025-01-14

Collection summary

Dataset introduction