LJSpeech-1.1-48kHz High-Definition Speech Synthesis Dataset
ModelScope community; updated 2026-05-21; indexed 2025-01-18
Download link:
https://modelscope.cn/datasets/iic/LJSpeech-1.1-48kHz
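Resource overview:
The LJSpeech-1.1 dataset is widely used in text-to-speech (TTS) synthesis and other speech processing tasks. This release enhances it further with a speech super-resolution algorithm.
The original dataset has a sampling rate of 22,050 Hz. It has been upsampled to 48,000 Hz with the ClearerVoice-Studio tool, providing a high-fidelity version for advanced audio processing tasks.
**Download Methods**
- SDK download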
```python
# Authenticate with your SDK token
from modelscope.hub.api import HubApi
api = HubApi()
api.login('YOUR_SDK_TOKEN')  # placeholder: replace with your own ModelScope SDK token
# Download the dataset
from modelscope.msdatasets import MsDataset
ds = MsDataset.load('iic/LJSpeech-1.1-48kHz')
# Configure subset_name and split as needed; see the "Quick Usage" example code
```
- Git clone (make sure Git LFS is installed correctly)
```sh
git lfs install
git clone https://oauth2:YOUR_GIT_TOKEN@www.modelscope.cn/datasets/iic/LJSpeech-1.1-48kHz.git
```
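- Hugging Face download link: https://huggingface.co/datasets/alibabasglab/LJSpeech-1.1-48kHz
**Key Features**
- High-resolution audio: the audio files are provided at a 48,000 Hz sampling rate, which improves perceptual quality and adds richer high-frequency detail.
- Original content integrity: the original linguistic content and annotation structure are retained, so the dataset stays compatible with existing workflows.
- Broader application scope: suitable for professional-grade audio synthesis, TTS systems, and other high-quality audio applications.
- Open source: freely available for academic and research purposes, to encourage innovation in speech and audio.
**Original Dataset**
- Source: the original LJSpeech-1.1 dataset contains 13,100 audio clips of a single female speaker reading excerpts from public-domain books.
- Duration: approximately 24 hours of speech.
- Annotations: each audio clip is paired with its text transcription.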
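As a quick sanity check on these figures, the two numbers above imply an average clip length of a few seconds. The sketch below only restates that arithmetic; the function name is illustrative, and the 24-hour total is the approximate value quoted above, not a measured one.

```python
# Figures quoted in the dataset description above.
NUM_CLIPS = 13_100
TOTAL_HOURS = 24  # approximate total duration


def average_clip_seconds(num_clips: int, total_hours: float) -> float:
    """Return the mean clip duration in seconds."""
    return total_hours * 3600 / num_clips


print(f"{average_clip_seconds(NUM_CLIPS, TOTAL_HOURS):.1f} s per clip on average")
```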
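**Super-Resolution Processing**
The original 22,050 Hz audio is processed with a state-of-the-art MossFormer2-based speech super-resolution model. The model uses the following techniques:
- Advanced neural architecture: combines Transformer-based sequence modeling with a HiFi-GAN convolutional generative network.
- Perceptual optimization: uses loss functions designed to preserve the naturalness and clarity of speech.
- High-frequency reconstruction: the algorithm is optimized specifically to recover lost high-frequency components, for smooth, artifact-free enhancement.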
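To make the high-frequency reconstruction task concrete, the following sketch computes the band the model has to fill in. It relies only on the Nyquist limit (half the sampling rate) and the two rates given above; the function names are illustrative and are not part of ClearerVoice-Studio.

```python
from fractions import Fraction

SOURCE_RATE_HZ = 22_050
TARGET_RATE_HZ = 48_000


def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency representable at a given sampling rate."""
    return sample_rate_hz / 2


def resampling_ratio(source_hz: int, target_hz: int) -> Fraction:
    """Exact target/source rate ratio, reduced to lowest terms."""
    return Fraction(target_hz, source_hz)


source_limit = nyquist_hz(SOURCE_RATE_HZ)
target_limit = nyquist_hz(TARGET_RATE_HZ)
print(f"Rate ratio: {resampling_ratio(SOURCE_RATE_HZ, TARGET_RATE_HZ)}")
print(f"Band to reconstruct: {source_limit:.0f} Hz to {target_limit:.0f} Hz")
```

Plain resampling can change the rate, but it cannot add content above the original 11,025 Hz limit, which is why a generative model is used for the upper band.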
**Output Format**
- Sampling rate: 48,000 Hz
- Audio format: WAV
- Bit depth: 16-bit
- Channel configuration: mono
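A downloaded file can be checked against this format with Python's standard `wave` module. This is a minimal sketch, not an official validation tool: the helper names and the `EXPECTED` dictionary are introduced here for illustration.

```python
import wave

# Expected header values from the format list above (16-bit = 2 bytes per sample).
EXPECTED = {"framerate": 48_000, "sampwidth": 2, "nchannels": 1}


def check_wav_format(path: str) -> dict:
    """Return {field: (actual, expected)} for every header field that differs."""
    with wave.open(path, "rb") as wav:
        actual = {
            "framerate": wav.getframerate(),
            "sampwidth": wav.getsampwidth(),
            "nchannels": wav.getnchannels(),
        }
    return {k: (actual[k], v) for k, v in EXPECTED.items() if actual[k] != v}


def pcm_bytes_per_second(framerate: int = 48_000, sampwidth: int = 2,
                         nchannels: int = 1) -> int:
    """Uncompressed PCM data rate, excluding the WAV header."""
    return framerate * sampwidth * nchannels
```

An empty dictionary from `check_wav_format` means the file matches. For example, `check_wav_format("LJSpeech-1.1-48kHz/wavs/LJ001-0001.wav")` would check one clip, assuming the file naming of the original LJSpeech-1.1 release. At 96,000 bytes per second, roughly 24 hours of audio comes to about 8.3 GB of sample data.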
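**Use Cases**
- Text-to-speech (TTS) synthesis
  - Training high-fidelity TTS systems: produce more realistic speech output.
  - Supporting emotional expression: enable more emotional and expressive TTS synthesis.
- Speech super-resolution benchmarking
  - Reference dataset for super-resolution algorithms: evaluate the performance of speech super-resolution models.
  - Standardized benchmark for perceptual quality: help advance speech processing technology.
- Audio enhancement and restoration
  - Restoring low-resolution or degraded speech signals: meet the needs of professional applications.
  - Creating high-quality dubbing and narration: for multimedia production.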
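**File Structure**
The dataset keeps the directory structure of the original LJSpeech-1.1 for ease of use:
LJSpeech-1.1-48kHz/
├── metadata.csv # Text transcriptions and audio file mapping
├── wavs/ # Directory containing the 48,000 Hz WAV files
└── LICENSE.txt # License information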
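Given this layout, transcripts can be paired with their audio files in a few lines. The sketch below assumes `metadata.csv` follows the original LJSpeech-1.1 convention: pipe-delimited rows of clip ID, transcription, and normalized transcription, with no header row. The page above does not spell out the column layout, so treat that as an assumption; `load_metadata` is an illustrative helper, not part of any SDK.

```python
import csv
from pathlib import Path


def load_metadata(root: str) -> list[dict]:
    """Read metadata.csv and pair each transcript with its WAV path."""
    root_dir = Path(root)
    entries = []
    with open(root_dir / "metadata.csv", encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside transcripts as literal text.
        for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            clip_id, text = row[0], row[1]
            # Assumed third column: normalized transcription (fall back to raw text).
            normalized = row[2] if len(row) > 2 else text
            entries.append({
                "id": clip_id,
                "text": text,
                "normalized_text": normalized,
                "wav_path": root_dir / "wavs" / f"{clip_id}.wav",
            })
    return entries
```

Calling `load_metadata("LJSpeech-1.1-48kHz")` on a local copy should return one entry per clip; comparing `len(entries)` with the 13,100 clips stated above is a simple completeness check.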
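**License**
The LJSpeech-1.1 high-resolution dataset is released under the open-source license of the original LJSpeech-1.1 dataset. Users may freely use, modify, and share the dataset for academic and non-commercial purposes, provided that appropriate attribution is given.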
Provider:
maas
Created:
2025-01-14
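Collected summary
Dataset introduction

Background and challenges
Background overview
LJSpeech-1.1-48kHz is a high-definition upgrade of the LJSpeech-1.1 dataset: a super-resolution algorithm raises the sampling rate from 22,050 Hz to 48,000 Hz to provide high-fidelity audio. It contains 13,100 audio clips totalling about 24 hours, suits text-to-speech synthesis and advanced audio processing tasks, and preserves the original content and structure.
The above content was collected and summarized by 遇见数据集.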



