
LJSpeech-1.1-48kHz High-Definition Speech Synthesis Dataset

ModelScope community · updated 2026-05-21 · indexed 2025-01-18
Download link:
https://modelscope.cn/datasets/iic/LJSpeech-1.1-48kHz
Resource overview:

The LJSpeech-1.1 dataset is widely used for text-to-speech (TTS) synthesis and other speech processing tasks. This release enhances it with a speech super-resolution algorithm: the original audio, sampled at 22,050 Hz, has been upsampled to 48,000 Hz with the ClearerVoice-Studio tool, giving a high-fidelity version for advanced audio processing tasks.

**Download Methods**

- SDK download

```python
# Validate the SDK token (replace the placeholder with your own token)
from modelscope.hub.api import HubApi
api = HubApi()
api.login('YOUR_SDK_TOKEN')

# Download the dataset
from modelscope.msdatasets import MsDataset
ds = MsDataset.load('iic/LJSpeech-1.1-48kHz')
# You can configure subset_name and split as needed; refer to the "Quick Usage" example code
```

- Git clone (make sure Git LFS is installed; replace the placeholder with your own token)

```sh
git lfs install
git clone https://oauth2:YOUR_GIT_TOKEN@www.modelscope.cn/datasets/iic/LJSpeech-1.1-48kHz.git
```

- Hugging Face download link: https://huggingface.co/datasets/alibabasglab/LJSpeech-1.1-48kHz

**Key Features**

- High-resolution audio: audio files are provided at a sampling rate of 48,000 Hz, improving perceptual quality with richer high-frequency detail.
- Original content integrity: the original linguistic content and annotation structure are retained, so the dataset stays compatible with existing workflows.
- Broader application scope: suitable for professional-grade audio synthesis, TTS systems, and other high-quality audio applications.
- Open source: freely available for academic and research purposes, to encourage innovation in speech and audio.

**Original Dataset**

- Source: the original LJSpeech-1.1 dataset contains 13,100 audio clips of a single female speaker reading excerpts from public-domain books.
- Duration: approximately 24 hours of speech.
- Annotations: each audio clip is paired with a text transcription.

**Super-Resolution Processing**

The original 22,050 Hz audio is processed with a state-of-the-art MossFormer2-based speech super-resolution model. The model uses the following techniques:

- Advanced neural architecture: combines Transformer-based sequence modeling with a HiFi-GAN convolutional generative network.
- Perceptual optimization: uses a loss function designed to preserve the naturalness and clarity of speech.
- High-frequency reconstruction: the algorithm is optimized to recover lost high-frequency components, ensuring smooth, artifact-free enhancement.

**Output Format**

- Sampling rate: 48,000 Hz
- Audio format: WAV
- Bit depth: 16-bit
- Channel configuration: mono

**Application Scenarios**

- Text-to-speech (TTS) synthesis
  - Training high-fidelity TTS systems: generate more realistic speech output.
  - Supporting emotional expression: enable more emotional and expressive TTS synthesis.
- Speech super-resolution benchmarking
  - Reference dataset for super-resolution algorithms: evaluate the performance of speech super-resolution models.
  - Standardized benchmark for perceptual quality: help advance speech processing technology.
- Audio enhancement and restoration
  - Restoring low-resolution or degraded speech signals: meet the requirements of professional applications.
  - Creating high-quality dubbing and narration: for multimedia production.

**File Structure**

The dataset keeps the directory structure of the original LJSpeech-1.1 for ease of use:

```
LJSpeech-1.1-48kHz/
├── metadata.csv   # Text transcriptions and audio file mapping
├── wavs/          # Directory containing the 48,000 Hz WAV files
└── LICENSE.txt    # License information
```

**License Agreement**

The LJSpeech-1.1 high-resolution dataset is released under the open-source license of the original LJSpeech-1.1 dataset. Users may freely use, modify, and share the dataset for academic and non-commercial purposes, provided that appropriate attribution is given.
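
After downloading, it can be useful to confirm that the files match the documented output format (48,000 Hz, 16-bit, mono WAV) and to read the transcriptions. The following is a minimal sketch using only the Python standard library. The helper names (`describe_wav`, `matches_documented_format`, `read_metadata`) are hypothetical, and the parser assumes `metadata.csv` keeps the pipe-delimited layout of the original LJSpeech-1.1 release (clip ID first, then the transcription fields), which this page does not state explicitly.

```python
import csv
import wave


def describe_wav(path):
    """Return (sample_rate_hz, bit_depth, channels, duration_s) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        bits = wf.getsampwidth() * 8      # sample width is reported in bytes
        channels = wf.getnchannels()
        duration = wf.getnframes() / rate
    return rate, bits, channels, duration


def matches_documented_format(path):
    """True if the file is 48,000 Hz, 16-bit, mono, as the dataset card states."""
    rate, bits, channels, _ = describe_wav(path)
    return (rate, bits, channels) == (48000, 16, 1)


def read_metadata(csv_path):
    """Read metadata.csv as (clip_id, [text fields]) pairs.

    Assumption: pipe-delimited rows with no quoting, as in the original
    LJSpeech-1.1 metadata file.
    """
    rows = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for fields in reader:
            if fields:
                rows.append((fields[0], fields[1:]))
    return rows
```

A typical check would loop over the IDs returned by `read_metadata("LJSpeech-1.1-48kHz/metadata.csv")` and call `matches_documented_format` on `wavs/<clip_id>.wav`, reporting any file that deviates from the documented format.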
Provider: maas
Created: 2025-01-14
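Background overview
LJSpeech-1.1-48kHz is a high-definition upgrade of the LJSpeech-1.1 dataset. A super-resolution algorithm raises the sampling rate from 22,050 Hz to 48,000 Hz to provide high-fidelity audio. The dataset contains 13,100 audio clips totaling about 24 hours, is suited to text-to-speech synthesis and advanced audio processing tasks, and keeps the original content and structure.
The summary above was collected and generated by 遇见数据集.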
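
The figures in the summary imply a few quantities that are easy to work out. The sketch below is plain arithmetic on the numbers quoted on this page, not a measurement of the actual files. It derives the exact resampling ratio, the change in representable bandwidth (the Nyquist limit), the mean clip length, and the raw PCM size at 16-bit mono. Because the 24-hour duration is approximate, the last two values are estimates.

```python
from fractions import Fraction

SRC_RATE_HZ = 22_050      # original LJSpeech-1.1 sampling rate
DST_RATE_HZ = 48_000      # sampling rate after super-resolution
NUM_CLIPS = 13_100
APPROX_HOURS = 24         # "approximately 24 hours" per the dataset card
BYTES_PER_SAMPLE = 2      # 16-bit, mono

# Exact rational resampling ratio between the two rates.
ratio = Fraction(DST_RATE_HZ, SRC_RATE_HZ)

# Highest representable frequency is half the sampling rate.
src_nyquist_hz = SRC_RATE_HZ / 2
dst_nyquist_hz = DST_RATE_HZ / 2

# Estimates based on the approximate total duration.
total_seconds = APPROX_HOURS * 3600
mean_clip_s = total_seconds / NUM_CLIPS
pcm_bytes = total_seconds * DST_RATE_HZ * BYTES_PER_SAMPLE

print(f"resampling ratio: {ratio.numerator}/{ratio.denominator}")
print(f"bandwidth: {src_nyquist_hz:.0f} Hz -> {dst_nyquist_hz:.0f} Hz")
print(f"mean clip length: about {mean_clip_s:.1f} s")
print(f"raw PCM size: about {pcm_bytes / 1e9:.1f} GB")
```

The band between 11,025 Hz and 24,000 Hz does not exist in the 22,050 Hz source, so any content there is generated by the super-resolution model and not recovered from a recording.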