hifitts-2
ModelScope community · updated 2026-01-06 · indexed 2025-06-07
Download link:
https://modelscope.cn/datasets/nv-community/hifitts-2
# HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset
## Dataset Description
This repository contains the metadata for HiFiTTS-2, a large-scale speech dataset derived from LibriVox audiobooks. For more details, please refer to [our paper](https://arxiv.org/abs/2506.04152).
The dataset contains metadata for approximately 36.7k hours of audio from 5k speakers that can be downloaded from LibriVox at a 48 kHz sampling rate.
The metadata contains estimated bandwidth, which can be used to infer the original sampling rate the audio was recorded at. The base dataset is filtered for a bandwidth appropriate for training speech models at 22 kHz. We also provide a precomputed subset with 31.7k hours appropriate for 44 kHz training. Users can modify the download script to use any sampling rate and bandwidth threshold which might be more appropriate for their work.
LibriVox audiobooks are not redistributed on Hugging Face. All audio in the dataset can be downloaded from LibriVox, following the instructions below.
### Frequently Asked Questions
- Downloading the 22 kHz version will require approximately 2.8TB of disk space. Downloading the 44 kHz version will require approximately 4.0TB of disk space.
- During download there might be warning messages from "libmpg123". These warnings can be safely ignored. They look like `[src/libmpg123/id3.c:INT123_id3_to_utf8():394] warning: Weird tag size 119 for encoding 1 - I will probably trim too early or something but I think the MP3 is broken.`
- By default, the script will download audio files into the workspace directory under *{workspace_dir}/audio_22khz*. The download will ignore HTTP errors and store information about any failed downloads in *{workspace_dir}/errors_22khz.json*. A new manifest will be created at *{workspace_dir}/manifest_filtered_22khz.json* with utterances from failed audiobooks removed. You can override the default behavior by modifying the [config.yaml file](https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/english/hifitts2/config_22khz.yaml) in your local SDP repository.
- If you want to retry the download for failed audiobooks, rerun the script with the output *errors_22khz.json* file:
```bash
python /home/NeMo-speech-data-processor/main.py \
--config-path="/home/NeMo-speech-data-processor/dataset_configs/english/hifitts2" \
--config-name="config_22khz.yaml" \
workspace_dir="/home/hifitts2" \
chapter_filename="/home/hifitts2/errors_22khz.json" \
max_workers=8
```
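Before starting a download, it can help to confirm that the workspace has enough free space. The following is a minimal sketch, not part of SDP: it uses the approximate figures quoted above, and the helper names `has_enough_space` and `check_workspace` are hypothetical. It assumes "TB" means 10^12 bytes.

```python
import shutil

# Approximate requirements quoted in the FAQ, in terabytes.
REQUIRED_TB = {"22khz": 2.8, "44khz": 4.0}


def has_enough_space(free_bytes: int, version: str = "22khz") -> bool:
    """Return True if free_bytes covers the quoted requirement for the version."""
    # Assumption: "TB" is interpreted as 10**12 bytes.
    required_bytes = REQUIRED_TB[version] * 10**12
    return free_bytes >= required_bytes


def check_workspace(path: str, version: str = "22khz") -> bool:
    """Check free space on the filesystem that contains `path`."""
    free_bytes = shutil.disk_usage(path).free
    return has_enough_space(free_bytes, version)


if __name__ == "__main__":
    # "." stands in for the workspace directory (for example /home/hifitts2).
    print(check_workspace(".", "22khz"))
```

The figures are only estimates, so leaving some headroom beyond them is a reasonable precaution.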
## Dataset Format
The dataset contains an utterance-level manifest with these fields:
- **audio_filepath**: Relative path where utterance is stored
- **speaker**: LibriVox speaker ID
- **set**: Dataset partition, either "train", "test_seen", "dev_seen", "test_unseen", or "dev_unseen"
- **duration**: Duration of utterance
- **bandwidth**: Estimated bandwidth of audiobook chapter containing this utterance
- **speaker_count**: Number of speakers detected in this utterance
- **wer**: ASR word error rate of *normalized_text*
- **cer**: ASR character error rate of *normalized_text*
- **text_source**: Data source text was taken from, either 'book' or 'mls'
- **text**: Original data source transcription
- **normalized_text**: Transcription output by text processing pipeline
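The utterance-level fields above can be used to select a custom subset. The sketch below assumes the manifest stores one JSON object per line (the usual NeMo manifest layout) and that **bandwidth** is expressed in Hz; both are assumptions, as are the helper names and the threshold values, which are arbitrary. The sample records are made up for illustration.

```python
import json
from typing import Iterable, Iterator, List


def read_manifest(lines: Iterable[str]) -> Iterator[dict]:
    """Parse a manifest with one JSON object per non-empty line (assumed layout)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def select_utterances(entries: Iterable[dict], partition: str,
                      min_bandwidth: float, max_wer: float) -> List[dict]:
    """Keep single-speaker utterances in `partition` that pass both thresholds."""
    return [
        e for e in entries
        if e["set"] == partition
        and e["bandwidth"] >= min_bandwidth
        and e["wer"] <= max_wer
        and e["speaker_count"] == 1
    ]


# Made-up records for illustration only; they are not taken from the dataset.
SAMPLE_LINES = [
    '{"audio_filepath": "a.flac", "set": "train", "bandwidth": 15000, "wer": 0.0, "speaker_count": 1}',
    '{"audio_filepath": "b.flac", "set": "train", "bandwidth": 9000, "wer": 0.0, "speaker_count": 1}',
    '{"audio_filepath": "c.flac", "set": "dev_seen", "bandwidth": 16000, "wer": 0.0, "speaker_count": 1}',
    '{"audio_filepath": "d.flac", "set": "train", "bandwidth": 16000, "wer": 0.2, "speaker_count": 1}',
]

kept = select_utterances(read_manifest(SAMPLE_LINES), "train",
                         min_bandwidth=13000, max_wer=0.1)
print([e["audio_filepath"] for e in kept])  # prints ['a.flac']
```

To apply this to a real manifest, pass an open file handle to `read_manifest` in place of `SAMPLE_LINES`.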
The dataset also contains an audiobook chapter-level manifest with these fields:
- **url**: Download URL for the LibriVox audiobook chapter
- **chapter_filepath**: Relative path where audiobook chapter is stored
- **duration**: Duration of chapter
- **bandwidth**: Bandwidth estimated using the first 30 seconds of the chapter
- **utterances**: List of utterance metadata with the following fields
- **utterances.audio_filepath**: Relative path where utterance is stored
- **utterances.offset**: Offset of utterance within the chapter
- **utterances.duration**: Duration of utterance
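The **utterances.offset** and **utterances.duration** fields locate each utterance inside its chapter. As a rough sketch, assuming both values are in seconds (the units are not stated above) and using hypothetical helper names, they can be converted to sample indices for slicing a decoded chapter:

```python
from typing import List, Tuple


def utterance_sample_range(offset: float, duration: float,
                           sample_rate: int) -> Tuple[int, int]:
    """Convert an offset and duration (assumed seconds) to [start, end) samples."""
    start = round(offset * sample_rate)
    end = start + round(duration * sample_rate)
    return start, end


def chapter_segments(chapter: dict, sample_rate: int) -> List[Tuple[str, int, int]]:
    """List (audio_filepath, start, end) for every utterance in a chapter record."""
    return [
        (u["audio_filepath"],
         *utterance_sample_range(u["offset"], u["duration"], sample_rate))
        for u in chapter["utterances"]
    ]


# Made-up chapter record for illustration only.
example_chapter = {
    "chapter_filepath": "chapter.mp3",
    "utterances": [
        {"audio_filepath": "utt_0.flac", "offset": 0.0, "duration": 2.0},
        {"audio_filepath": "utt_1.flac", "offset": 1.5, "duration": 2.0},
    ],
}

print(chapter_segments(example_chapter, 22050))
```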
Bandwidth is estimated from the first 30 seconds of each audiobook chapter using the approach from the [Speech Data Processor (SDP) Toolkit](https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/sdp/processors/modify_manifest/data_to_data.py#L1259). The bandwidth `f_max` is estimated by using the mean of the power spectrum to find the highest frequency whose level is at least -50 dB relative to the peak value of the spectrum, namely,
$$f_{\text{max}} = \max\left\{f \in [0, f_{\text{Nyquist}}] \, \bigg|\, 10 \log_{10} \left(\frac{P(f)}{P_{\text{peak}}}\right) \geq -50\, \text{dB}\right\}$$
where `P(f)` is the power spectral density and `P_peak` the maximum spectral power.
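The threshold rule can be sketched in a few lines of Python. This is an illustration of the formula only, not the SDP implementation: it assumes the mean power spectrum has already been computed and is passed in as parallel lists of frequencies and power values, and the function name `estimate_bandwidth` is hypothetical.

```python
import math
from typing import Sequence


def estimate_bandwidth(freqs: Sequence[float], power: Sequence[float],
                       threshold_db: float = -50.0) -> float:
    """Return the highest frequency whose level is within threshold_db of the peak."""
    if len(freqs) != len(power) or len(power) == 0:
        raise ValueError("freqs and power must be non-empty and of equal length")
    peak = max(power)
    if peak <= 0:
        raise ValueError("power spectrum must contain a positive value")

    f_max = None
    for f, p in zip(freqs, power):
        # Bins with zero power are infinitely far below the peak, so skip them.
        if p <= 0:
            continue
        level_db = 10.0 * math.log10(p / peak)
        if level_db >= threshold_db and (f_max is None or f > f_max):
            f_max = f
    return f_max


# Made-up spectrum: levels relative to the peak are 0, -3, -30, -60 and -90 dB.
example_freqs = [0.0, 1000.0, 2000.0, 3000.0, 4000.0]
example_power = [1.0, 0.5, 1e-3, 1e-6, 1e-9]
print(estimate_bandwidth(example_freqs, example_power))  # prints 2000.0
```

The peak bin always satisfies the condition (0 dB), so the function returns a frequency for any valid spectrum.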
## Download Instructions
1. Download the *manifest.json* and *chapter.json* files corresponding to your desired sampling rate from this Hugging Face repository. Copy these into a workspace directory (in this example */home/hifitts2*).
2. Install NeMo-speech-data-processor (SDP) using the *Installation* instructions on https://github.com/NVIDIA/NeMo-speech-data-processor.
3. Run the SDP script to download the dataset to local disk.
```bash
python /home/NeMo-speech-data-processor/main.py \
--config-path="/home/NeMo-speech-data-processor/dataset_configs/english/hifitts2" \
--config-name="config_22khz.yaml" \
workspace_dir="/home/hifitts2" \
max_workers=8
```
*max_workers* is the number of threads to use for downloading the data. To download the 44 kHz dataset, specify *config_44khz.yaml*.
For further help with downloading, please see the [FAQs](#frequently-asked-questions) or raise an issue on the community tab.
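When scripting downloads for both sampling rates, or retrying failed chapters, the command line can be assembled programmatically. The sketch below only builds the argument list using the config names and overrides shown in this README; the function name `build_download_command` and its parameters are hypothetical, and it does not run SDP itself.

```python
import shlex
from typing import List, Optional


def build_download_command(sdp_root: str, workspace_dir: str,
                           sampling_rate_khz: int = 22, max_workers: int = 8,
                           chapter_filename: Optional[str] = None) -> List[str]:
    """Build the SDP command line for the 22 kHz or 44 kHz configuration."""
    if sampling_rate_khz not in (22, 44):
        raise ValueError("only the 22 kHz and 44 kHz configs are described here")
    cmd = [
        "python", f"{sdp_root}/main.py",
        f"--config-path={sdp_root}/dataset_configs/english/hifitts2",
        f"--config-name=config_{sampling_rate_khz}khz.yaml",
        f"workspace_dir={workspace_dir}",
    ]
    # Optional override used when retrying failed chapters from an errors file.
    if chapter_filename is not None:
        cmd.append(f"chapter_filename={chapter_filename}")
    cmd.append(f"max_workers={max_workers}")
    return cmd


command = build_download_command("/home/NeMo-speech-data-processor",
                                 "/home/hifitts2", sampling_rate_khz=44)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the paths match your local setup.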
## Dataset Owner(s)
NVIDIA Corporation
## Dataset Creation Date
June 2025
## License/Terms of Use
GOVERNING TERMS: This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0) available at https://creativecommons.org/licenses/by/4.0/legalcode.
## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
## Citation
If you find this dataset useful, please cite:
```
@inproceedings{rlangman2025hifitts2,
title={HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset},
author={Ryan Langman and Xuesong Yang and Paarth Neekhara and Shehzeen Hussain and Edresson Casanova and Evelina Bakhturina and Jason Li},
booktitle={Interspeech},
year={2025},
}
```
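Provider: maas
Created: 2025-06-05
Summary (collected and generated by 遇见数据集): HiFiTTS-2 is a large-scale, high-bandwidth speech dataset based on LibriVox audiobooks. It contains metadata for approximately 36.7k hours of audio from 5k speakers at a 48 kHz sampling rate. The dataset provides filtered versions suitable for 22 kHz and 44 kHz training, along with detailed metadata fields and download instructions.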



