ecker/libritts-small
Hugging Face · Updated 2023-03-24 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/ecker/libritts-small
Resource description:
# LibriSpeech-Finetuning for VALL-E
Included is a dataset I've prepared for training with [my fork of a VALL-E implementation](https://git.ecker.tech/mrq/vall-e), sourced from [LibriSpeech-Finetuning](https://dl.fbaipublicfiles.com/librilight/data/librispeech_finetuning.tgz).
>\> What makes this different?
I've trimmed the utterances down so they train better, since overly long samples drastically increase VRAM use:
* I re-transcribed the audio using [m-bain/WhisperX](https://github.com/m-bain/whisperX/)'s large-v2 model with the VAD filter to get near-perfect timestamps.
* I then bias each segment's start by -0.05 seconds and its end by 0.05 seconds.
* Very short segments are merged with the preceding ones to avoid fragmenting too much.
* The source audio is then sliced according to each segment, and each segment gets phonemized using [bootphon/phonemizer](https://github.com/bootphon/phonemizer/) (espeak backend).
* Finally, the sliced audio is quantized using Encodec, for VALL-E's use.
This helps alleviate the problem of the default `max_phoneme` length ignoring a large chunk of the dataset, and distributes segment lengths relatively evenly.
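
The padding and merging steps can be sketched roughly as follows. This is a minimal illustration, not the actual preprocessing script: the `Segment` type, the helper names, and the `MIN_DURATION` threshold are hypothetical (the description does not say what counts as "very short"), and clamping to the audio bounds is an assumption. Only the 0.05 second bias comes from the description above.

```python
from dataclasses import dataclass

PAD = 0.05          # seconds; the bias stated in the description
MIN_DURATION = 1.0  # seconds; hypothetical cutoff for a "very short" segment


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


def pad_segments(segments, total_duration, pad=PAD):
    """Move each start earlier and each end later by `pad`.

    Clamping to [0, total_duration] is an assumption, added so that
    padded segments never point outside the audio file.
    """
    return [
        Segment(max(0.0, s.start - pad), min(total_duration, s.end + pad), s.text)
        for s in segments
    ]


def merge_short(segments, min_duration=MIN_DURATION):
    """Fold any segment shorter than `min_duration` into the preceding one.

    A short segment with no predecessor is kept as-is.
    """
    merged = []
    for seg in segments:
        if merged and (seg.end - seg.start) < min_duration:
            prev = merged[-1]
            merged[-1] = Segment(prev.start, seg.end, f"{prev.text} {seg.text}")
        else:
            merged.append(seg)
    return merged


def to_sample_range(seg, sample_rate):
    """Convert a segment's times to (first, last-exclusive) sample indices."""
    return int(seg.start * sample_rate), int(seg.end * sample_rate)


if __name__ == "__main__":
    raw = [
        Segment(0.02, 2.0, "hello there"),
        Segment(2.1, 2.4, "yes"),
        Segment(3.0, 5.0, "general kenobi"),
    ]
    for seg in merge_short(pad_segments(raw, total_duration=5.02)):
        print(seg, to_sample_range(seg, sample_rate=16000))
```

Each resulting sample range would then be used to slice the waveform before phonemizing the text and quantizing the audio; those steps depend on the phonemizer and Encodec libraries and are left out here.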
Provided by:
ecker
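Summary of original information
Dataset overview
Data source
- This dataset is sourced from LibriSpeech-Finetuning and is intended for training with a fork of a VALL-E implementation.
Data processing
- The data was re-transcribed using m-bain/WhisperX's large-v2 model, with the VAD filter used to obtain near-perfect timestamps.
- Start times are offset by -0.05 seconds and end times by 0.05 seconds.
- Very short segments are merged with the preceding segment to avoid excessive fragmentation.
- The source audio is sliced according to each segment, and each segment is phonemized with bootphon/phonemizer (espeak backend).
- Finally, the sliced audio is quantized with Encodec for use by VALL-E.
Purpose of processing
- This processing addresses the default `max_phoneme` length ignoring a large portion of the dataset, and distributes segment lengths relatively evenly.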



