ecker/libritts-small

Name: ecker/libritts-small
Creator: ecker
Published: 2023-03-24 14:24:16
License: No description available

Hugging Face, updated 2023-03-24, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/ecker/libritts-small

Resource description:

# LibriSpeech-Finetuning for VALL-E

Included is a dataset I've prepared for training with [my fork of a VALL-E implementation](https://git.ecker.tech/mrq/vall-e), sourced from [LibriSpeech-Finetuning](https://dl.fbaipublicfiles.com/librilight/data/librispeech_finetuning.tgz).

> What makes this different?

I've trimmed the audio down to better train against it, as too large a piece of data will increase VRAM use drastically:

* I re-transcribed the audio using [m-bain/WhisperX](https://github.com/m-bain/whisperX/)'s large-v2 model with the VAD filter to get near-perfect timestamps.
* I then bias the start of each segment by -0.05 seconds and the end by 0.05 seconds.
* Very short segments are merged with the preceding ones to avoid fragmenting too much.
* The source audio is then sliced according to each segment, and each segment gets phonemized using [bootphon/phonemizer](https://github.com/bootphon/phonemizer/) (espeak backend).
* Finally, the sliced audio is quantized using Encodec, for VALL-E's use.

This helps alleviate the problem of the default `max_phoneme` length ignoring a large chunk of the dataset, and distributes segment lengths relatively evenly.
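The timestamp biasing and short-segment merging described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual preprocessing script: the `Segment` class, the `pad_and_merge` function, and the `min_duration` threshold of 1.0 second are all hypothetical, since the source does not say what counts as "very short". Only the 0.05 second padding comes from the description.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the beginning of the source audio
    end: float    # seconds from the beginning of the source audio
    text: str


def pad_and_merge(segments, audio_duration, pad=0.05, min_duration=1.0):
    """Pad each segment by `pad` seconds on both sides, then merge any
    segment shorter than `min_duration` (an assumed threshold) into the
    segment that precedes it."""
    merged = []
    for seg in segments:
        # Bias the start earlier and the end later, clamped to the audio bounds.
        start = max(0.0, seg.start - pad)
        end = min(audio_duration, seg.end + pad)
        if merged and (end - start) < min_duration:
            # Too short to stand alone: extend the previous segment instead.
            prev = merged[-1]
            merged[-1] = Segment(prev.start, end, f"{prev.text} {seg.text}")
        else:
            merged.append(Segment(start, end, seg.text))
    return merged


example = pad_and_merge(
    [
        Segment(1.0, 4.0, "hello there"),
        Segment(4.2, 4.5, "yes"),
        Segment(6.0, 9.0, "general kenobi"),
    ],
    audio_duration=10.0,
)
for seg in example:
    print(f"{seg.start:.2f}-{seg.end:.2f}: {seg.text}")
```

Each resulting segment would then be used to slice the source audio before phonemization and Encodec quantization, which are not shown here.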

Provided by:

ecker

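Summary of original information

Dataset overview

Data source

This dataset is derived from LibriSpeech-Finetuning and is intended for training with the author's fork of a VALL-E implementation.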

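Data processing

The audio was re-transcribed using m-bain/WhisperX's large-v2 model, with the VAD filter applied to obtain near-perfect timestamps.
Segment start times are shifted by -0.05 seconds and end times by 0.05 seconds.
Very short segments are merged with the preceding segment to avoid excessive fragmentation.
The source audio is sliced according to each segment, and each segment is phonemized using bootphon/phonemizer (espeak backend).
Finally, the sliced audio is quantized using Encodec for use by VALL-E.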

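Purpose of processing

This processing is meant to alleviate the problem of the default max_phoneme length causing a large part of the dataset to be ignored, and to distribute segment lengths relatively evenly.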
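To see why a phoneme-length cutoff can discard much of a dataset, the sketch below computes the share of samples that survive a given limit. It assumes the trainer simply skips any sample whose phoneme count exceeds the limit, which the source does not spell out. The `kept_fraction` helper and the phoneme counts are invented for illustration and are not measurements of this dataset.

```python
def kept_fraction(phoneme_counts, max_phonemes):
    """Return the fraction of samples whose phoneme count is within the limit."""
    if not phoneme_counts:
        return 0.0
    kept = sum(1 for count in phoneme_counts if count <= max_phonemes)
    return kept / len(phoneme_counts)


# Hypothetical phoneme counts per sample, before and after trimming.
untrimmed = [40, 80, 300, 500]
trimmed = [40, 80, 90, 95, 60, 85]

print(kept_fraction(untrimmed, max_phonemes=100))
print(kept_fraction(trimmed, max_phonemes=100))
```

With long utterances left whole, half of the made-up samples fall over the limit; once they are split into shorter segments, all of them fit.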