
ecker/libritts-small

Hugging Face | Updated 2023-03-24 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/ecker/libritts-small
Description:
# LibriSpeech-Finetuning for VALL-E

Included is a dataset I've prepared for training with [my fork of a VALL-E implementation](https://git.ecker.tech/mrq/vall-e), sourced from [LibriSpeech-Finetuning](https://dl.fbaipublicfiles.com/librilight/data/librispeech_finetuning.tgz).

> What makes this different?

I've trimmed the audio down to train against it more effectively, as overly large pieces of data drastically increase VRAM use:

* I re-transcribed the audio using [m-bain/WhisperX](https://github.com/m-bain/whisperX/)'s large-v2 model with the VAD filter to get near-perfect timestamps.
* I then bias each segment's start by -0.05 seconds and its end by 0.05 seconds.
* Very short segments are merged with the preceding ones to avoid fragmenting too much.
* The source audio is then sliced according to each segment, and each segment is phonemized using [bootphon/phonemizer](https://github.com/bootphon/phonemizer/) (espeak backend).
* Finally, the sliced audio is quantized using Encodec for VALL-E's use.

This helps alleviate the problem of the default `max_phoneme` length ignoring a large chunk of the dataset, and distributes segment lengths relatively evenly.
Provider:
ecker
Summary of original information

Dataset overview

Data source

Data processing

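  • The data was re-transcribed using m-bain/WhisperX's large-v2 model, with the VAD filter applied to obtain near-perfect timestamps.
  • Each start time is offset by -0.05 seconds and each end time by 0.05 seconds.
  • Very short segments are merged with the preceding segment to avoid excessive fragmentation.
  • The source audio is sliced according to each segment, and each segment is phonemized using bootphon/phonemizer (espeak backend).
  • Finally, the sliced audio is quantized using Encodec for use with VALL-E.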
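
The timestamp offset and merge steps above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the `Segment` type, the function names, and the `min_duration` threshold of 1.0 second are all assumptions, since the source does not say how short "very short" is. Only the 0.05-second offset comes from the description.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


def pad_and_merge(segments, pad=0.05, min_duration=1.0, audio_duration=None):
    """Widen each segment by `pad` seconds on both sides, then fold any
    segment shorter than `min_duration` into the segment before it.

    `min_duration` is a hypothetical threshold chosen for illustration.
    """
    merged = []
    for seg in segments:
        start = max(0.0, seg.start - pad)  # never go before the start of the file
        end = seg.end + pad
        if audio_duration is not None:
            end = min(audio_duration, end)  # never go past the end of the file
        current = Segment(start, end, seg.text)
        if merged and (current.end - current.start) < min_duration:
            # Too short on its own: extend the previous segment to cover it.
            prev = merged[-1]
            merged[-1] = Segment(prev.start, current.end, f"{prev.text} {current.text}")
        else:
            merged.append(current)
    return merged


def to_sample_range(segment, sample_rate):
    """Convert a segment's times into (first, last) sample indices for slicing."""
    return round(segment.start * sample_rate), round(segment.end * sample_rate)


segments = [
    Segment(0.02, 2.0, "hello there"),
    Segment(2.1, 2.4, "yes"),
    Segment(3.0, 5.0, "general kenobi"),
]
result = pad_and_merge(segments)
```

In this sketch a short segment with nothing before it is kept as-is, and each resulting sample range would then be used to slice the source waveform before phonemization and quantization.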

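Purpose of processing

  • This processing addresses the problem of the default max_phoneme length causing a large portion of the dataset to be ignored, and it distributes segment lengths relatively evenly.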
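
To make the effect of a phoneme-length cap concrete, here is a small sketch that measures how much of a dataset survives such a filter. The function name, the limit of 150, and the phoneme counts are all hypothetical; the source does not state the actual default value of max_phoneme or whether the comparison is inclusive.

```python
def kept_fraction(phoneme_lengths, max_phonemes):
    """Return the fraction of samples whose phoneme count fits within the limit.

    Assumes an inclusive limit (n <= max_phonemes), which is a guess.
    """
    if not phoneme_lengths:
        return 0.0
    kept = sum(1 for n in phoneme_lengths if n <= max_phonemes)
    return kept / len(phoneme_lengths)


# Made-up phoneme counts per sample, for illustration only.
untrimmed = [40, 180, 220, 310, 95]
trimmed = [40, 90, 90, 110, 110, 100, 105, 105, 95]
limit = 150  # hypothetical cap

untrimmed_kept = kept_fraction(untrimmed, limit)
trimmed_kept = kept_fraction(trimmed, limit)
```

With these invented numbers, most of the long untrimmed samples fall over the cap and would be skipped, while the shorter trimmed segments all pass.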