
rishchen/ukrainian-tts-audiobook-pani-nina-parquet-old

Source: Hugging Face. Updated 2026-03-31. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/rishchen/ukrainian-tts-audiobook-pani-nina-parquet-old
Description:
---
license: cc-by-nc-sa-4.0
task_categories:
- text-to-speech
language:
- uk
size_categories:
- 100K<n<1M
---

# Ukrainian TTS audiobook dataset (Parquet)

Segmented Ukrainian audiobook speech with aligned text, prepared for training and evaluating Text-to-Speech (TTS) models.

The dataset is published as Hugging Face-compatible Parquet shards so the Hub **Dataset Preview** can render an `audio` column.

The segments were made with `whisper` (https://github.com/openai/whisper) and `ffmpeg` (https://www.ffmpeg.org/): Whisper marks the start and end of each stretch of speech and transcribes it, and ffmpeg slices the recording into pieces of roughly 2-10 seconds.

## Motivation / use case

- Train Ukrainian TTS / speech synthesis models on long-form narrated speech.
- Provide a simple tabular format (`audio` + `text` + metadata) that works well with `datasets`.

## Dataset format

Each example is one utterance:

- `id` (`int64`): sequential index (0..N-1)
- `path` (`string`): relative path of the original `.wav` (kept for traceability)
- `audio` (`Audio`): Hugging Face audio feature stored as a struct `{bytes, path}`
- `text` (`string`): Ukrainian transcript
- `duration` (`float32`): length of the utterance in seconds
- `source` (`string`): original source recording name (from `metadata.jsonl`)

## Dataset stats

- Rows: `116,575`
- Total duration: `~114.9 hours`
- Sources: `39` unique recordings (see `source` field)
- Audio: mono, PCM16, 16 kHz WAV

## Install deps

```bash
python -m pip install huggingface_hub==0.30.2 datasets==3.5.0
```

## Load with `datasets`

From the Hub:

```python
from datasets import load_dataset

ds = load_dataset("rishchen/ukrainian-tts-audiobook-pani-nina-parquet", split="train")
```

From local Parquet shards:

```python
from datasets import load_dataset

ds = load_dataset("parquet", data_files={"train": "train/train-*.parquet"})["train"]
```

## Download from Hugging Face

This repo is set up to upload the contents of `train/` as the dataset repository root (so `wavs/...` paths match the Parquet).

```bash
export HF_TOKEN="hf_..."
python download.py
python unpack.py
```

## Notes for Hub Dataset Preview

- The Parquet shards include Hugging Face `datasets` feature metadata so `audio` is recognized as an `Audio` column.
- If you generate *path-only* Parquet (default), keep the referenced `.wav` files in the same repo so `audio.path` resolves.
- If you generate Parquet with `--embed-audio-bytes`, the dataset is self-contained and you can upload only Parquet.
- If you publish both `metadata.jsonl` and Parquet, the Hub may try to load either format; if you want Parquet-only, upload only the shards (or rename/move the raw JSONL out of the dataset root before pushing).
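
## Checking a clip against the stated format

The card states that clips are mono, PCM16, 16 kHz WAV and that each row carries a `duration` in seconds. The sketch below is one possible way to verify that for a single clip using only the Python standard library. It is not part of the dataset's own tooling: the helper names (`describe_wav_bytes`, `matches_card`, `make_silent_wav`) and the 0.05 s tolerance are assumptions made for illustration, and the demo runs on a synthetic silent clip so it works without downloading anything.

```python
import io
import wave


def describe_wav_bytes(wav_bytes: bytes) -> dict:
    """Read the WAV header from raw bytes and report its basic format."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate": rate,
            "duration": frames / rate,
        }


def matches_card(info: dict, listed_duration: float, tolerance: float = 0.05) -> bool:
    """Compare a clip with the format stated in the dataset card:
    mono, PCM16 (2 bytes per sample), 16 kHz, and a matching `duration`.
    The tolerance is an illustrative assumption, not a documented value."""
    return (
        info["channels"] == 1
        and info["sample_width_bytes"] == 2
        and info["sample_rate"] == 16000
        and abs(info["duration"] - listed_duration) <= tolerance
    )


def make_silent_wav(seconds: float, rate: int = 16000) -> bytes:
    """Build a synthetic mono PCM16 clip so the sketch runs without the dataset."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()


if __name__ == "__main__":
    clip = make_silent_wav(2.5)
    info = describe_wav_bytes(clip)
    print(info, matches_card(info, listed_duration=2.5))
```

To apply this to real rows, one option is to disable audio decoding with `ds.cast_column("audio", Audio(decode=False))` (with `Audio` imported from `datasets`) and pass `row["audio"]["bytes"]` together with `row["duration"]`. This assumes the shards embed the audio bytes; for path-only Parquet the `bytes` field may be empty and the `.wav` file would need to be read from `audio.path` instead.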
Provider:
rishchen