
rishchen/ukrainian-tts-audiobook-pani-nina-parquet

Hugging Face · Updated 2026-04-01 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/rishchen/ukrainian-tts-audiobook-pani-nina-parquet
Description:
---
license: cc-by-nc-sa-4.0
task_categories:
- text-to-speech
language:
- uk
size_categories:
- 100K<n<1M
---

# Ukrainian TTS audiobook dataset (Parquet)

Segmented Ukrainian audiobook speech with aligned text, prepared for training and evaluating Text-to-Speech (TTS) models.

The dataset is published as Hugging Face-compatible Parquet shards so the Hub **Dataset Preview** can render an `audio` column.

The dataset was prepared using [whisper](https://github.com/openai/whisper) and [ffmpeg](https://www.ffmpeg.org/):

- Whisper was used for transcription and approximate segment timing.
- FFmpeg was used to slice audio into short utterances (roughly 2-10 seconds).

## Motivation / use case

- Train Ukrainian TTS / speech synthesis models on long-form narrated speech.
- Use a simple tabular format (`audio` + `text` + metadata) that works with `datasets`.
- Keep a reversible pack/unpack workflow for Parquet distribution.

## Dataset format

Each example is one utterance:

- `id` (`int64`): sequential index (0..N-1)
- `path` (`string`): normalized relative path used by HF Audio
- `audio` (`Audio`): Hugging Face audio feature stored as struct `{bytes, path}`
- `original_path` (`string`): exact original path value from source metadata
- `text` (`string`): Ukrainian transcript
- `text_normalaised` (`string`): normalized transcript (if available)
- `text_phonemized` (`string`): phonemized transcript (if available)
- `text_normalaised_phonemized` (`string`): normalized+phonemized transcript (if available)
- `duration` (`float32`): seconds
- `wer` (`float32`): quality proxy from ASR alignment
- `source` (`string`): original source recording name

## Dataset stats

- Rows: `138,447`
- Total duration: `~143.6 hours`
- Sources: `39` unique recordings (see `source` field)
- Audio format: mono, PCM16, 16 kHz WAV

## Install deps

```bash
python -m pip install pyarrow datasets huggingface_hub
```

## Load with huggingface_hub (preferable)

```python
import os
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="rishchen/ukrainian-tts-audiobook-pani-nina-parquet",
    repo_type="dataset",
    local_dir="hf_parquet",
    allow_patterns=["*"],
    token=os.getenv("HF_TOKEN"),
)
```

## Load with `datasets`

From the Hub:

```python
from datasets import load_dataset

ds = load_dataset("rishchen/ukrainian-tts-audiobook-pani-nina-parquet", split="train")
```

From local Parquet shards:

```python
from datasets import load_dataset

ds = load_dataset("parquet", data_files={"train": "train_parquet/*.parquet"})["train"]
```

## Pack into Parquet

Default behavior packs wav bytes into parquet (`audio.bytes`) so shards are self-contained:

```bash
python make_hf_parquet.py \
  --split-dir train \
  --path-key auto \
  --rows-per-shard 10000 \
  --output-dir train_parquet \
  --overwrite
```

If you want path-only parquet (smaller files), disable embedding:

```bash
python make_hf_parquet.py \
  --split-dir train \
  --path-key auto \
  --rows-per-shard 10000 \
  --no-embed-audio-bytes \
  --output-dir train_parquet \
  --overwrite
```

## Unpack from Parquet

This restores audio files and regenerates metadata with matching original path key/value:

```bash
python unpack_hf_parquet.py \
  --input-dir train_parquet \
  --output-dir train_unpacked \
  --overwrite
```

For path-only parquet, also provide original audio root:

```bash
python unpack_hf_parquet.py \
  --input-dir train_parquet \
  --output-dir train_unpacked \
  --source-audio-root train \
  --overwrite
```
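## Example: filtering utterances by `wer` and `duration`

The `wer` and `duration` columns described in the dataset format can be used to pick a cleaner training subset. The sketch below is not part of the original card: the helper names `keep_utterance` and `total_hours`, the sample rows, and all thresholds are illustrative assumptions. In particular, the card does not state the scale of `wer`, so the `0.1` cutoff assumes a 0..1 fraction and should be checked against the real data.

```python
from typing import Iterable


def keep_utterance(
    wer: float,
    duration: float,
    max_wer: float = 0.1,        # assumed cutoff; the wer scale is not documented
    min_seconds: float = 2.0,    # lower end of the "roughly 2-10 seconds" range
    max_seconds: float = 10.0,   # upper end of the same range
) -> bool:
    """Return True when an utterance passes the assumed quality thresholds."""
    return wer <= max_wer and min_seconds <= duration <= max_seconds


def total_hours(durations: Iterable[float]) -> float:
    """Sum durations given in seconds and convert the result to hours."""
    return sum(durations) / 3600.0


# Made-up rows that only mimic the documented schema (id, duration, wer).
rows = [
    {"id": 0, "duration": 3.5, "wer": 0.00},
    {"id": 1, "duration": 1.2, "wer": 0.00},  # shorter than min_seconds
    {"id": 2, "duration": 6.0, "wer": 0.35},  # wer above max_wer
    {"id": 3, "duration": 9.5, "wer": 0.05},
]

kept = [r for r in rows if keep_utterance(r["wer"], r["duration"])]
kept_ids = [r["id"] for r in kept]
kept_hours = total_hours(r["duration"] for r in kept)

# Sanity check on the published stats: ~143.6 hours over 138,447 rows
# gives a mean utterance length of about 3.73 seconds.
mean_seconds = 143.6 * 3600 / 138_447

print(kept_ids, round(kept_hours, 5), round(mean_seconds, 2))
```

With a dataset loaded through `datasets`, the same predicate could likely be applied as `ds.filter(keep_utterance, input_columns=["wer", "duration"])`, so that the function receives only those two column values for each row.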
Provider:
rishchen