rishchen/ukrainian-tts-audiobook-pani-nina-parquet
Source: Hugging Face. Updated 2026-04-01; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/rishchen/ukrainian-tts-audiobook-pani-nina-parquet
Description:
---
license: cc-by-nc-sa-4.0
task_categories:
- text-to-speech
language:
- uk
size_categories:
- 100K<n<1M
---
# Ukrainian TTS audiobook dataset (Parquet)
Segmented Ukrainian audiobook speech with aligned text, prepared for training and evaluating Text-to-Speech (TTS) models.
The dataset is published as Hugging Face-compatible Parquet shards so the Hub **Dataset Preview** can render an `audio` column.
The dataset was prepared with [Whisper](https://github.com/openai/whisper) and [FFmpeg](https://www.ffmpeg.org/):
- Whisper was used for transcription and approximate segment timing.
- FFmpeg was used to slice audio into short utterances (roughly 2-10 seconds).
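The slicing step described above can be sketched as follows. This is an illustration, not the author's actual script: the helper names `ffmpeg_slice_cmd` and `plan_slices`, the file name `book01.mp3`, and the sample segments are all hypothetical. It assumes segments shaped like the `segments` list Whisper returns (`start`, `end`, `text`) and only builds the FFmpeg command lines; it does not run them.

```python
from pathlib import Path


def ffmpeg_slice_cmd(src, dst, start, end):
    """Build an ffmpeg command that cuts the span [start, end] (seconds) from src.

    The output is forced to mono, 16 kHz, 16-bit PCM WAV.
    """
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-ss", f"{start:.3f}",        # seek to the segment start
        "-t", f"{end - start:.3f}",   # keep only the segment duration
        "-ac", "1",                   # mono
        "-ar", "16000",               # 16 kHz
        "-c:a", "pcm_s16le",          # PCM16
        str(dst),
    ]


def plan_slices(src, segments, out_dir, min_s=2.0, max_s=10.0):
    """Yield (command, text) pairs for segments whose length is within bounds."""
    out_dir = Path(out_dir)
    for i, seg in enumerate(segments):
        length = seg["end"] - seg["start"]
        if min_s <= length <= max_s:
            dst = out_dir / f"{Path(src).stem}_{i:06d}.wav"
            yield ffmpeg_slice_cmd(src, dst, seg["start"], seg["end"]), seg["text"].strip()


# Made-up segments in the shape Whisper returns under result["segments"].
segments = [
    {"start": 0.0, "end": 3.2, "text": " Перший фрагмент."},
    {"start": 3.2, "end": 4.1, "text": " Закороткий."},
    {"start": 4.1, "end": 9.6, "text": " Другий фрагмент."},
]
plan = list(plan_slices("book01.mp3", segments, "train"))
```

Each command in `plan` could then be passed to `subprocess.run(cmd, check=True)`. The 2 to 10 second bounds mirror the range quoted above; the real pipeline may have merged or split segments differently.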
## Motivation / use case
- Train Ukrainian TTS / speech synthesis models on long-form narrated speech.
- Use a simple tabular format (`audio` + `text` + metadata) that works with `datasets`.
- Keep a reversible pack/unpack workflow for Parquet distribution.
## Dataset format
Each example is one utterance:
- `id` (`int64`): sequential index (0..N-1)
- `path` (`string`): normalized relative path used by HF Audio
- `audio` (`Audio`): Hugging Face audio feature stored as struct `{bytes, path}`
- `original_path` (`string`): exact original path value from source metadata
- `text` (`string`): Ukrainian transcript
- `text_normalaised` (`string`): normalized transcript (if available)
- `text_phonemized` (`string`): phonemized transcript (if available)
- `text_normalaised_phonemized` (`string`): normalized+phonemized transcript (if available)
- `duration` (`float32`): seconds
- `wer` (`float32`): word error rate from ASR alignment, usable as a quality proxy (lower is better)
- `source` (`string`): original source recording name
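When a shard is read without the `datasets` audio decoder (for example directly with `pyarrow`), the `audio` column is the raw `{bytes, path}` struct. A minimal sketch of inspecting those bytes with the standard library follows. The helper names and the silent test clip are made up for illustration, and the sketch assumes `audio.bytes` holds a complete WAV file.

```python
import io
import wave


def wav_info(wav_bytes):
    """Return (channels, sample_width_bytes, sample_rate, duration_seconds)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return wf.getnchannels(), wf.getsampwidth(), rate, frames / rate


def make_silence(seconds, rate=16000):
    """Create an in-memory mono PCM16 WAV of silence (stand-in for real data)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


# Hypothetical row using the column names listed above.
row = {"id": 0, "audio": {"bytes": make_silence(2.5), "path": "example.wav"}, "duration": 2.5}

channels, width, rate, seconds = wav_info(row["audio"]["bytes"])
# Cross-check the decoded length against the stored `duration` column.
duration_matches = abs(seconds - row["duration"]) < 0.01
```

On real rows, a mismatch between the decoded length and `duration` would point to a packing or metadata problem.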
## Dataset stats
- Rows: `138,447`
- Total duration: `~143.6 hours`
- Sources: `39` unique recordings (see `source` field)
- Audio format: mono, PCM16, 16 kHz WAV
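As a sanity check on these figures, the mean utterance length and the approximate uncompressed audio size can be derived from the numbers above. This is back-of-the-envelope arithmetic: it takes the rounded total duration at face value and ignores WAV headers and Parquet overhead.

```python
ROWS = 138_447
TOTAL_HOURS = 143.6      # approximate total duration from the stats above
SAMPLE_RATE = 16_000     # Hz
BYTES_PER_SAMPLE = 2     # PCM16
CHANNELS = 1             # mono

total_seconds = TOTAL_HOURS * 3600
mean_utterance_s = total_seconds / ROWS
pcm_bytes = total_seconds * SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS

print(f"mean utterance: {mean_utterance_s:.2f} s")        # about 3.73 s
print(f"raw PCM payload: {pcm_bytes / 1e9:.1f} GB")       # about 16.5 GB
```

A mean of roughly 3.7 seconds is consistent with the 2-10 second slicing range, and the payload estimate gives a rough lower bound on the disk space needed for self-contained shards.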
## Install deps
```bash
python -m pip install pyarrow datasets huggingface_hub
```
## Download with `huggingface_hub` (preferred)
```python
import os

from huggingface_hub import snapshot_download

# Download every file in the dataset repo into ./hf_parquet.
snapshot_download(
    repo_id="rishchen/ukrainian-tts-audiobook-pani-nina-parquet",
    repo_type="dataset",
    local_dir="hf_parquet",
    allow_patterns=["*"],
    token=os.getenv("HF_TOKEN"),  # read from the environment if set
)
```
## Load with `datasets`
From the Hub:
```python
from datasets import load_dataset
ds = load_dataset("rishchen/ukrainian-tts-audiobook-pani-nina-parquet", split="train")
```
From local Parquet shards:
```python
from datasets import load_dataset
ds = load_dataset("parquet", data_files={"train": "train_parquet/*.parquet"})["train"]
```
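Once `ds` is loaded, the `wer` and `duration` columns can be used to select cleaner utterances for training. The sketch below is one possible filter, not a recommendation from the dataset author: the `keep_row` helper and its thresholds (`max_wer=0.1`, 2 to 10 seconds) are assumptions, and the sample rows are made up so the example runs without downloading anything.

```python
def keep_row(row, max_wer=0.1, min_s=2.0, max_s=10.0):
    """Return True for rows with low WER and a duration inside [min_s, max_s]."""
    wer = row.get("wer")
    duration = row.get("duration")
    if wer is None or duration is None:
        return False
    return wer <= max_wer and min_s <= duration <= max_s


# Made-up rows using the dataset's column names; real rows come from `ds`.
rows = [
    {"id": 0, "duration": 3.4, "wer": 0.00},
    {"id": 1, "duration": 5.1, "wer": 0.35},   # too many alignment errors
    {"id": 2, "duration": 9.8, "wer": 0.05},
    {"id": 3, "duration": 1.2, "wer": 0.00},   # too short
]
kept_ids = [r["id"] for r in rows if keep_row(r)]
print(kept_ids)  # [0, 2]
```

With a loaded dataset, `ds.filter(keep_row)` applies the same predicate row by row.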
## Pack into Parquet
By default, the script embeds the WAV bytes in the Parquet shards (`audio.bytes`), so each shard is self-contained:
```bash
python make_hf_parquet.py \
  --split-dir train \
  --path-key auto \
  --rows-per-shard 10000 \
  --output-dir train_parquet \
  --overwrite
```
For path-only Parquet (smaller files that reference the audio instead of containing it), disable embedding:
```bash
python make_hf_parquet.py \
  --split-dir train \
  --path-key auto \
  --rows-per-shard 10000 \
  --no-embed-audio-bytes \
  --output-dir train_parquet \
  --overwrite
```
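For a rough idea of the output, the number of shards follows from the row count and `--rows-per-shard`. This assumes the script fills shards sequentially up to the limit, which the card does not state explicitly.

```python
import math

ROWS = 138_447
ROWS_PER_SHARD = 10_000   # value passed to --rows-per-shard above

shards = math.ceil(ROWS / ROWS_PER_SHARD)
last_shard_rows = ROWS - (shards - 1) * ROWS_PER_SHARD

print(shards, last_shard_rows)  # 14 8447
```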
## Unpack from Parquet
Unpacking restores the audio files and regenerates the metadata, keeping the original path key and value:
```bash
python unpack_hf_parquet.py \
  --input-dir train_parquet \
  --output-dir train_unpacked \
  --overwrite
```
For path-only Parquet, also provide the original audio root:
```bash
python unpack_hf_parquet.py \
  --input-dir train_parquet \
  --output-dir train_unpacked \
  --source-audio-root train \
  --overwrite
```
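One way to confirm that a pack/unpack round trip is lossless is to compare file hashes between the original and unpacked trees. The sketch below is a hypothetical check, not part of the provided scripts. It assumes the unpacked directory mirrors the original relative paths, which may not hold for every `--path-key` setting, and it demonstrates the comparison on temporary directories with fake file contents.

```python
import hashlib
import tempfile
from pathlib import Path


def wav_digests(root):
    """Map each .wav file's path relative to root to its SHA-256 hex digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*.wav"))
    }


def compare_trees(original, unpacked):
    """Return (missing, changed) relative paths, comparing unpacked to original."""
    a, b = wav_digests(original), wav_digests(unpacked)
    missing = sorted(set(a) - set(b))
    changed = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return missing, changed


# Self-contained demo: two small trees, one file deliberately altered.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp, "train"), Path(tmp, "train_unpacked")
    for d in (src, dst):
        (d / "book01").mkdir(parents=True)
    (src / "book01" / "a.wav").write_bytes(b"aaaa")
    (src / "book01" / "b.wav").write_bytes(b"bbbb")
    (dst / "book01" / "a.wav").write_bytes(b"aaaa")
    (dst / "book01" / "b.wav").write_bytes(b"XXXX")
    missing, changed = compare_trees(src, dst)

print(missing, changed)  # [] ['book01/b.wav']
```

On a real round trip, `compare_trees("train", "train_unpacked")` returning two empty lists would indicate that every original WAV file was restored byte for byte.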
Provider:
rishchen



