rishchen/ukrainian-tts-audiobook-pani-nina-parquet-old
Hugging Face | Updated: 2026-03-31 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/rishchen/ukrainian-tts-audiobook-pani-nina-parquet-old
Description:
---
license: cc-by-nc-sa-4.0
task_categories:
- text-to-speech
language:
- uk
size_categories:
- 100K<n<1M
---
# Ukrainian TTS audiobook dataset (Parquet)
Segmented Ukrainian audiobook speech with aligned text, prepared for training and evaluating Text-to-Speech (TTS) models. The dataset is published as Hugging Face-compatible Parquet shards so the Hub **Dataset Preview** can render an `audio` column.
The dataset was made with `whisper` (https://github.com/openai/whisper) and `ffmpeg` (https://www.ffmpeg.org/): `whisper` detects where speech starts and ends and transcribes it, and `ffmpeg` slices the audio into pieces of roughly 2-10 seconds.
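The slicing step can be sketched roughly as below. This is an illustration only, not the script that produced the dataset: the helper name `build_ffmpeg_cmds`, the output naming scheme, the `wavs` output directory, and the policy of skipping segments outside 2-10 seconds are all assumptions. The segment dictionaries mirror the `start` / `end` / `text` keys that `whisper` exposes under `"segments"` in a `transcribe()` result, and the `ffmpeg` flags request mono, 16 kHz, PCM16 output.

```python
from pathlib import Path

def build_ffmpeg_cmds(src, segments, out_dir="wavs", min_s=2.0, max_s=10.0):
    """Return (ffmpeg_command, transcript) pairs, one per kept segment.

    `segments` is a list of dicts with "start", "end" and "text" keys.
    Segments shorter than min_s or longer than max_s are skipped
    (an assumed policy, not confirmed by the dataset card).
    """
    jobs = []
    stem = Path(src).stem
    for seg in segments:
        length = seg["end"] - seg["start"]
        if not (min_s <= length <= max_s):
            continue
        # Hypothetical naming scheme: <source stem>_<running index>.wav
        out = (Path(out_dir) / f"{stem}_{len(jobs):06d}.wav").as_posix()
        cmd = [
            "ffmpeg", "-y",
            "-i", str(src),
            "-ss", f"{seg['start']:.3f}",   # cut start, in seconds
            "-to", f"{seg['end']:.3f}",     # cut end, in seconds
            "-ac", "1",                     # mono
            "-ar", "16000",                 # 16 kHz
            "-c:a", "pcm_s16le",            # PCM16
            out,
        ]
        jobs.append((cmd, seg["text"].strip()))
    return jobs

# Made-up segments in the shape whisper returns.
segments = [
    {"start": 0.0, "end": 4.2, "text": " Перший фрагмент."},
    {"start": 4.2, "end": 5.0, "text": " Короткий."},
    {"start": 5.0, "end": 12.5, "text": " Другий фрагмент."},
]
jobs = build_ffmpeg_cmds("book01.mp3", segments)
```

Each command in `jobs` could then be passed to `subprocess.run(cmd, check=True)`, with the paired transcript written to a metadata file.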
## Motivation / use case
- Train Ukrainian TTS / speech synthesis models on long-form narrated speech.
- Provide a simple tabular format (`audio` + `text` + metadata) that works well with `datasets`.
## Dataset format
Each example is one utterance:
- `id` (`int64`): sequential index (0..N-1)
- `path` (`string`): relative path of the original `.wav` (kept for traceability)
- `audio` (`Audio`): Hugging Face audio feature stored as a struct `{bytes, path}`
- `text` (`string`): Ukrainian transcript
- `duration` (`float32`): seconds
- `source` (`string`): original source recording name (from `metadata.jsonl`)
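As a quick sanity check of the fields listed above, a row can be validated in plain Python. This is a sketch under assumptions: `check_row` and the example values are hypothetical, and the check targets the raw storage form of `audio` (a `{bytes, path}` struct, as seen when audio decoding is disabled), not a decoded waveform.

```python
# Field names and Python-level types corresponding to the documented schema.
EXPECTED_FIELDS = {
    "id": int,
    "path": str,
    "audio": dict,
    "text": str,
    "duration": float,
    "source": str,
}

def check_row(row):
    """Return a list of human-readable problems; empty means the row looks fine."""
    problems = []
    for name, typ in EXPECTED_FIELDS.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], typ):
            problems.append(
                f"{name}: expected {typ.__name__}, got {type(row[name]).__name__}"
            )
    audio = row.get("audio")
    if isinstance(audio, dict):
        # The raw Audio struct carries both keys; either value may be None.
        for key in ("bytes", "path"):
            if key not in audio:
                problems.append(f"audio: missing key {key}")
    return problems

# Hypothetical row; the values are made up for illustration.
example = {
    "id": 0,
    "path": "wavs/book01_000000.wav",
    "audio": {"bytes": None, "path": "wavs/book01_000000.wav"},
    "text": "Перший фрагмент.",
    "duration": 4.2,
    "source": "book01",
}
problems = check_row(example)
```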
## Dataset stats
- Rows: `116,575`
- Total duration: `~114.9 hours`
- Sources: `39` unique recordings (see `source` field)
- Audio: mono, PCM16, 16 kHz WAV
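The figures above imply a mean utterance length and a raw audio size that can be worked out directly. The short calculation below uses only the numbers listed in this section; the size estimate ignores WAV headers and Parquet overhead, so it is a rough lower bound and not a measured download size.

```python
ROWS = 116_575          # utterances
TOTAL_HOURS = 114.9     # approximate total duration
SOURCES = 39            # unique source recordings
SAMPLE_RATE = 16_000    # Hz
BYTES_PER_SAMPLE = 2    # PCM16, mono

total_seconds = TOTAL_HOURS * 3600
mean_duration = total_seconds / ROWS                    # seconds per utterance
mean_hours_per_source = TOTAL_HOURS / SOURCES           # hours per recording
raw_pcm_bytes = total_seconds * SAMPLE_RATE * BYTES_PER_SAMPLE

print(f"mean utterance length: {mean_duration:.2f} s")
print(f"mean hours per source: {mean_hours_per_source:.2f} h")
print(f"raw PCM payload: {raw_pcm_bytes / 1e9:.1f} GB")
```

The mean comes out at about 3.5 seconds per utterance, which sits inside the ~2-10 second slicing range, and the raw PCM payload is about 13.2 GB.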
## Install deps
```bash
python -m pip install huggingface_hub==0.30.2 datasets==3.5.0
```
## Load with `datasets`
From the Hub:
```python
from datasets import load_dataset
ds = load_dataset("rishchen/ukrainian-tts-audiobook-pani-nina-parquet", split="train")
```
From local Parquet shards:
```python
from datasets import load_dataset
ds = load_dataset("parquet", data_files={"train": "train/train-*.parquet"})["train"]
```
## Download from Hugging Face
This repo is set up to upload the contents of `train/` as the dataset repository root (so `wavs/...` paths match the Parquet).
```bash
export HF_TOKEN="hf_..."
python download.py
python unpack.py
```
## Notes for Hub Dataset Preview
- The Parquet shards include Hugging Face `datasets` feature metadata so `audio` is recognized as an `Audio` column.
- If you generate *path-only* Parquet (default), keep the referenced `.wav` files in the same repo so `audio.path` resolves.
- If you generate Parquet with `--embed-audio-bytes`, the dataset is self-contained and you can upload only Parquet.
- If you publish both `metadata.jsonl` and Parquet, the Hub may try to load either format; if you want Parquet-only, upload only the shards (or rename/move the raw JSONL out of the dataset root before pushing).
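The difference between path-only and embedded Parquet can be checked per row before uploading. The sketch below is an assumption-laden illustration: `audio_storage_mode` and `wavs_to_upload` are hypothetical helpers operating on the raw `{bytes, path}` struct, and are not part of `datasets` or of this repo's scripts.

```python
def audio_storage_mode(audio):
    """Classify a raw audio struct as 'embedded', 'path-only' or 'empty'."""
    if audio.get("bytes"):
        return "embedded"      # self-contained: Parquet alone is enough
    if audio.get("path"):
        return "path-only"     # the referenced .wav must ship with the repo
    return "empty"

def wavs_to_upload(rows):
    """Collect the .wav paths that path-only rows depend on, sorted and unique."""
    return sorted(
        {
            row["audio"]["path"]
            for row in rows
            if audio_storage_mode(row["audio"]) == "path-only"
        }
    )

# Made-up rows covering both storage modes.
rows = [
    {"audio": {"bytes": None, "path": "wavs/a.wav"}},
    {"audio": {"bytes": b"RIFF...", "path": "wavs/b.wav"}},
    {"audio": {"bytes": None, "path": "wavs/c.wav"}},
]
needed = wavs_to_upload(rows)
```

If `needed` is empty, every row carries its own bytes and uploading only the Parquet shards should be sufficient.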
Provider:
rishchen



