eduardem/romanian-speech-v2
Source: Hugging Face. Updated 2026-03-09. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/eduardem/romanian-speech-v2
Resource description:
---
language:
- ro
license: other
license_name: research-only-mixed
task_categories:
- text-to-speech
- automatic-speech-recognition
tags:
- romanian
- speech
- tts
- multi-speaker
- research-only
pretty_name: Romanian Speech v2
size_categories:
- 100K<n<1M
---
> **Research Use Only** — This dataset is released strictly for personal research and educational
> purposes. The processing pipeline and all scripts are fully open source, but the underlying audio
> originates from sources with varying copyrights. Only short fragments were used under fair use
> provisions and EU Copyright Directive Art. 3 (text and data mining for scientific research).
> This dataset must **not** be used for redistribution of the source material, commercial purposes,
> or training commercially deployed models. No complete copyrighted works are reproduced — the
> resulting segments are short utterances (1–15 seconds) from which a model learns Romanian
> phonetics and prosody, not original content.
# Romanian Speech v2
> **This dataset supersedes [romanian-speech-v1](https://huggingface.co/datasets/eduardem/romanian-speech-v1),
> which is now deprecated.** The v1 dataset was used to fine-tune
> [f5-tts-romanian](https://huggingface.co/eduardem/f5-tts-romanian) and
> [xtts-v2-romanian](https://huggingface.co/eduardem/xtts-v2-romanian).
> After training, we discovered audible artifacts in both models — truncated words, misaligned
> sentence boundaries, and inconsistent speaker labeling — traced back to quality issues in the
> v1 dataset. This v2 release is a complete rebuild with a stricter pipeline: dual-pass Whisper
> transcription with Levenshtein validation (>= 0.96), ensemble speaker clustering (ECAPA2 +
> WeSpeaker), F0-based gender verification, and multi-stage audio quality filters. These changes
> eliminate the artifact classes found in v1.
A curated multi-speaker Romanian speech dataset for TTS model fine-tuning and speech research.
| | |
|---|---|
| **Segments** | 247,244 |
| **Total duration** | 360.16 hours |
| **Speakers** | 21 (10 female / 11 male) |
| **Language** | Romanian (ro) |
| **Audio format** | WAV, 16-bit, mono |
| **Segment duration** | 1–15 seconds |
## Speakers
| Speaker | Gender | Segments | Duration | Sample Rate | Bitrate | Source |
|---------|--------|----------|----------|-------------|---------|--------|
| Adrian | Male | 30,111 | 45.18 h | 32 kHz | 512 kbps | Audiobook |
| Ana | Female | 9,225 | 11.50 h | 22.05 kHz | 352 kbps | Audiobook |
| Andreea | Female | 21,209 | 31.60 h | 44.1 kHz | 705 kbps | Audiobook |
| Andrei | Male | 5,487 | 8.67 h | 22.05 kHz | 352 kbps | LibriVox |
| Bogdan | Male | 1,644 | 2.61 h | 22.05 kHz | 352 kbps | Audiobook |
| Catalin | Male | 220 | 0.34 h | 44.1 kHz | 705 kbps | Audiobook |
| Ciprian | Male | 26,712 | 38.12 h | 22.05 kHz | 352 kbps | Audiobook |
| Cristina | Female | 322 | 0.58 h | 22.05 kHz | 352 kbps | Audiobook |
| Diana | Female | 4 | 0.01 h | 44.1 kHz | 705 kbps | Audiobook |
| Dragos | Male | 6,464 | 10.28 h | 48 kHz | 768 kbps | Audiobook |
| Elena | Female | 3,611 | 4.15 h | 48 kHz | 768 kbps | RSS Corpus |
| Ioana | Female | 503 | 0.87 h | 44.1 kHz | 705 kbps | Audiobook |
| Maria | Female | 1,177 | 2.01 h | 22.05 kHz | 352 kbps | LibriVox |
| Marius | Male | 6,357 | 8.43 h | 48 kHz | 768 kbps | Audiobook |
| Mihaela | Female | 18,584 | 25.33 h | 44.1 kHz | 705 kbps | Audiobook |
| Mihai | Male | 36,775 | 59.66 h | 22.05 kHz | 352 kbps | Audiobook |
| Radu | Male | 17,704 | 26.27 h | 44.1 kHz | 705 kbps | Audiobook |
| Raluca | Female | 28,169 | 37.10 h | 48 kHz | 768 kbps | Audiobook |
| Simona | Female | 1,915 | 2.37 h | 44.1 kHz | 705 kbps | Audiobook |
| Stefan | Male | 8,101 | 13.53 h | 22.05 kHz | 352 kbps | Audiobook |
| Vasile | Male | 22,950 | 31.55 h | 44.1 kHz | 705 kbps | Audiobook |
## Methodology
This section describes the full processing pipeline so that it can be replicated.
All scripts are open source.
### Step 0 — Cleanup and Deduplication
- Delete junk files and directories (`@eaDir`, `.nfo`, `.jpg`, `.txt`, `.DS_Store`)
- Compute SHA-256 hash for every audio file
- Remove exact duplicates (keep first alphabetically)
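
The deduplication rule above can be sketched in a few lines. This is a minimal illustration, not the project's actual script: the function names `sha256_of` and `find_duplicates` are hypothetical, and "first alphabetically" is assumed to mean ordering by full path string.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large audiobooks never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths: list[Path]) -> list[Path]:
    """Return files whose content matches an alphabetically earlier file."""
    seen: dict[str, Path] = {}
    duplicates: list[Path] = []
    # Assumption: alphabetical order of the path string decides the keeper.
    for path in sorted(paths, key=str):
        file_hash = sha256_of(path)
        if file_hash in seen:
            duplicates.append(path)
        else:
            seen[file_hash] = path
    return duplicates
```

The returned list contains only the files that would be deleted, so it can be reviewed before anything is removed.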
### Step 1 — Audio Inventory
- Run `ffprobe` on every audio file to extract duration, sample rate, channels, and bitrate
- Reject corrupt files and files shorter than 10 seconds (RSS corpus exempted — its files are pre-segmented utterances of 2–5 seconds)
- For M4B audiobooks: extract individual chapters via ffprobe chapter metadata and inventory each separately
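
As an illustration of the inventory step, the sketch below calls `ffprobe` with JSON output and pulls out the fields listed above. The helper names `run_ffprobe` and `summarize` and the shape of the returned dictionary are assumptions made for this example. A corrupt file would normally surface as a non-zero exit code, which `subprocess.run(..., check=True)` turns into an exception that the caller can treat as a rejection. The RSS exemption is not modeled here.

```python
import json
import subprocess


def run_ffprobe(path: str) -> str:
    """Return ffprobe's JSON report for one file (requires ffprobe on PATH)."""
    cmd = ["ffprobe", "-v", "error", "-print_format", "json",
           "-show_format", "-show_streams", "-show_chapters", path]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def summarize(report_json: str, min_seconds: float = 10.0) -> dict:
    """Extract the inventory fields and apply the minimum-duration gate."""
    report = json.loads(report_json)
    # Use the first audio stream; cover art in M4B files shows up as video.
    audio = next(s for s in report["streams"] if s.get("codec_type") == "audio")
    duration = float(report["format"]["duration"])
    return {
        "duration": duration,
        "sample_rate": int(audio["sample_rate"]),
        "channels": int(audio["channels"]),
        "bit_rate": int(report["format"].get("bit_rate", 0)),
        "chapters": len(report.get("chapters", [])),
        "accepted": duration >= min_seconds,
    }
```

Calling `summarize(run_ffprobe("book.m4b"))` on a hypothetical file would give one inventory row. A non-zero `chapters` count signals that the file should be split by chapter before it is inventoried.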
### Step 2 — Speaker Identification
Speaker clustering uses an **ensemble of two embedding models** to maximize accuracy:
1. **ECAPA2** ([Jenthe/ECAPA2](https://huggingface.co/Jenthe/ECAPA2)) — TorchScript model, 192-dimensional embeddings, 0.34% EER on VoxCeleb
2. **WeSpeaker ResNet293-LM** — ONNX model via `onnxruntime-gpu`, 256-dimensional embeddings, 0.53% EER on VoxCeleb
For each audio file, a 30–60 second sample from the middle is extracted and embedded by both models.
Cosine distance matrices from each model are min-max normalized and averaged.
Agglomerative clustering (average linkage) is applied with a distance threshold of 0.25
(validated on a stable plateau at 0.25–0.35).
Known speakers from public datasets (RSS, LibriVox) are assigned directly.
Speaker gender is verified using F0 (fundamental frequency) analysis.
Low-quality clusters (score < 60) and singleton speakers (only 1 source file) are excluded
— 11 speakers dropped, 21 retained.
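
The fusion and clustering logic described above can be sketched in pure Python. This is a simplified illustration under stated assumptions: the distance matrices are given as nested lists, min-max normalization runs over all entries of each matrix, and the naive `average_linkage` loop stands in for a library implementation (for example scikit-learn's `AgglomerativeClustering` with a precomputed metric). All function names are hypothetical.

```python
def min_max(matrix):
    """Rescale all entries of a distance matrix to the range [0, 1]."""
    values = [v for row in matrix for v in row]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # guard against a constant matrix
    return [[(v - low) / span for v in row] for row in matrix]


def fuse(dist_a, dist_b):
    """Average two min-max normalized distance matrices element-wise."""
    norm_a, norm_b = min_max(dist_a), min_max(dist_b)
    return [[(x + y) / 2 for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(norm_a, norm_b)]


def average_linkage(dist, threshold=0.25):
    """Merge the closest pair of clusters until the smallest average
    inter-cluster distance reaches the threshold."""
    clusters = [[i] for i in range(len(dist))]

    def avg(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [(avg(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        best, a, b = min(pairs)
        if best >= threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Each returned cluster is a list of file indices assigned to one speaker. The loop is cubic in the number of files, so it is only meant to make the procedure concrete.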
### Step 3 — Transcription
- Convert source audio to 16 kHz mono WAV (temporary, on local SSD)
- Transcribe with **Whisper large-v3** via [faster-whisper](https://github.com/SYSTRAN/faster-whisper) on CUDA (float16)
- Settings: `language="ro"`, `beam_size=5`, `vad_filter=True`, `word_timestamps=True`
- Output: per-file JSON with word-level timestamps
- For RSS corpus: existing text transcripts from the corpus are also loaded for later comparison
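
A minimal sketch of the transcription call with the settings listed above, using the faster-whisper `WhisperModel` API. The helper names `load_model` and `transcribe_words` and the layout of the word records are assumptions. The per-file JSON schema actually written by the pipeline is not specified here.

```python
def load_model():
    """Load Whisper large-v3 on CUDA in float16 (needs faster-whisper and a GPU)."""
    from faster_whisper import WhisperModel  # imported lazily: heavy dependency
    return WhisperModel("large-v3", device="cuda", compute_type="float16")


def transcribe_words(model, wav_path: str) -> list[dict]:
    """Transcribe one file and flatten the result into word-level records."""
    segments, _info = model.transcribe(
        wav_path,
        language="ro",
        beam_size=5,
        vad_filter=True,
        word_timestamps=True,
    )
    words = []
    for segment in segments:  # segments is a lazy generator
        for word in segment.words or []:
            words.append({
                "word": word.word.strip(),  # Whisper tokens carry a leading space
                "start": round(word.start, 3),
                "end": round(word.end, 3),
            })
    return words
```

Writing the returned list with `json.dump(words, handle, ensure_ascii=False)` would keep Romanian diacritics readable in the output file.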
### Step 4 — Sentence Splitting
Segments are split at sentence boundaries using word-level timestamps from Whisper:
- **Primary method**: Split at sentence-ending punctuation (`.`, `?`, `!`), respecting Romanian abbreviations (`d-le`, `nr.`, `str.`, `prof.`, etc.) to avoid false splits
- **Fallback**: When sentence splitting produces fewer than 30% usable segments (common with poetry or text without punctuation), fall back to Whisper's natural VAD-based segments
- **RSS corpus**: Pre-segmented utterances are copied directly (already clean single sentences)
- **Filters**: Reject segments shorter than 1 second or longer than 15 seconds, or with fewer than 2 words
- **Silence padding**: 300 ms of silence appended to each segment
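
The primary splitting rule and the length filters can be illustrated as follows. This is a sketch under assumptions: words arrive as dictionaries with `word`, `start`, and `end` keys (a hypothetical layout), the abbreviation set is only the subset named above, and trailing words without final punctuation are kept as their own segment. The VAD fallback and the 300 ms padding are not shown.

```python
SENTENCE_END = (".", "?", "!")
# Illustrative subset only; the pipeline's full abbreviation list is not given here.
ABBREVIATIONS = {"d-le", "nr.", "str.", "prof."}


def split_sentences(words, min_s=1.0, max_s=15.0, min_words=2):
    """Group word records into sentence segments and apply the length filters."""
    segments, current = [], []
    for word in words:
        current.append(word)
        token = word["word"].lower()
        # Split after sentence-ending punctuation unless the token is an abbreviation.
        if token.endswith(SENTENCE_END) and token not in ABBREVIATIONS:
            segments.append(current)
            current = []
    if current:  # assumption: keep trailing words that lack final punctuation
        segments.append(current)

    kept = []
    for seg in segments:
        duration = seg[-1]["end"] - seg[0]["start"]
        if min_s <= duration <= max_s and len(seg) >= min_words:
            kept.append({
                "text": " ".join(w["word"] for w in seg),
                "start": seg[0]["start"],
                "end": seg[-1]["end"],
            })
    return kept
```

The `start` and `end` values of each kept segment are the cut points that would be used to extract the audio from the source file.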
### Step 5 — Transcript Validation
Every segment is **re-transcribed independently** with Whisper large-v3, and the new transcription
is compared against the original (from Step 3/4):
1. Both texts are normalized: lowercased, punctuation stripped, Romanian diacritics normalized
(cedilla forms `ţ/ş` converted to comma-below forms `ț/ș`)
2. Normalized Levenshtein similarity ratio is computed
3. **Segments with ratio < 0.96 are rejected** — this catches truncated words, hallucinated text,
misaligned boundaries, and segments where Whisper was uncertain
4. This dual-pass approach (transcribe the full file, split, then re-transcribe segments) provides
a strong consistency check that neither pass alone can achieve
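
The comparison can be sketched as below. Two points are assumptions rather than confirmed details of the pipeline: the ratio is taken to be `1 - distance / max(len)` on characters (other libraries define the ratio differently), and punctuation is stripped by Unicode category. The function names are hypothetical.

```python
import unicodedata

# Map legacy cedilla letters (t/s with cedilla) to the comma-below forms.
CEDILLA_TO_COMMA = str.maketrans({
    "\u0163": "\u021b", "\u015f": "\u0219",  # lowercase
    "\u0162": "\u021a", "\u015e": "\u0218",  # uppercase
})


def normalize(text: str) -> str:
    """Lowercase, unify diacritics, drop punctuation, collapse whitespace."""
    text = text.translate(CEDILLA_TO_COMMA).lower()
    text = "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))
    return " ".join(text.split())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                    # deletion
                               current[j - 1] + 1,                 # insertion
                               previous[j - 1] + (ch_a != ch_b)))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed ratio definition: 1 - distance / length of the longer text."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def is_valid(first_pass: str, second_pass: str, threshold: float = 0.96) -> bool:
    return similarity(first_pass, second_pass) >= threshold
```

Under this definition, a 50-character sentence tolerates at most two character edits between the two passes, which is why a truncated final word is usually enough to reject a segment.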
### Step 6 — Quality Filters
Each segment passes three audio-quality checks:
| Filter | Threshold | Method |
|--------|-----------|--------|
| Signal-to-noise ratio | >= 15 dB | Energy-based SNR estimation |
| Clipping | < 1% of samples | Count samples at max amplitude |
| Silence ratio | < 50% | Fraction of frames below silence threshold |
Segments failing any check are dropped. Spectral flatness is logged but not used as a hard filter.
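
The three checks could look roughly like the sketch below. Only the thresholds (15 dB, 1%, 50%) come from the table. The rest are assumptions chosen for illustration: samples are floats in [-1, 1], frames are 400 samples long, the silence floor is -40 dBFS, clipping means an absolute value of at least 0.999, and the noise floor is estimated from the quietest 10% of frames. The input must contain at least one full frame.

```python
import math


def frame_powers(samples, frame_len=400):
    """Mean power of consecutive non-overlapping frames."""
    return [sum(x * x for x in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def clipping_ratio(samples, clip_level=0.999):
    """Fraction of samples at (or numerically next to) full scale."""
    return sum(abs(x) >= clip_level for x in samples) / len(samples)


def silence_ratio(powers, silence_db=-40.0):
    """Fraction of frames whose power falls below the assumed silence floor."""
    floor = 10 ** (silence_db / 10)  # dBFS to linear power
    return sum(p < floor for p in powers) / len(powers)


def estimate_snr_db(powers, noise_fraction=0.1):
    """Energy-based estimate: the quietest frames approximate the noise floor."""
    ordered = sorted(powers)
    n_noise = max(1, int(len(ordered) * noise_fraction))
    noise = sum(ordered[:n_noise]) / n_noise
    signal = sum(ordered) / len(ordered)
    return float("inf") if noise == 0 else 10 * math.log10(signal / noise)


def passes_quality(samples):
    """Apply the three hard filters from the table."""
    powers = frame_powers(samples)
    return (estimate_snr_db(powers) >= 15.0
            and clipping_ratio(samples) < 0.01
            and silence_ratio(powers) < 0.5)
```

A segment is dropped as soon as any of the three conditions fails, which matches the rule stated above.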
### Pipeline Funnel
```
7,995 source files (628 hours raw audio)
│
├─ Step 1: Inventory & quality gate
▼
7,455 files transcribed (Step 3)
│
├─ Step 4: Sentence splitting + segment extraction
▼
325,722 segments
│
├─ Step 5: Whisper re-transcription + Levenshtein >= 0.96
▼
253,346 segments (after validation + speaker exclusion)
│
├─ Step 6: SNR, clipping, silence filters
▼
247,244 segments (final dataset — 360.16 hours)
```
Overall acceptance rate: **75.9%** of extracted segments pass all quality checks.
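
The acceptance rate follows directly from the funnel counts: 247,244 / 325,722 ≈ 0.759. The short sketch below recomputes it together with the per-stage retention. The stage labels and the `retention` helper are illustrative names, not part of the pipeline.

```python
# Segment counts taken from the funnel above.
funnel = [
    ("extracted (Step 4)", 325_722),
    ("validated (Step 5)", 253_346),
    ("final (Step 6)", 247_244),
]


def retention(stages):
    """Return (name, count, rate vs. previous stage, rate vs. first stage)."""
    first = stages[0][1]
    previous = first
    rows = []
    for name, count in stages:
        rows.append((name, count, count / previous, count / first))
        previous = count
    return rows


for name, count, step_rate, overall_rate in retention(funnel):
    print(f"{name:<20} {count:>8,} {step_rate:7.1%} {overall_rate:7.1%}")
```

Most of the loss occurs at the validation stage, which also includes speaker exclusion. The audio-quality filters remove only a small further share.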
## Dataset Structure
| Column | Type | Description |
|--------|------|-------------|
| `audio` | Audio | WAV audio (16-bit mono, variable sample rate) |
| `text` | string | Romanian transcript |
| `speaker` | string | Speaker name |
| `gender` | string | `male` or `female` |
| `duration` | float | Duration in seconds |
| `source` | string | `audiobook`, `rss`, or `librivox` |
## Sources
| Source | License | Description |
|--------|---------|-------------|
| **RSS corpus v0.8.1** | CC BY-SA 4.0 | Studio recordings made at the University of Edinburgh in a hemi-anechoic chamber, 48 kHz |
| **LibriVox Romanian** | Public Domain | Volunteer-read audiobooks (2 speakers used) |
| **Audiobooks** | Various / Fair Use | Short fragments used under EU Copyright Directive Art. 3 |
## Legal Notice
This dataset was created for **personal scientific research** under the following legal basis:
- **RSS corpus**: Released under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/).
Attribution: Adriana Stan, University of Edinburgh.
- **LibriVox recordings**: Public domain. No restrictions.
- **Audiobook fragments**: Used under the **EU Copyright Directive (2019/790) Article 3**,
which explicitly permits text and data mining for scientific research purposes.
Only short fragments (1–15 seconds) are included — no complete works are reproduced.
The dataset is used to learn Romanian phonetics and prosody, not to reproduce or distribute
copyrighted content.
**You must not use this dataset to:**
- Redistribute or reconstruct the original copyrighted audio material
- Train commercially deployed models
- Distribute the dataset or derivatives beyond personal research
If you are a rights holder and have concerns, please open an issue on the repository.
## Usage
```python
from datasets import load_dataset

# Load the full dataset
ds = load_dataset("eduardem/romanian-speech-v2")

# Filter by speaker
elena = ds["train"].filter(lambda x: x["speaker"] == "Elena")

# Filter by gender
female = ds["train"].filter(lambda x: x["gender"] == "female")

# Stream without a full download
ds = load_dataset("eduardem/romanian-speech-v2", streaming=True)
for sample in ds["train"]:
    print(sample["speaker"], sample["text"][:80])
    break
```
## Environment and Replication
| Component | Version / Details |
|-----------|-------------------|
| GPU | NVIDIA RTX 3090 (24 GB VRAM) |
| Python | 3.12.3 |
| faster-whisper | 1.2.1 (Whisper large-v3, CUDA float16) |
| PyTorch | 2.10.0 |
| ECAPA2 | Jenthe/ECAPA2 (TorchScript, 192-dim) |
| WeSpeaker | ResNet293-LM (ONNX, 256-dim) |
| onnxruntime-gpu | 1.24.3 |
| librosa | 0.11.0 |
| scikit-learn | Agglomerative clustering |
All processing scripts are available in the project repository.
## Limitations
- **Variable recording quality**: Studio-recorded RSS segments are significantly cleaner than
audiobook recordings, which may have room ambience or compression artifacts
- **Speaker imbalance**: Mihai has 59.66 hours while Diana has only 0.01 hours (4 segments).
Consider filtering to speakers with sufficient data for your use case
- **Sample rate variation**: Ranges from 22.05 kHz to 48 kHz across speakers.
Resample to a uniform rate before training
- **Transcription accuracy**: While the Levenshtein >= 0.96 threshold catches most errors,
some segments may contain minor diacritics or word-boundary inaccuracies
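
One way to act on the speaker-imbalance and sample-rate points is sketched below. The 1-hour cutoff and the 24 kHz target rate are arbitrary values chosen for illustration, `HOURS` holds only a subset of the Speakers table, and the helper names are hypothetical. `Dataset.filter` with `input_columns` and `cast_column` with `Audio(sampling_rate=...)` are existing `datasets` features. The `train` split name follows the Usage section.

```python
# Hours per speaker, copied from the Speakers table (subset shown for brevity).
HOURS = {"Mihai": 59.66, "Adrian": 45.18, "Elena": 4.15,
         "Cristina": 0.58, "Diana": 0.01}


def speakers_with_min_hours(hours: dict[str, float], min_hours: float) -> set[str]:
    """Names of speakers that have at least `min_hours` of audio."""
    return {name for name, h in hours.items() if h >= min_hours}


def prepare(min_hours: float = 1.0, target_rate: int = 24_000):
    """Load the dataset, keep well-represented speakers, resample on access."""
    from datasets import Audio, load_dataset  # lazy import: only needed here

    keep = speakers_with_min_hours(HOURS, min_hours)
    ds = load_dataset("eduardem/romanian-speech-v2", split="train")
    # Filtering on the speaker column alone avoids decoding audio during the pass.
    ds = ds.filter(lambda speaker: speaker in keep, input_columns="speaker")
    # Audio is resampled lazily to target_rate whenever a row is read.
    return ds.cast_column("audio", Audio(sampling_rate=target_rate))
```

Because the cast is lazy, no resampled copy of the dataset is written to disk. The conversion cost is paid each time a sample is accessed.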
## Acknowledgments
- **RSS corpus**: Adriana Stan, SPED Lab, University of Edinburgh
([paper](https://www.isca-speech.org/archive/ssw_2021/stan21_ssw.html))
- **LibriVox**: Volunteer readers Cornel Nemes and Gabriela Oprea
- **Whisper**: OpenAI ([paper](https://arxiv.org/abs/2212.04356))
- **ECAPA2**: Jenthe Thienpondt and Kris Demuynck
([paper](https://arxiv.org/abs/2401.08342))
## Citation
```bibtex
@dataset{romanian_speech_v2_2026,
title={Romanian Speech v2: A Multi-Speaker Romanian TTS Dataset},
author={eduardem},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/datasets/eduardem/romanian-speech-v2}
}
```
Provider:
eduardem



