eduardem/romanian-speech-v2

Hugging Face | Updated 2026-03-09 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/eduardem/romanian-speech-v2
Description:
---
language:
- ro
license: other
license_name: research-only-mixed
task_categories:
- text-to-speech
- automatic-speech-recognition
tags:
- romanian
- speech
- tts
- multi-speaker
- research-only
pretty_name: Romanian Speech v2
size_categories:
- 100K<n<1M
---

> **Research Use Only**: This dataset is released strictly for personal research and educational
> purposes. The processing pipeline and all scripts are fully open source, but the underlying audio
> originates from sources with varying copyrights. Only short fragments were used under fair use
> provisions and EU Copyright Directive Art. 3 (text and data mining for scientific research).
> This dataset must **not** be used for redistribution of the source material, commercial purposes,
> or training commercially deployed models. No complete copyrighted works are reproduced; the
> resulting segments are short utterances (1–15 seconds) from which a model learns Romanian
> phonetics and prosody, not original content.

# Romanian Speech v2

> **This dataset supersedes [romanian-speech-v1](https://huggingface.co/datasets/eduardem/romanian-speech-v1),
> which is now deprecated.** The v1 dataset was used to fine-tune
> [f5-tts-romanian](https://huggingface.co/eduardem/f5-tts-romanian) and
> [xtts-v2-romanian](https://huggingface.co/eduardem/xtts-v2-romanian).
> After training, we discovered audible artifacts in both models (truncated words, misaligned
> sentence boundaries, and inconsistent speaker labeling), traced back to quality issues in the
> v1 dataset. This v2 release is a complete rebuild with a stricter pipeline: dual-pass Whisper
> transcription with Levenshtein validation (>= 0.96), ensemble speaker clustering (ECAPA2 +
> WeSpeaker), F0-based gender verification, and multi-stage audio quality filters. These changes
> eliminate the artifact classes found in v1.

A curated multi-speaker Romanian speech dataset for TTS model fine-tuning and speech research.

| | |
|---|---|
| **Segments** | 247,244 |
| **Total duration** | 360.16 hours |
| **Speakers** | 21 (10 female / 11 male) |
| **Language** | Romanian (ro) |
| **Audio format** | WAV, 16-bit, mono |
| **Segment duration** | 1–15 seconds |

## Speakers

| Speaker | Gender | Segments | Duration | Sample Rate | Bitrate | Source |
|---------|--------|----------|----------|-------------|---------|--------|
| Adrian | Male | 30,111 | 45.18 h | 32 kHz | 512 kbps | Audiobook |
| Ana | Female | 9,225 | 11.50 h | 22.05 kHz | 352 kbps | Audiobook |
| Andreea | Female | 21,209 | 31.60 h | 44.1 kHz | 705 kbps | Audiobook |
| Andrei | Male | 5,487 | 8.67 h | 22.05 kHz | 352 kbps | LibriVox |
| Bogdan | Male | 1,644 | 2.61 h | 22.05 kHz | 352 kbps | Audiobook |
| Catalin | Male | 220 | 0.34 h | 44.1 kHz | 705 kbps | Audiobook |
| Ciprian | Male | 26,712 | 38.12 h | 22.05 kHz | 352 kbps | Audiobook |
| Cristina | Female | 322 | 0.58 h | 22.05 kHz | 352 kbps | Audiobook |
| Diana | Female | 4 | 0.01 h | 44.1 kHz | 705 kbps | Audiobook |
| Dragos | Male | 6,464 | 10.28 h | 48 kHz | 768 kbps | Audiobook |
| Elena | Female | 3,611 | 4.15 h | 48 kHz | 768 kbps | RSS Corpus |
| Ioana | Female | 503 | 0.87 h | 44.1 kHz | 705 kbps | Audiobook |
| Maria | Female | 1,177 | 2.01 h | 22.05 kHz | 352 kbps | LibriVox |
| Marius | Male | 6,357 | 8.43 h | 48 kHz | 768 kbps | Audiobook |
| Mihaela | Female | 18,584 | 25.33 h | 44.1 kHz | 705 kbps | Audiobook |
| Mihai | Male | 36,775 | 59.66 h | 22.05 kHz | 352 kbps | Audiobook |
| Radu | Male | 17,704 | 26.27 h | 44.1 kHz | 705 kbps | Audiobook |
| Raluca | Female | 28,169 | 37.10 h | 48 kHz | 768 kbps | Audiobook |
| Simona | Female | 1,915 | 2.37 h | 44.1 kHz | 705 kbps | Audiobook |
| Stefan | Male | 8,101 | 13.53 h | 22.05 kHz | 352 kbps | Audiobook |
| Vasile | Male | 22,950 | 31.55 h | 44.1 kHz | 705 kbps | Audiobook |

## Methodology

This section describes the full processing pipeline so that it can be replicated. All scripts are open source.

### Step 0: Cleanup and Deduplication

- Delete junk files and directories (`@eaDir`, `.nfo`, `.jpg`, `.txt`, `.DS_Store`)
- Compute SHA-256 hash for every audio file
- Remove exact duplicates (keep first alphabetically)

### Step 1: Audio Inventory

- Run `ffprobe` on every audio file to extract duration, sample rate, channels, and bitrate
- Reject corrupt files and files shorter than 10 seconds (RSS corpus exempted, since its files are pre-segmented utterances of 2–5 seconds)
- For M4B audiobooks: extract individual chapters via ffprobe chapter metadata and inventory each separately

### Step 2: Speaker Identification

Speaker clustering uses an **ensemble of two embedding models** to maximize accuracy:

1. **ECAPA2** ([Jenthe/ECAPA2](https://huggingface.co/Jenthe/ECAPA2)): TorchScript model, 192-dimensional embeddings, 0.34% EER on VoxCeleb
2. **WeSpeaker ResNet293-LM**: ONNX model via `onnxruntime-gpu`, 256-dimensional embeddings, 0.53% EER on VoxCeleb

For each audio file, a 30–60 second sample from the middle is extracted and embedded by both models. Cosine distance matrices from each model are min-max normalized and averaged. Agglomerative clustering (average linkage) is applied with a distance threshold of 0.25 (validated on a stable plateau at 0.25–0.35).

Known speakers from public datasets (RSS, LibriVox) are assigned directly. Speaker gender is verified using F0 (fundamental frequency) analysis. Low-quality clusters (score < 60) and singleton speakers (only 1 source file) are excluded: 11 speakers dropped, 21 retained.
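
The fusion and clustering described in Step 2 can be illustrated with a small, dependency-free sketch. This is not the project's script: the pipeline lists scikit-learn for agglomerative clustering, whereas the function names below (`min_max_normalize`, `fuse`, `average_linkage`) and the toy distance matrices are hypothetical and only show the idea of normalizing, averaging, and merging until the closest pair of clusters reaches the 0.25 threshold.

```python
from itertools import combinations


def min_max_normalize(dist):
    """Rescale the off-diagonal entries of a square distance matrix to [0, 1]."""
    n = len(dist)
    values = [dist[i][j] for i, j in combinations(range(n), 2)]
    if not values:  # zero or one file: nothing to rescale
        return [row[:] for row in dist]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # guard against all distances being equal
    return [[0.0 if i == j else (dist[i][j] - lo) / span for j in range(n)]
            for i in range(n)]


def fuse(dist_a, dist_b):
    """Average two min-max-normalized distance matrices element-wise."""
    a, b = min_max_normalize(dist_a), min_max_normalize(dist_b)
    n = len(a)
    return [[(a[i][j] + b[i][j]) / 2 for j in range(n)] for i in range(n)]


def average_linkage(dist, threshold=0.25):
    """Greedy average-linkage clustering; stop once the closest pair of
    clusters is at or above the distance threshold."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = combinations(range(len(clusters)), 2)
        x, y = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[x], clusters[y]) >= threshold:
            break
        clusters[x] += clusters[y]  # x < y, so deleting y keeps x valid
        del clusters[y]
    return [sorted(c) for c in clusters]


# Toy cosine-distance matrices for four files (made-up values, not real data).
dist_ecapa = [[0.0, 0.1, 0.8, 0.9],
              [0.1, 0.0, 0.7, 0.8],
              [0.8, 0.7, 0.0, 0.2],
              [0.9, 0.8, 0.2, 0.0]]
dist_wespeaker = [[0.0, 0.2, 0.9, 0.8],
                  [0.2, 0.0, 0.8, 0.9],
                  [0.9, 0.8, 0.0, 0.1],
                  [0.8, 0.9, 0.1, 0.0]]

clusters = average_linkage(fuse(dist_ecapa, dist_wespeaker), threshold=0.25)
print(clusters)
```

On this toy input, files 0 and 1 end up in one cluster and files 2 and 3 in another. Because the fused matrix is min-max normalized, the threshold is relative to the spread of distances in the collection rather than an absolute cosine distance.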

### Step 3: Transcription

- Convert source audio to 16 kHz mono WAV (temporary, on local SSD)
- Transcribe with **Whisper large-v3** via [faster-whisper](https://github.com/SYSTRAN/faster-whisper) on CUDA (float16)
- Settings: `language="ro"`, `beam_size=5`, `vad_filter=True`, `word_timestamps=True`
- Output: per-file JSON with word-level timestamps
- For RSS corpus: existing text transcripts from the corpus are also loaded for later comparison

### Step 4: Sentence Splitting

Segments are split at sentence boundaries using word-level timestamps from Whisper:

- **Primary method**: Split at sentence-ending punctuation (`.`, `?`, `!`), respecting Romanian abbreviations (`d-le`, `nr.`, `str.`, `prof.`, etc.) to avoid false splits
- **Fallback**: When sentence splitting produces fewer than 30% usable segments (common with poetry or text without punctuation), fall back to Whisper's natural VAD-based segments
- **RSS corpus**: Pre-segmented utterances are copied directly (already clean single sentences)
- **Filters**: Reject segments shorter than 1 second or longer than 15 seconds, or with fewer than 2 words
- **Silence padding**: 300 ms of silence appended to each segment

### Step 5: Transcript Validation

Every segment is **re-transcribed independently** with Whisper large-v3, and the new transcription is compared against the original (from Step 3/4):

1. Both texts are normalized: lowercased, punctuation stripped, Romanian diacritics normalized (cedilla forms `ţ/ş` converted to comma-below forms `ț/ș`)
2. Normalized Levenshtein similarity ratio is computed
3. **Segments with ratio < 0.96 are rejected**. This catches truncated words, hallucinated text, misaligned boundaries, and segments where Whisper was uncertain
4. This dual-pass approach (transcribe the full file, split, then re-transcribe segments) provides a strong consistency check that neither pass alone can achieve

### Step 6: Quality Filters

Each segment passes three audio-quality checks:

| Filter | Threshold | Method |
|--------|-----------|--------|
| Signal-to-noise ratio | >= 15 dB | Energy-based SNR estimation |
| Clipping | < 1% of samples | Count samples at max amplitude |
| Silence ratio | < 50% | Fraction of frames below silence threshold |

Segments failing any check are dropped. Spectral flatness is logged but not used as a hard filter.

### Pipeline Funnel

```
7,995 source files (628 hours raw audio)
    │
    ├─ Step 1: Inventory & quality gate
    ▼
7,455 files transcribed (Step 3)
    │
    ├─ Step 4: Sentence splitting + segment extraction
    ▼
325,722 segments
    │
    ├─ Step 5: Whisper re-transcription + Levenshtein >= 0.96
    ▼
253,346 segments (after validation + speaker exclusion)
    │
    ├─ Step 6: SNR, clipping, silence filters
    ▼
247,244 segments (final dataset, 360.16 hours)
```

Overall acceptance rate: **75.9%** of extracted segments pass all quality checks.

## Dataset Structure

| Column | Type | Description |
|--------|------|-------------|
| `audio` | Audio | WAV audio (16-bit mono, variable sample rate) |
| `text` | string | Romanian transcript |
| `speaker` | string | Speaker name |
| `gender` | string | `male` or `female` |
| `duration` | float | Duration in seconds |
| `source` | string | `audiobook`, `rss`, or `librivox` |

## Sources

| Source | License | Description |
|--------|---------|-------------|
| **RSS corpus v0.8.1** | CC BY-SA 4.0 | Studio recordings from University of Edinburgh, hemianechoic chamber, 48 kHz |
| **LibriVox Romanian** | Public Domain | Volunteer-read audiobooks (2 speakers used) |
| **Audiobooks** | Various / Fair Use | Short fragments used under EU Copyright Directive Art. 3 |

## Legal Notice

This dataset was created for **personal scientific research** under the following legal basis:

- **RSS corpus**: Released under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/). Attribution: Adriana Stan, University of Edinburgh.
- **LibriVox recordings**: Public domain. No restrictions.
- **Audiobook fragments**: Used under the **EU Copyright Directive (2019/790) Article 3**, which explicitly permits text and data mining for scientific research purposes. Only short fragments (1–15 seconds) are included; no complete works are reproduced. The dataset is used to learn Romanian phonetics and prosody, not to reproduce or distribute copyrighted content.

**You must not use this dataset to:**

- Redistribute or reconstruct the original copyrighted audio material
- Train commercially deployed models
- Distribute the dataset or derivatives beyond personal research

If you are a rights holder and have concerns, please open an issue on the repository.

## Usage

```python
from datasets import load_dataset

# Load full dataset
ds = load_dataset("eduardem/romanian-speech-v2")

# Filter by speaker
elena = ds["train"].filter(lambda x: x["speaker"] == "Elena")

# Filter by gender
female = ds["train"].filter(lambda x: x["gender"] == "female")

# Stream without full download
ds = load_dataset("eduardem/romanian-speech-v2", streaming=True)
for sample in ds["train"]:
    print(sample["speaker"], sample["text"][:80])
    break
```

## Environment and Replication

| Component | Version / Details |
|-----------|-------------------|
| GPU | NVIDIA RTX 3090 (24 GB VRAM) |
| Python | 3.12.3 |
| faster-whisper | 1.2.1 (Whisper large-v3, CUDA float16) |
| PyTorch | 2.10.0 |
| ECAPA2 | Jenthe/ECAPA2 (TorchScript, 192-dim) |
| WeSpeaker | ResNet293-LM (ONNX, 256-dim) |
| onnxruntime-gpu | 1.24.3 |
| librosa | 0.11.0 |
| scikit-learn | Agglomerative clustering |

All processing scripts are available in the project repository.
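
For readers replicating the pipeline without the project scripts, the Step 5 transcript comparison can be approximated with the standard library alone. The sketch below is an illustration under stated assumptions, not the project's implementation: the card does not define the exact ratio, so `similarity` assumes `1 - distance / max(len)` on character strings, and the helper names (`normalize`, `levenshtein`, `similarity`, `is_valid`) are hypothetical.

```python
import unicodedata

# Map legacy cedilla forms to the comma-below forms used in modern Romanian.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
})


def normalize(text):
    """Unify diacritics, lowercase, strip punctuation, collapse whitespace."""
    text = text.translate(CEDILLA_TO_COMMA).lower()
    text = "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))
    return " ".join(text.split())


def levenshtein(a, b):
    """Character-level edit distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def similarity(first_pass, second_pass):
    """Assumed ratio: 1 - distance / length of the longer normalized text."""
    a, b = normalize(first_pass), normalize(second_pass)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def is_valid(first_pass, second_pass, threshold=0.96):
    """Keep a segment only if both transcriptions agree closely enough."""
    return similarity(first_pass, second_pass) >= threshold
```

With this definition, two transcriptions that differ only in casing, punctuation, or cedilla versus comma-below diacritics score 1.0, while a segment whose re-transcription drops a word falls well below 0.96 for typical utterance lengths.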

## Limitations

- **Variable recording quality**: Studio-recorded RSS segments are significantly cleaner than audiobook recordings, which may have room ambience or compression artifacts
- **Speaker imbalance**: Mihai has 59.66 hours while Diana has only 0.01 hours (4 segments). Consider filtering to speakers with sufficient data for your use case
- **Sample rate variation**: Ranges from 22.05 kHz to 48 kHz across speakers. Resample to a uniform rate before training
- **Transcription accuracy**: While the Levenshtein >= 0.96 threshold catches most errors, some segments may contain minor diacritics or word-boundary inaccuracies

## Acknowledgments

- **RSS corpus**: Adriana Stan, SPED Lab, University of Edinburgh ([paper](https://www.isca-speech.org/archive/ssw_2021/stan21_ssw.html))
- **LibriVox**: Volunteer readers Cornel Nemes and Gabriela Oprea
- **Whisper**: OpenAI ([paper](https://arxiv.org/abs/2212.04356))
- **ECAPA2**: Jenthe Thienpondt and Kris Demuynck ([paper](https://arxiv.org/abs/2401.08342))

## Citation

```bibtex
@dataset{romanian_speech_v2_2026,
  title={Romanian Speech v2: A Multi-Speaker Romanian TTS Dataset},
  author={eduardem},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/eduardem/romanian-speech-v2}
}
```
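
As a follow-up to the speaker-imbalance note under Limitations, one possible way to pick speakers with enough material is to sum the `duration` column per `speaker` and keep those above a chosen minimum. The helper name `speakers_with_min_hours` and the one-hour default are assumptions made for illustration; the rows can be any iterable of mappings carrying the `speaker` and `duration` fields described in the Dataset Structure table.

```python
from collections import defaultdict


def speakers_with_min_hours(rows, min_hours=1.0):
    """Return the set of speakers whose summed duration reaches min_hours."""
    seconds = defaultdict(float)
    for row in rows:
        seconds[row["speaker"]] += row["duration"]  # duration is in seconds
    return {name for name, total in seconds.items()
            if total >= min_hours * 3600}


# Toy rows with made-up durations; real rows would come from the dataset.
rows = [
    {"speaker": "A", "duration": 3600.0},
    {"speaker": "B", "duration": 10.0},
    {"speaker": "B", "duration": 5.0},
]
keep = speakers_with_min_hours(rows, min_hours=1.0)
print(keep)
```

The resulting set can then be passed to the same `filter` pattern shown under Usage, for example `ds["train"].filter(lambda x: x["speaker"] in keep)`.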
Provider:
eduardem