HeshamHaroon/arabic-msa-25k-saudi-male-tashkeel
Source: Hugging Face · Updated 2026-04-20 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/HeshamHaroon/arabic-msa-25k-saudi-male-tashkeel
---
language:
- ar
license: cc-by-4.0
pretty_name: "Arabic MSA 25K — Saudi Male (Tashkeel)"
size_categories:
- 10K<n<100K
task_categories:
- text-to-speech
- automatic-speech-recognition
- audio-classification
tags:
- arabic
- msa
- modern-standard-arabic
- saudi-arabic
- male-voice
- tashkeel
- diacritized
- harakat
- speech-synthesis
- tts
- asr
- audio
- single-speaker
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/train-*.parquet"
---
# Arabic MSA 25K — Saudi Male (Tashkeel)
> **25,000 fully-diacritized Arabic MSA text + audio pairs, rendered with a single
> Saudi male neural voice at 48 kHz / 16-bit PCM, across 10 thematic categories.**
---
## Dataset Summary
`arabic-msa-25k-saudi-male-tashkeel` is a **25,000-clip Modern Standard Arabic (MSA)
speech corpus** with matching diacritized text (full tashkeel / ḥarakāt). Every clip
is synthesized by the single voice `ar-SA-HamedNeural` (Azure Neural TTS, Saudi
Arabic male) at **48 kHz, 16-bit, mono PCM WAV** — ~60.5 hours of audio in total.
The text was generated by GPT-4o-mini under strict rules (MSA only, no dialect,
full tashkeel, 13–45 words, no Quranic or poetic content), then synthesized by
Azure Speech across two independent regional resources for throughput. A rich
per-clip metadata record is provided, including diacritized text, stripped
(non-diacritized) text, topic category, word and character counts, tashkeel
density, and WAV duration.
| Quick facts | Value |
|---|---|
| Total clips | **25,000** |
| Total audio | **60.54 h** |
| Average clip | 8.72 s |
| Speaker | Single: `ar-SA-HamedNeural` (Saudi male, neural TTS) |
| Language | Arabic — Modern Standard (MSA) with full tashkeel |
| Sample rate | 48,000 Hz |
| Bit depth / channels | 16-bit / mono |
| Average tashkeel density | **0.78** (tashkeel characters ÷ Arabic letters) |
| Categories | 10 (balanced, 2,500 clips each) |
| Disk size | ~19.5 GB (WAV) |
| License | CC-BY-4.0 |
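
As a rough consistency check on the quick facts, the total duration and the uncompressed audio size can be derived from the clip count, the mean clip length, and the PCM format. The sketch below ignores WAV header overhead and reads "GB" as GiB (2^30 bytes); that reading is an assumption, not something the card states. Under those assumptions both results land close to the 60.54 h and ~19.5 GB reported in the table.

```python
# Sanity-check the quick facts; all inputs are copied from the table above.
CLIPS = 25_000
MEAN_CLIP_S = 8.72
TOTAL_HOURS = 60.54
SAMPLE_RATE_HZ = 48_000
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHANNELS = 1           # mono

# Total duration implied by clip count x mean clip length.
hours_from_mean = CLIPS * MEAN_CLIP_S / 3600

# Raw PCM payload implied by the reported total duration.
bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
pcm_gib = TOTAL_HOURS * 3600 * bytes_per_second / 2**30

print(f"{hours_from_mean:.2f} h implied by the mean clip length")
print(f"{pcm_gib:.1f} GiB of raw PCM (headers ignored)")
```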
---
## Supported Tasks & Use Cases
| Task | How to use this dataset |
|---|---|
| **Text-to-Speech (TTS) fine-tuning** | Train / adapt a TTS model on a consistent single-voice Saudi MSA corpus. Paired ⟨text with tashkeel, 48 kHz WAV⟩ at scale. |
| **Automatic Speech Recognition (ASR) training / evaluation** | Use MSA ⟨audio, transcript⟩ pairs with both diacritized and stripped text variants; 60 h is a non-trivial fine-tuning budget for small/mid ASR. |
| **Diacritization evaluation** | Use `text_stripped` as input, `text` as target. Forces a model to predict tashkeel from context. |
| **Voice cloning / speaker adaptation reference** | Single-speaker, studio-quality reference set for comparing clones of Saudi male MSA speakers. |
| **Arabic speech emotion / prosody research** | Baseline of a neutral single-voice register — useful as a "no-emotion" control against expressive corpora. |
| **Audio length / readability regression** | Correlate word count ⟶ audio duration ⟶ character count for MSA at scale. |
> ⚠️ This dataset is **not** a replacement for human-recorded speech corpora. It is
> a synthetic dataset produced by a commercial neural TTS system. See
> **Considerations** below for how that affects downstream training.
---
## Languages
Modern Standard Arabic (MSA), **Saudi accent** (spoken via the `ar-SA` voice).
All text is fully diacritized (tashkeel): fatḥa, ḍamma, kasra, sukūn, shadda, tanwīn.
```text
text: تُعَدُّ الطَاقَةُ الشَّمْسِيَّةُ إِحْدَى أَنْظَفِ مَصَادِرِ الطَاقَةِ الْمُتَجَدِّدَةِ …
text_stripped: تعد الطاقة الشمسية إحدى أنظف مصادر الطاقة المتجددة …
```
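
The relationship between `text` and `text_stripped` can be approximated with a small regular expression. This is a minimal sketch, not the card's own stripping rule: the mark set (U+064B to U+0652 plus the superscript alef U+0670) and the helper name `strip_tashkeel` are assumptions.

```python
import re

# Assumed mark set: tanwin, fatha, damma, kasra, shadda, sukun (U+064B-U+0652)
# plus superscript alef (U+0670). The dataset's own rule may differ.
TASHKEEL_RE = re.compile("[\u064B-\u0652\u0670]")


def strip_tashkeel(text: str) -> str:
    """Remove Arabic diacritic marks, leaving the base letters untouched."""
    return TASHKEEL_RE.sub("", text)


# One diacritized word (ta, damma, ain, fatha, dal, shadda, damma), written
# with escapes so the code points are unambiguous.
diacritized = "\u062A\u064F\u0639\u064E\u062F\u0651\u064F"
print(strip_tashkeel(diacritized))
```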
---
## Dataset Structure
### Files
```
├── README.md                            # this card
├── manifest.json                        # aggregate corpus stats
├── metadata.jsonl                       # 25K metadata rows (provided for manual inspection)
└── data/
    ├── train-00000-of-00005.parquet     # rows 0 – 4,999 (~3.5 GB each)
    ├── train-00001-of-00005.parquet     # rows 5,000 – 9,999
    ├── train-00002-of-00005.parquet     # rows 10,000 – 14,999
    ├── train-00003-of-00005.parquet     # rows 15,000 – 19,999
    └── train-00004-of-00005.parquet     # rows 20,000 – 24,999
```
Each Parquet shard **embeds the 48 kHz WAV bytes inline** using the Hugging Face
`Audio` feature, so there are no external `.wav` files to fetch. The `datasets`
library decodes them lazily on access.
### Data Instances
A decoded row from the dataset:
```python
{
    "audio": {
        "path": "hamed_000042.wav",
        "array": array([...], dtype=float32),  # shape (num_samples,), 48 kHz
        "sampling_rate": 48000
    },
    "id": "hamed_000042",
    "text": "يُعَدُّ الذَّكَاءُ الاِصْطِنَاعِيُّ فَرْعًا مِنْ فُرُوعِ عِلْمِ الْحَاسُوبِ…",
    "text_stripped": "يعد الذكاء الاصطناعي فرعا من فروع علم الحاسوب…",
    "category": "technology_ai",
    "voice": "ar-SA-HamedNeural",
    "gender": "male",
    "word_count": 28,
    "char_count": 182,
    "tashkeel_density": 0.34,
    "audio_duration_s": 8.12,
}
```
### Data Fields
| Field | Type | Description |
|---|---|---|
| `audio` | `Audio(48000)` | 16-bit mono PCM at 48 kHz, embedded bytes in Parquet; decoded lazily |
| `id` | `string` | Deterministic clip id `hamed_NNNNNN` |
| `text` | `string` | Fully diacritized MSA text (the source used for TTS synthesis) |
| `text_stripped` | `string` | Same text with all tashkeel characters removed |
| `category` | `string` | One of the 10 topic categories (see below) |
| `voice` | `string` | Always `ar-SA-HamedNeural` |
| `gender` | `string` | Always `male` |
| `word_count` | `int32` | Arabic word count (post-tashkeel-strip) |
| `char_count` | `int32` | Raw character count (including tashkeel marks) |
| `tashkeel_density` | `float32` | # tashkeel marks ÷ # Arabic letters — typically 0.60–0.85 |
| `audio_duration_s` | `float32` | Clip duration in seconds, parsed from the WAV header |
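
The `tashkeel_density` definition (tashkeel marks ÷ Arabic letters) can be reproduced approximately as follows. This is a sketch under assumptions: the card does not say which code points count as marks or letters, so the ranges below (marks U+064B to U+0652, letters U+0621 to U+064A excluding the tatweel U+0640) and the function name are illustrative.

```python
def tashkeel_density(text: str) -> float:
    """Ratio of diacritic marks to Arabic base letters (assumed ranges)."""
    marks = sum(1 for ch in text if "\u064B" <= ch <= "\u0652")
    letters = sum(
        1 for ch in text
        if "\u0621" <= ch <= "\u064A" and ch != "\u0640"  # skip tatweel
    )
    return marks / letters if letters else 0.0


# Three letters carrying four marks (the final letter has shadda + damma).
word = "\u062A\u064F\u0639\u064E\u062F\u0651\u064F"
print(round(tashkeel_density(word), 2))
```

Under this definition a single word can exceed 1.0, because a shadda stacks with a vowel mark on the same letter; sentence-level values are pulled down by letters that carry no mark, such as long vowels.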
### Categories (10 × 2,500 clips each)
| Category | Topic (Arabic description) |
|---|---|
| `news_bulletin` | نشرة أخبار عامة (سياسية، اقتصادية، رياضية، علمية) |
| `science_explainer` | شرح علمي مبسط |
| `health_wellness` | نصائح صحية وغذائية |
| `technology_ai` | تقنية، برمجة، ذكاء اصطناعي |
| `nature_environment` | بيئة، مناخ، حيوانات ونباتات |
| `history_geography` | تاريخ وجغرافيا |
| `commerce_business` | أعمال، إدارة، تسويق |
| `education_learning` | تعليم وتطوير ذاتي |
| `culture_heritage` | ثقافة وتراث |
| `daily_life_lifestyle` | حياة يومية وعادات |
### Splits
The dataset ships as a **single `train` split of 25,000 rows**. Downstream users
are free to carve out validation / test splits; we recommend stratifying by
`category` to preserve topic balance.
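
One way to carve out a stratified evaluation split is to hold out the same fraction of every category. The sketch below works on a plain list of category labels so that it runs without downloading anything; `stratified_split`, `eval_fraction`, and `seed` are hypothetical names, not part of the dataset or of the `datasets` library.

```python
import random
from collections import defaultdict


def stratified_split(categories, eval_fraction=0.1, seed=0):
    """Return (train_idx, eval_idx), holding out eval_fraction of each category."""
    by_category = defaultdict(list)
    for idx, category in enumerate(categories):
        by_category[category].append(idx)

    rng = random.Random(seed)
    train_idx, eval_idx = [], []
    for category in sorted(by_category):  # sorted for reproducibility
        indices = by_category[category]
        rng.shuffle(indices)
        n_eval = round(len(indices) * eval_fraction)
        eval_idx.extend(indices[:n_eval])
        train_idx.extend(indices[n_eval:])
    return sorted(train_idx), sorted(eval_idx)


# Toy labels standing in for the dataset's `category` column.
labels = ["news_bulletin"] * 20 + ["technology_ai"] * 20
train_idx, eval_idx = stratified_split(labels, eval_fraction=0.1)
print(len(train_idx), len(eval_idx))
```

With the real dataset, the returned index lists could be passed to `ds.select(...)`. The `datasets` library also offers `train_test_split(..., stratify_by_column=...)`, which expects the column to be a `ClassLabel`.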
### Loading
```python
from datasets import load_dataset, Audio

ds = load_dataset("HeshamHaroon/arabic-msa-25k-saudi-male-tashkeel", split="train")

# The audio column decodes lazily on access. This cast pins decoding to the
# native 48 kHz rate; pass a different sampling_rate to resample instead.
ds = ds.cast_column("audio", Audio(sampling_rate=48000))

row = ds[0]
print(row["text"][:80])
print(row["audio"]["array"].shape, row["audio"]["sampling_rate"])
# e.g. (418560,) 48000 → ~8.7 s at 48 kHz

# Stream (no local download of the full ~20 GB):
ds_stream = load_dataset(
    "HeshamHaroon/arabic-msa-25k-saudi-male-tashkeel",
    split="train",
    streaming=True,
)
for row in ds_stream.take(3):
    print(row["id"], row["category"], row["text"][:60])
```
---
## Dataset Creation
### Motivation
High-quality Arabic speech data is under-represented relative to English. Even
within Arabic, **Saudi MSA with full tashkeel** and controlled topic diversity is
rarely available at this scale. The goal of this release is to provide a clean,
legally-unencumbered, synthetic speech corpus that:
1. Uses a single consistent voice (enables speaker-conditional studies),
2. Guarantees full tashkeel coverage on every sample (enables diacritization
research and ASR that produces diacritized output),
3. Balances topic categories (no news-only bias), and
4. Is large enough (~60 h) to meaningfully fine-tune small-to-mid ASR / TTS.
### Generation pipeline
```
┌─────────────────────────┐
│ Stage 1 — text gen      │  Azure OpenAI gpt-4o-mini
│ 25,000 unique MSA texts │  temperature 0.6, 40 concurrent
│ 10 categories × 2,500   │  strict validators: MSA-only,
│ Full tashkeel           │  tashkeel density, word count
│                         │  dedup by SHA-256 of text
└─────────────┬───────────┘
              │
              ▼
┌─────────────────────────┐
│ Stage 2 — TTS           │  Azure Speech (ar-SA-HamedNeural)
│ 25,000 × WAV 48 kHz     │  80 concurrent across two independent
│ 16-bit mono PCM         │  regional resources; exponential backoff
│                         │  on HTTP 429
└─────────────┬───────────┘
              │
              ▼
┌─────────────────────────┐
│ Stage 3 — manifest      │  Aggregate metadata into manifest.json
│                         │  (duration, tashkeel density, …)
└─────────────────────────┘
```
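
The "exponential backoff on HTTP 429" step in Stage 2 could look roughly like the sketch below. It is not the pipeline's actual code: `RateLimited`, `with_backoff`, the retry limits, and the added jitter are all assumptions, and the flaky function merely stands in for a real synthesis request.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical signal that the service answered with HTTP 429."""


def with_backoff(call, max_retries=5, base_delay_s=1.0, sleep=time.sleep):
    """Run call(), doubling the wait after each rate-limit error."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimited:
            if attempt == max_retries:
                raise
            # Exponential delay plus jitter (the jitter is an assumption).
            delay = base_delay_s * 2 ** attempt + random.uniform(0, base_delay_s)
            sleep(delay)


# Stand-in for a synthesis request that is throttled twice, then succeeds.
attempts = {"n": 0}


def flaky_synthesize():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited()
    return b"RIFF"


waits = []  # record the delays instead of actually sleeping
audio = with_backoff(flaky_synthesize, sleep=waits.append)
print(attempts["n"], len(waits))
```

Injecting `sleep` keeps the helper testable without real delays; in production the default `time.sleep` would apply.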
### Source Data
| Component | Source |
|---|---|
| Text | Synthetic, generated by [GPT-4o-mini](https://learn.microsoft.com/azure/ai-services/openai) under strict system-prompt constraints (MSA only, 13–45 words, no Quran / no poetry / no dialect / no Latin / minimal digits) |
| Audio | Synthesized by [Azure Speech Service](https://learn.microsoft.com/azure/ai-services/speech-service) — neural voice `ar-SA-HamedNeural` (Saudi male), format `riff-48khz-16bit-mono-pcm` |
### Validation & Quality Gates
Every text had to pass all of the following before being synthesized:
- **Tashkeel density** (observed mean **0.78**).
- **Word count** ∈ [13, 45] MSA words.
- **No dialect markers** — rejection list includes common Saudi / Egyptian /
Levantine / Maghrebi tokens.
- **No Quranic signals** — rejection on ﴿﴾ markers or `بسم الله الرحمن الرحيم`,
`صلى الله عليه وسلم`, etc.
- **Latin / digit caps** to keep the text purely Arabic.
- **SHA-256 dedup** across categories and generation chunks.
Approximately 43 % of raw model outputs were rejected by these validators, so
although the corpus is synthetic, it is relatively clean.
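
A few of the gates listed above can be sketched in plain Python. This is an illustration under assumptions, not the validators that were actually run: the density threshold and the dialect token list are not given in this card and are left out, the Latin rule is simplified to "no Latin letters at all", and `passes_gates` is a hypothetical name.

```python
import hashlib
import re

TASHKEEL_RE = re.compile("[\u064B-\u0652\u0670]")  # assumed mark set
LATIN_RE = re.compile(r"[A-Za-z]")
ORNATE_PARENS = ("\uFD3E", "\uFD3F")  # the Quranic quotation brackets
FORMULAE = (
    "بسم الله الرحمن الرحيم",
    "صلى الله عليه وسلم",
)


def passes_gates(text, seen_hashes, min_words=13, max_words=45):
    """Return True if text clears the sketched gates; record its hash if so."""
    stripped = TASHKEEL_RE.sub("", text)
    if not min_words <= len(stripped.split()) <= max_words:
        return False  # word-count gate
    if any(mark in text for mark in ORNATE_PARENS):
        return False  # Quranic bracket markers
    if any(phrase in stripped for phrase in FORMULAE):
        return False  # religious formulae, matched on undiacritized text
    if LATIN_RE.search(text):
        return False  # simplified Latin gate
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False  # exact duplicate
    seen_hashes.add(digest)
    return True


seen = set()
sample = " ".join(["\u0643\u0644\u0645\u0629"] * 15)  # 15 placeholder words
print(passes_gates(sample, seen), passes_gates(sample, seen))
```

The second call fails only because the first one recorded the text's SHA-256 digest, which mirrors the dedup gate.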
### Annotations
All annotations (`category`, `word_count`, `tashkeel_density`, `audio_duration_s`)
are programmatic — either direct from the generation pipeline or parsed from WAV
headers. **No human annotation was performed.**
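
Reading a duration from a WAV header, as described for `audio_duration_s`, needs only the standard library. The sketch below is a generic illustration rather than the pipeline's code; `wav_duration_s` is a hypothetical helper, and the one-second silent clip is built in memory purely for the demonstration.

```python
import io
import wave


def wav_duration_s(wav_bytes: bytes) -> float:
    """Duration in seconds, from the frame count and rate in the WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


# Build one second of 48 kHz, 16-bit, mono silence in memory.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)        # 2 bytes = 16-bit samples
    wf.setframerate(48_000)
    wf.writeframes(b"\x00\x00" * 48_000)

print(wav_duration_s(buf.getvalue()))
```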
### Personal and Sensitive Information
None by construction:
- Texts were freshly generated by an LLM with no conditioning on personal data.
- Audio is synthesized by a studio voice and contains no real speaker.
- No user identifiers, locations, or PII appear in the pipeline.
If any PII or sensitive content slips through the generation filter, please
open an issue on the dataset page and we will remove it.
---
## Considerations for Using the Data
### Known Limitations
- **Synthetic source** — text and audio are both model-generated. Models that
train solely on this corpus will inherit any systematic biases of GPT-4o-mini
(content bias) and Azure `ar-SA-HamedNeural` (prosodic / phonetic bias).
- **Average word count is 14.6**, below the 20-word target of the
original spec. Downstream users needing longer utterances should concatenate
or augment.
- **Saudi-accented MSA**, not pan-Arab-neutral MSA. Phonetic realisations of
`ق` / `ض` / `ج` reflect the Saudi voice. For other accents, pair with
equivalent generation using `ar-EG-*`, `ar-AE-*`, `ar-JO-*`, etc.
- **Single voice** — this corpus cannot be used on its own for multi-speaker
TTS / ASR tasks that require speaker diversity.
- **No emotional / expressive variation** — the Azure neural voice is used in
its default register. All clips are in a neutral tone.
- **Tashkeel from a generative model** — while density is high (~0.78), the
tashkeel itself has not been individually verified by human linguists. For
high-stakes linguistic research (e.g. the ḍabṭ of grammatical case endings),
cross-check with an MSA tashkeel reference tool.
- **No Whisper round-trip WER** is included in this release; Azure OpenAI
Whisper rate-limits made a 25K round-trip impractical. A partial WER may be
appended in a future revision.
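
For readers who want to run their own round-trip check on a subset, word error rate is the word-level edit distance divided by the reference length. The sketch below is a generic implementation, not something shipped with this release; `word_error_rate` is a hypothetical name, and comparing against `text_stripped` rather than `text` is a suggestion, since most ASR systems emit undiacritized output.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref) if ref else 0.0


# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```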
### Biases
- **Topic bias** — the 10 chosen categories (news, science, health, tech,
nature, history, commerce, education, culture, daily life) are themselves a
design choice. They omit e.g. sports, politics-heavy commentary, legal
language, and dialectal casual speech.
- **Register bias** — all clips are formal MSA. Real-world Arabic AI use cases
(chat, voice assistants) often need dialectal or colloquial data; this
corpus is *not* that.
### Other
The rejection filter explicitly refused Quranic, ḥadīth, and poetic text to
avoid releasing TTS renderings of sacred or metrical text without appropriate
context. If your application specifically requires religious or poetic audio,
use a more suitable dedicated corpus.
---
## Licensing
**This dataset is released under the [Creative Commons Attribution 4.0
International license (CC-BY-4.0)](https://creativecommons.org/licenses/by/4.0/).**
Users must also:
- comply with the [Azure OpenAI Code of Conduct](https://learn.microsoft.com/legal/cognitive-services/openai/code-of-conduct),
- comply with the [Azure AI terms for synthetic speech output](https://learn.microsoft.com/legal/cognitive-services/speech-service/transparency-note-synthetic-voice), and
- attribute this dataset and the generation tooling in downstream derivatives.
---
## Citation
If you use this dataset, please cite it as:
```bibtex
@misc{haroon_arabic_msa_25k_saudi_male_2026,
title = {Arabic MSA 25K — Saudi Male (Tashkeel)},
author = {Hesham Haroon},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/HeshamHaroon/arabic-msa-25k-saudi-male-tashkeel}},
note = {25,000 fully-diacritized Arabic MSA clips synthesized with Azure ar-SA-HamedNeural}
}
```
---
## Contact & Issues
Found an issue, want a re-run on a different voice, need more clips, or
concerned about specific content? Please open an issue on the dataset page or
reach out on Hugging Face.
Provider: HeshamHaroon



