
HeshamHaroon/arabic-msa-25k-saudi-male-tashkeel

Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/HeshamHaroon/arabic-msa-25k-saudi-male-tashkeel
Resource description:
---
language:
- ar
license: cc-by-4.0
pretty_name: "Arabic MSA 25K — Saudi Male (Tashkeel)"
size_categories:
- 10K<n<100K
task_categories:
- text-to-speech
- automatic-speech-recognition
- audio-classification
tags:
- arabic
- msa
- modern-standard-arabic
- saudi-arabic
- male-voice
- tashkeel
- diacritized
- harakat
- speech-synthesis
- tts
- asr
- audio
- single-speaker
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/train-*.parquet"
---

# Arabic MSA 25K — Saudi Male (Tashkeel)

> **25,000 fully-diacritized Arabic MSA text + audio pairs, rendered with a single
> Saudi male neural voice at 48 kHz / 16-bit PCM, across 10 thematic categories.**

---

## Dataset Summary

`arabic-msa-25k-saudi-male-tashkeel` is a **25,000-clip Modern Standard Arabic (MSA) speech corpus** with matching diacritized text (full tashkeel / ḥarakāt). Every clip is synthesized by the single voice `ar-SA-HamedNeural` (Azure Neural TTS, Saudi Arabic male) at **48 kHz, 16-bit, mono PCM WAV**, for ~60.5 hours of audio in total.

The text was generated by GPT-4o-mini under strict rules (MSA only, no dialect, full tashkeel, 13–45 words, no Quranic or poetic content), then synthesized by Azure Speech across two independent regional resources for throughput. A rich per-clip metadata record is provided, including diacritized text, stripped (non-diacritized) text, topic category, word and character counts, tashkeel density, and WAV duration.

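As a quick sanity check, the clip count, total duration, and PCM format reported in this card imply an average clip length and an approximate raw audio size. The sketch below is illustrative arithmetic only: it assumes the rounded total of 60.54 hours, ignores WAV headers and Parquet overhead, and its variable names are hypothetical (they are not dataset fields).

```python
# Illustrative arithmetic only; the names below are not dataset fields.
NUM_CLIPS = 25_000
TOTAL_HOURS = 60.54          # rounded total reported in this card
SAMPLE_RATE_HZ = 48_000
BYTES_PER_SAMPLE = 2         # 16-bit PCM
CHANNELS = 1                 # mono

total_seconds = TOTAL_HOURS * 3600
avg_clip_seconds = total_seconds / NUM_CLIPS

# Raw PCM payload, ignoring WAV headers and Parquet overhead.
pcm_bytes = total_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
pcm_gib = pcm_bytes / 2**30

print(f"average clip: {avg_clip_seconds:.2f} s")   # average clip: 8.72 s
print(f"raw PCM size: {pcm_gib:.1f} GiB")          # raw PCM size: 19.5 GiB
```

Both results agree with the average clip length and the WAV disk size that the card reports, which suggests the disk figure is expressed in binary gigabytes.
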
| Quick facts | Value |
|---|---|
| Total clips | **25,000** |
| Total audio | **60.54 h** |
| Average clip | 8.72 s |
| Speaker | Single: `ar-SA-HamedNeural` (Saudi male, neural TTS) |
| Language | Arabic: Modern Standard (MSA) with full tashkeel |
| Sample rate | 48,000 Hz |
| Bit depth / channels | 16-bit / mono |
| Average tashkeel density | **0.78** (tashkeel characters ÷ Arabic letters) |
| Categories | 10 (balanced, 2,500 clips each) |
| Disk size | ~19.5 GB (WAV) |
| License | CC-BY-4.0 |

---

## Supported Tasks & Use Cases

| Task | How to use this dataset |
|---|---|
| **Text-to-Speech (TTS) fine-tuning** | Train / adapt a TTS model on a consistent single-voice Saudi MSA corpus. Paired ⟨text with tashkeel, 48 kHz WAV⟩ at scale. |
| **Automatic Speech Recognition (ASR) training / evaluation** | Use MSA ⟨audio, transcript⟩ pairs with both diacritized and stripped text variants; 60 h is a non-trivial fine-tuning budget for small/mid ASR. |
| **Diacritization evaluation** | Use `text_stripped` as input, `text` as target. Forces a model to predict tashkeel from context. |
| **Voice cloning / speaker adaptation reference** | Single-speaker, studio-quality reference set for comparing clones of Saudi male MSA speakers. |
| **Arabic speech emotion / prosody research** | Baseline of a neutral single-voice register, useful as a "no-emotion" control against expressive corpora. |
| **Audio length / readability regression** | Correlate word count ⟶ audio duration ⟶ character count for MSA at scale. |

> ⚠️ This dataset is **not** a replacement for human-recorded speech corpora. It is
> a synthetic dataset produced by a commercial neural TTS system. See
> **Considerations** below for how that affects downstream training.

---

## Languages

Modern Standard Arabic (MSA), **Saudi accent** (spoken via the `ar-SA` voice). All text is fully diacritized (tashkeel): fatḥa, ḍamma, kasra, sukūn, shadda, tanwīn.

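The relationship between diacritized text, stripped text, and tashkeel density can be sketched in a few lines. This is a minimal illustration, not the author's pipeline: it assumes that "tashkeel" means the eight Unicode marks U+064B to U+0652 (the three tanwīn forms, fatḥa, ḍamma, kasra, shadda, sukūn) and that "Arabic letters" means U+0621 to U+063A plus U+0641 to U+064A. The function names are hypothetical, and the dataset's own counts may be based on a different character set.

```python
import re

# Assumption: tashkeel = U+064B..U+0652 (tanwin x3, fatha, damma, kasra,
# shadda, sukun). The dataset's own pipeline may use a different set.
TASHKEEL_RE = re.compile(r"[\u064B-\u0652]")
# Assumption: Arabic letters = U+0621..U+063A and U+0641..U+064A
# (tatweel, U+0640, is deliberately excluded).
LETTER_RE = re.compile(r"[\u0621-\u063A\u0641-\u064A]")


def strip_tashkeel(text: str) -> str:
    """Remove diacritic marks, leaving the base letters untouched."""
    return TASHKEEL_RE.sub("", text)


def tashkeel_density(text: str) -> float:
    """Ratio of tashkeel marks to Arabic letters (0.0 if there are no letters)."""
    letters = len(LETTER_RE.findall(text))
    marks = len(TASHKEEL_RE.findall(text))
    return marks / letters if letters else 0.0


# One diacritized word written with explicit escapes:
# alef, lam+sukun, ain+kasra, lam+sukun, meem+kasra (5 letters, 4 marks).
sample = "\u0627\u0644\u0652\u0639\u0650\u0644\u0652\u0645\u0650"

print(strip_tashkeel(sample))      # the same five letters without marks
print(tashkeel_density(sample))    # 0.8
```

Under these assumptions a bare alef or a long vowel carries no mark, which is one reason fully diacritized text can still have a density below 1.0.
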
```text
text:          تُعَدُّ الطَاقَةُ الشَّمْسِيَّةُ إِحْدَى أَنْظَفِ مَصَادِرِ الطَاقَةِ الْمُتَجَدِّدَةِ …
text_stripped: تعد الطاقة الشمسية إحدى أنظف مصادر الطاقة المتجددة …
```

---

## Dataset Structure

### Files

```
├── README.md                          # this card
├── manifest.json                      # aggregate corpus stats
├── metadata.jsonl                     # 25K metadata rows (provided for manual inspection)
└── data/
    ├── train-00000-of-00005.parquet   # rows 0 – 4,999 (~3.5 GB each)
    ├── train-00001-of-00005.parquet   # rows 5,000 – 9,999
    ├── train-00002-of-00005.parquet   # rows 10,000 – 14,999
    ├── train-00003-of-00005.parquet   # rows 15,000 – 19,999
    └── train-00004-of-00005.parquet   # rows 20,000 – 24,999
```

Each Parquet shard **embeds the 48-kHz WAV bytes inline** using the Hugging Face `Audio` feature, so there are no external `.wav` files to fetch. The `datasets` library decodes them lazily on access.

### Data Instances

A decoded row from the dataset:

```python
{
    "audio": {
        "path": "hamed_000042.wav",
        "array": array([...], dtype=float32),  # shape (num_samples,), 48 kHz
        "sampling_rate": 48000
    },
    "id": "hamed_000042",
    "text": "يُعَدُّ الذَّكَاءُ الاِصْطِنَاعِيُّ فَرْعًا مِنْ فُرُوعِ عِلْمِ الْحَاسُوبِ…",
    "text_stripped": "يعد الذكاء الاصطناعي فرعا من فروع علم الحاسوب…",
    "category": "technology_ai",
    "voice": "ar-SA-HamedNeural",
    "gender": "male",
    "word_count": 28,
    "char_count": 182,
    "tashkeel_density": 0.34,
    "audio_duration_s": 8.12,
}
```

### Data Fields

| Field | Type | Description |
|---|---|---|
| `audio` | `Audio(48000)` | 16-bit mono PCM at 48 kHz, embedded bytes in Parquet; decoded lazily |
| `id` | `string` | Deterministic clip id `hamed_NNNNNN` |
| `text` | `string` | Fully diacritized MSA text (the source used for TTS synthesis) |
| `text_stripped` | `string` | Same text with all tashkeel characters removed |
| `category` | `string` | One of the 10 topic categories (see below) |
| `voice` | `string` | Always `ar-SA-HamedNeural` |
| `gender` | `string` | Always `male` |
| `word_count` | `int32` | Arabic word count (post-tashkeel-strip) |
| `char_count` | `int32` | Raw character count (including tashkeel marks) |
| `tashkeel_density` | `float32` | Number of tashkeel marks ÷ number of Arabic letters, typically 0.60–0.85 |
| `audio_duration_s` | `float32` | Clip duration in seconds, parsed from the WAV header |

### Categories (10 × 2,500 clips each)

| Category | Topic (Arabic description) |
|---|---|
| `news_bulletin` | نشرة أخبار عامة (سياسية، اقتصادية، رياضية، علمية) |
| `science_explainer` | شرح علمي مبسط |
| `health_wellness` | نصائح صحية وغذائية |
| `technology_ai` | تقنية، برمجة، ذكاء اصطناعي |
| `nature_environment` | بيئة، مناخ، حيوانات ونباتات |
| `history_geography` | تاريخ وجغرافيا |
| `commerce_business` | أعمال، إدارة، تسويق |
| `education_learning` | تعليم وتطوير ذاتي |
| `culture_heritage` | ثقافة وتراث |
| `daily_life_lifestyle` | حياة يومية وعادات |

### Splits

The dataset ships as a **single `train` split of 25,000 rows**. Downstream users are free to carve out validation / test splits; we recommend stratifying by `category` to preserve topic balance.

### Loading

```python
from datasets import load_dataset, Audio

ds = load_dataset("HeshamHaroon/arabic-msa-25k-saudi-male-tashkeel", split="train")

# Audio feature auto-decodes on access
ds = ds.cast_column("audio", Audio(sampling_rate=48000))

row = ds[0]
print(row["text"][:80])
print(row["audio"]["array"].shape, row["audio"]["sampling_rate"])
# (418560,) 48000  → ~8.7 s at 48 kHz

# Stream (no local download of the full 20 GB):
ds_stream = load_dataset("HeshamHaroon/arabic-msa-25k-saudi-male-tashkeel",
                         split="train", streaming=True)
for row in ds_stream.take(3):
    print(row["id"], row["category"], row["text"][:60])
```

---

## Dataset Creation

### Motivation

High-quality Arabic speech data is under-represented relative to English. Even within Arabic, **Saudi MSA with full tashkeel** and controlled topic diversity is rarely available at this scale.

The goal of this release is to provide a clean, legally unencumbered, synthetic speech corpus that:

1. Uses a single consistent voice (enables speaker-conditional studies),
2. Guarantees full tashkeel coverage on every sample (enables diacritization research and ASR that produces diacritized output),
3. Balances topic categories (no news-only bias), and
4. Is large enough (~60 h) to meaningfully fine-tune small-to-mid ASR / TTS.

### Generation pipeline

```
┌─────────────────────────┐
│ Stage 1: text gen       │  Azure OpenAI gpt-4o-mini
│ 25,000 unique MSA texts │  temperature 0.6, 40 concurrent
│ 10 categories × 2,500   │  strict validators: MSA-only,
│ Full tashkeel           │  tashkeel density, word count
│                         │  dedup by SHA-256 of text
└─────────────┬───────────┘
              │
              ▼
┌─────────────────────────┐
│ Stage 2: TTS            │  Azure Speech (ar-SA-HamedNeural)
│ 25,000 × WAV 48 kHz     │  80 concurrent across two independent
│ 16-bit mono PCM         │  regional resources; exponential backoff
│                         │  on HTTP 429
└─────────────┬───────────┘
              │
              ▼
┌─────────────────────────┐
│ Stage 3: manifest       │  Aggregate metadata into manifest.json
│                         │  (duration, tashkeel density, …)
└─────────────────────────┘
```

### Source Data

| Component | Source |
|---|---|
| Text | Synthetic, generated by [GPT-4o-mini](https://learn.microsoft.com/azure/ai-services/openai) under strict system-prompt constraints (MSA only, 13–45 words, no Quran / no poetry / no dialect / no Latin / minimal digits) |
| Audio | Synthesized by [Azure Speech Service](https://learn.microsoft.com/azure/ai-services/speech-service) with the neural voice `ar-SA-HamedNeural` (Saudi male), format `riff-48khz-16bit-mono-pcm` |

### Validation & Quality Gates

Every text had to pass all of the following before being synthesized:

- **Tashkeel density** (observed mean **0.78**).
- **Word count** ∈ [13, 45] MSA words.
- **No dialect markers**: the rejection list includes common Saudi / Egyptian / Levantine / Maghrebi tokens.
- **No Quranic signals**: rejection on ﴿﴾ markers or `بسم الله الرحمن الرحيم`, `صلى الله عليه وسلم`, etc.
- **Latin / digit caps** to keep the text purely Arabic.
- **SHA-256 dedup** across categories and generation chunks.

Approximately 43 % of raw model outputs were rejected by these validators, which is why the corpus is synthetic but relatively clean.

### Annotations

All annotations (`category`, `word_count`, `tashkeel_density`, `audio_duration_s`) are programmatic, either direct from the generation pipeline or parsed from WAV headers. **No human annotation was performed.**

### Personal and Sensitive Information

None by construction:

- Texts were freshly generated by an LLM with no conditioning on personal data.
- Audio is synthesized by a studio voice and contains no real speaker.
- No user identifiers, locations, or PII appear in the pipeline.

If any PII or sensitive content slips through the generation filter, please open an issue on the dataset page and we will remove it.

---

## Considerations for Using the Data

### Known Limitations

- **Synthetic source**: text and audio are both model-generated. Models that train solely on this corpus will inherit any systematic biases of GPT-4o-mini (content bias) and Azure `ar-SA-HamedNeural` (prosodic / phonetic bias).
- **Average word count is 14.6**, slightly below the 20-word intent of the original spec. Downstream users needing longer utterances should concatenate or augment.
- **Saudi-accented MSA**, not pan-Arab-neutral MSA. Phonetic realisations of `ق` / `ض` / `ج` reflect the Saudi voice. For other accents, pair with equivalent generation using `ar-EG-*`, `ar-AE-*`, `ar-JO-*`, etc.
- **Single voice**: this corpus cannot be used on its own for multi-speaker TTS / ASR tasks that require speaker diversity.
- **No emotional / expressive variation**: the Azure neural voice is used in its default register. All clips are in a neutral tone.
- **Tashkeel from a generative model**: while density is high (~0.78), the tashkeel itself has not been individually verified by human linguists. For high-stakes linguistic research (e.g. grammatical case ḍabṭ), cross-check with an MSA tashkeel reference tool.
- **No Whisper round-trip WER** is included in this release; Azure OpenAI Whisper rate-limits made a 25K round-trip impractical. A partial WER may be appended in a future revision.

### Biases

- **Topic bias**: the 10 chosen categories (news, science, health, tech, nature, history, commerce, education, culture, daily life) are themselves a design choice. They omit e.g. sports, politics-heavy commentary, legal language, and dialectal casual speech.
- **Register bias**: all clips are formal MSA. Real-world Arabic AI use cases (chat, voice assistants) often need dialectal or colloquial data; this corpus is *not* that.

### Other

The rejection filter explicitly refused Quranic, ḥadīth, and poetic text to avoid releasing TTS renderings of sacred or metrical text without appropriate context. If your application specifically requires religious or poetic audio, use a more suitable dedicated corpus.

---

## Licensing

**This dataset is released under the [Creative Commons Attribution 4.0 International license (CC-BY-4.0)](https://creativecommons.org/licenses/by/4.0/).**

Users must comply with:

- [Azure OpenAI Code of Conduct](https://learn.microsoft.com/legal/cognitive-services/openai/code-of-conduct),
- [Azure AI terms for synthetic speech output](https://learn.microsoft.com/legal/cognitive-services/speech-service/transparency-note-synthetic-voice),
- And attribute this dataset + the generation tooling in downstream derivatives.

---

## Citation

If you use this dataset, please cite it as:

```bibtex
@misc{haroon_arabic_msa_25k_saudi_male_2026,
  title        = {Arabic MSA 25K — Saudi Male (Tashkeel)},
  author       = {Hesham Haroon},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/datasets/HeshamHaroon/arabic-msa-25k-saudi-male-tashkeel}},
  note         = {25,000 fully-diacritized Arabic MSA clips synthesized with Azure ar-SA-HamedNeural}
}
```

---

## Contact & Issues

Found an issue, want a re-run on a different voice, need more clips, or concerned about specific content? Please open an issue on the dataset page or reach out on Hugging Face.

Provider:
HeshamHaroon