
NahwAI/arabic-tashkeel-speech

Source: Hugging Face. Updated 2026-04-21. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/NahwAI/arabic-tashkeel-speech
Resource overview:
---
language:
- ar
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
pretty_name: "Nahw Arabic Tashkeel Speech Dataset"
tags:
- audio
- arabic
- speech
- asr
- tashkeel
- diacritics
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: transcription
    dtype: string
  - name: sentence
    dtype: string
  - name: speaker_id
    dtype: string
  splits:
  - name: train
    num_examples: 1093
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Nahw Arabic Tashkeel Speech Dataset

An open-source collection of **1,093** fully diacritized Arabic speech recordings, crowd-sourced from native speakers via [Nahw.ai](https://nahw.ai).

## Dataset summary

| Stat | Value |
|------|-------|
| Total recordings | 1,093 |
| Speakers | 10 |
| Language | Arabic (ar) |
| Sampling rate | 16 kHz |
| License | CC-BY-4.0 |

## Features

- **audio**: The speech recording, resampled to 16 kHz.
- **transcription**: The fully diacritized Arabic sentence that was read aloud.
- **sentence**: The same sentence without diacritics (tashkeel removed).
- **speaker_id**: An anonymized speaker identifier.

## Usage

```python
from datasets import load_dataset

ds = load_dataset("NahwAI/arabic-tashkeel-speech")
print(ds["train"][0])
```

## Data collection

Native Arabic speakers recorded sentences through the Nahw.ai platform. Each recording was reviewed and approved by a human annotator before inclusion in this dataset.

## License

This dataset is released under the [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).

## Citation

```bibtex
@dataset{nahw_arabic_speech_2026,
  title={Nahw Arabic Tashkeel Speech Dataset},
  author={Nahw.ai},
  year={2026},
  url={https://huggingface.co/datasets/NahwAI/arabic-tashkeel-speech},
  license={CC-BY-4.0}
}
```
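
The card describes `sentence` as the `transcription` text with tashkeel removed. A minimal sketch of that relationship follows. It assumes tashkeel means the combining marks U+064B through U+0652 plus the superscript alef U+0670; the dataset's actual stripping rules are not documented here and may differ. The helper name `strip_tashkeel` and the sample word are hypothetical and are not taken from the dataset.

```python
import re

# Assumption: tashkeel = fathatan..sukun (U+064B-U+0652) plus superscript
# alef (U+0670). The dataset authors may have used a different mark set.
TASHKEEL = re.compile("[\u064B-\u0652\u0670]")


def strip_tashkeel(text: str) -> str:
    """Return `text` with the assumed set of Arabic diacritics removed."""
    return TASHKEEL.sub("", text)


# Hypothetical example word (kaf, ta, ba, each carrying a fatha).
diacritized = "\u0643\u064E\u062A\u064E\u0628\u064E"
plain = strip_tashkeel(diacritized)
print(len(diacritized), len(plain))
```

Applied to a loaded row, `strip_tashkeel(row["transcription"])` could be compared against `row["sentence"]` as a consistency check, keeping in mind that a mismatch may only reflect a different definition of tashkeel.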
Provider:
NahwAI