NahwAI/arabic-tashkeel-speech
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/NahwAI/arabic-tashkeel-speech
Resource description:
---
language:
- ar
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
pretty_name: "Nahw Arabic Tashkeel Speech Dataset"
tags:
- audio
- arabic
- speech
- asr
- tashkeel
- diacritics
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: transcription
    dtype: string
  - name: sentence
    dtype: string
  - name: speaker_id
    dtype: string
  splits:
  - name: train
    num_examples: 1093
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Nahw Arabic Tashkeel Speech Dataset
An open-source collection of **1,093** fully diacritized Arabic speech recordings, crowd-sourced from native speakers via [Nahw.ai](https://nahw.ai).
## Dataset summary
| Stat | Value |
|------|-------|
| Total recordings | 1,093 |
| Speakers | 10 |
| Language | Arabic (ar) |
| Sampling rate | 16 kHz |
| License | CC-BY-4.0 |
## Features
- **audio**: The speech recording, resampled to 16 kHz.
- **transcription**: The fully diacritized Arabic sentence that was read aloud.
- **sentence**: The same sentence without diacritics (tashkeel removed).
- **speaker_id**: An anonymized speaker identifier.
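
The relationship between `transcription` and `sentence` can be illustrated with a short sketch. The card does not say how tashkeel was removed, so the character set below (U+064B through U+0652, plus the superscript alef U+0670) is an assumption, and `strip_tashkeel` and `matches_sentence` are hypothetical helpers, not the dataset's own preprocessing.

```python
import re

# Assumed set of tashkeel marks: fathatan through sukun (U+064B-U+0652)
# and superscript alef (U+0670). The dataset's actual rules are not documented.
_TASHKEEL = re.compile("[\u064B-\u0652\u0670]")


def strip_tashkeel(text: str) -> str:
    """Return `text` with the assumed diacritic marks removed."""
    return _TASHKEEL.sub("", text)


def matches_sentence(example: dict) -> bool:
    """Check whether stripping `transcription` reproduces `sentence`."""
    return strip_tashkeel(example["transcription"]) == example["sentence"]
```

If `matches_sentence` returns `False` for some rows, the dataset may use a different set of marks or apply extra normalization, so the check is best treated as a diagnostic rather than a validation rule.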
## Usage
```python
from datasets import load_dataset

# Load the default config, then print the first example of the train split.
ds = load_dataset("NahwAI/arabic-tashkeel-speech")
print(ds["train"][0])
```
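
Because the dataset ships a single `train` split, evaluation data has to be carved out by the user. The following is a minimal sketch of one option, a speaker-disjoint split built on the `speaker_id` field. The helper names `split_by_speaker` and `load_speaker_split` are hypothetical, and the values passed as `held_out` must be real identifiers from the dataset, which are not listed in this card.

```python
def split_by_speaker(speaker_ids, held_out):
    """Return (train_idx, test_idx) so held-out speakers appear only in test."""
    held_out = set(held_out)
    train_idx, test_idx = [], []
    for i, spk in enumerate(speaker_ids):
        # Route each row index by whether its speaker is held out.
        (test_idx if spk in held_out else train_idx).append(i)
    return train_idx, test_idx


def load_speaker_split(held_out):
    """Load the dataset and apply the split (needs network access and `datasets`)."""
    # Imported lazily so split_by_speaker stays dependency-free.
    from datasets import load_dataset

    train = load_dataset("NahwAI/arabic-tashkeel-speech", split="train")
    train_idx, test_idx = split_by_speaker(train["speaker_id"], held_out)
    return train.select(train_idx), train.select(test_idx)
```

Splitting by speaker rather than by row keeps a model from being scored on voices it has already heard, which matters with only 10 speakers.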
## Data collection
Native Arabic speakers recorded sentences through the Nahw.ai platform. Each recording was reviewed and approved by a human annotator before inclusion in this dataset.
## License
This dataset is released under the [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).
## Citation
```bibtex
@dataset{nahw_arabic_speech_2026,
title={Nahw Arabic Tashkeel Speech Dataset},
author={Nahw.ai},
year={2026},
url={https://huggingface.co/datasets/NahwAI/arabic-tashkeel-speech},
license={CC-BY-4.0}
}
```
Provider:
NahwAI



