nadsoft/hamsa-asr-small-21k
Source: Hugging Face | Updated: 2025-12-09 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/nadsoft/hamsa-asr-small-21k
Resource description:
---
language:
- ar
license: apache-2.0
task_categories:
- automatic-speech-recognition
tags:
- arabic
- speech
- asr
- audio
size_categories:
- 10K<n<100K
pretty_name: Arabic ASR Dataset
---
# Arabic ASR Dataset
## Dataset Description
This dataset contains Arabic speech recordings with transcriptions for Automatic Speech Recognition (ASR) tasks.
### Dataset Statistics
- **Total Samples**: 21980
- **Train Samples**: 20880
- **Test Samples**: 1100
- **Language**: Arabic (ar)
- **Task**: Automatic Speech Recognition
- **Audio Format**: WAV (16kHz sampling rate)
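As a quick sanity check on the figures above, the following minimal sketch confirms that the stated train and test counts add up to the stated total and reports the test share. The helper name `split_fraction` is illustrative and not part of the dataset.

```python
# Split sizes as stated in the dataset statistics.
TRAIN_SAMPLES = 20880
TEST_SAMPLES = 1100
TOTAL_SAMPLES = 21980


def split_fraction(part: int, total: int) -> float:
    """Return the share of `total` that `part` represents."""
    return part / total


# The two splits should account for every sample.
assert TRAIN_SAMPLES + TEST_SAMPLES == TOTAL_SAMPLES

# Roughly one sample in twenty is held out for testing.
print(f"Test share: {split_fraction(TEST_SAMPLES, TOTAL_SAMPLES):.1%}")  # Test share: 5.0%
```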
### Features
| Feature | Type | Description |
|---------|------|-------------|
| `audio` | Audio | Audio recording (16kHz) |
| `text` | string | Arabic transcription |
| `gender` | string | Speaker gender (Male/Female/Unknown) |
| `eos_prediction` | int32 | End of sentence prediction (0/1) |
| `eos_probability` | float32 | Probability of end of sentence |
| `model` | string | Model used for prediction |
| `reviewed` | bool | Whether transcription has been reviewed |
| `duration` | float32 | Audio duration in seconds |
| `ignore` | bool | Whether this sample should be ignored |
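The `ignore` and `reviewed` flags suggest a simple filtering step before training. The sketch below shows one way to write such a predicate. The function name `is_usable`, the `require_review` parameter, and the sample records are hypothetical and only mirror the field names in the table.

```python
from typing import Any, Dict


def is_usable(sample: Dict[str, Any], require_review: bool = False) -> bool:
    """Keep samples not flagged `ignore`; optionally require `reviewed`."""
    if sample["ignore"]:
        return False
    if require_review and not sample["reviewed"]:
        return False
    return True


# Hypothetical records that reuse the field names above (not real data).
samples = [
    {"text": "placeholder-1", "reviewed": True, "ignore": False},
    {"text": "no-text", "reviewed": False, "ignore": True},
    {"text": "placeholder-2", "reviewed": False, "ignore": False},
]

usable = [s for s in samples if is_usable(s)]
reviewed_only = [s for s in samples if is_usable(s, require_review=True)]
print(len(usable), len(reviewed_only))  # 2 1
```

With the `datasets` library, the same predicate can be passed to `Dataset.filter`, for example `train_data.filter(is_usable)`.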
### Example Usage
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("nadsoft/hamsa-asr-small-21k")
# Access train and test splits
train_data = dataset['train']
test_data = dataset['test']
# Example: Print first sample
print(train_data[0])
# Example: Access audio and text
audio = train_data[0]['audio']['array']
text = train_data[0]['text']
print(f"Text: {text}")
```
### Data Fields
- **audio**: A dictionary containing:
  - `path`: Path to the audio file
  - `array`: Audio array
  - `sampling_rate`: Sampling rate (16000 Hz)
- **text**: The Arabic transcription text
- **gender**: Speaker gender information
- **eos_prediction**: Binary end of sentence prediction
- **eos_probability**: Confidence score for EOS prediction
- **model**: Name of the model used
- **reviewed**: Boolean indicating if transcription was manually reviewed
- **duration**: Length of audio in seconds
- **ignore**: Boolean flag indicating whether the sample should be ignored (when `True`, `text` is set to "no-text")
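The fields above can be cross-checked against each other. The sketch below assumes that `audio` decodes to a dictionary as described and that `duration` equals the number of samples divided by the sampling rate; the card does not state how `duration` was computed, so treat that as an assumption. The helper names and the `tolerance` value are illustrative.

```python
from typing import Any, Dict


def computed_duration(audio: Dict[str, Any]) -> float:
    """Duration in seconds, derived from the sample count and sampling rate."""
    return len(audio["array"]) / audio["sampling_rate"]


def duration_matches(sample: Dict[str, Any], tolerance: float = 0.05) -> bool:
    """Check the stored `duration` against the value derived from the audio.

    `tolerance` (in seconds) is an arbitrary illustrative threshold.
    """
    return abs(computed_duration(sample["audio"]) - sample["duration"]) <= tolerance


# Synthetic sample: 32000 zero-valued samples at 16000 Hz is 2 seconds of silence.
sample = {
    "audio": {"path": None, "array": [0.0] * 32000, "sampling_rate": 16000},
    "duration": 2.0,
}

print(computed_duration(sample["audio"]))  # 2.0
print(duration_matches(sample))  # True
```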
### Model Information
Transcriptions were generated using: `nadsoft/Hamsa-Conversational-v1.0-mulaw`
### Citation
If you use this dataset in your research, please cite it appropriately.
### License
This dataset is licensed under Apache 2.0.
Provider:
nadsoft



