MrEzzat/arabic-eou-detection-10k
Source: Hugging Face | Updated: 2025-12-11 | Indexed: 2025-12-20
Download link: https://hf-mirror.com/datasets/MrEzzat/arabic-eou-detection-10k
Resource description:
---
language:
- ar
license: mit
size_categories:
- 10K<n<100K
task_categories:
- text-classification
task_ids:
- sentiment-classification
tags:
- arabic
- saudi-dialect
- end-of-utterance
- eou-detection
- voice-agent
- conversational-ai
- najdi-dialect
pretty_name: Arabic End-of-Utterance Detection Dataset (Saudi Dialect)
---
# Arabic End-of-Utterance (EOU) Detection Dataset
## Dataset Description
This dataset contains **10,000 high-quality Arabic utterances** specifically designed for training End-of-Utterance (EOU) detection models for voice agents and conversational AI systems, with a focus on **Saudi Arabic (Najdi dialect)**.
### Dataset Summary
- **Language:** Arabic (Saudi Najdi dialect)
- **Task:** Binary text classification (EOU detection)
- **Size:** 10,000 samples
- **Splits:** Train (70%), Validation (15%), Test (15%)
- **Quality Score:** 85.8/100
### Supported Tasks
- **End-of-Utterance Detection:** Classify whether an utterance is complete (EOU) or incomplete (non-EOU)
- **Voice Agent Development:** Train models for real-time EOU detection in conversational AI
- **Saudi Arabic NLP:** Fine-tune models for Saudi dialect understanding
## Dataset Structure
### Data Instances
Each instance contains:
- `utterance`: The Arabic text utterance
- `style`: One of `informal`, `formal`, or `asr_like`
- `label`: Binary label (1 = EOU/complete, 0 = non-EOU/incomplete)
Example:
```json
{
  "utterance": "هل أقدر أحجز طاولة اليوم؟",
  "style": "formal",
  "label": 1
}
```
### Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `utterance` | string | Arabic text utterance (Saudi dialect) |
| `style` | string | Style of utterance: `informal` (Najdi slang), `formal` (MSA-infused), `asr_like` (simulated ASR imperfections) |
| `label` | int | Binary label: `1` = EOU (complete utterance), `0` = non-EOU (incomplete utterance) |
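The field definitions above can be turned into a small schema check. The following is a minimal sketch, not part of the dataset tooling: `validate_record` and the two constants are hypothetical names introduced here, and the rules simply mirror the table.

```python
# Hypothetical helper: checks one record against the fields documented in this card.
ALLOWED_STYLES = {"informal", "formal", "asr_like"}
ALLOWED_LABELS = {0, 1}


def validate_record(record):
    """Return a list of problems found in one record (an empty list means valid)."""
    problems = []

    utterance = record.get("utterance")
    if not isinstance(utterance, str) or not utterance.strip():
        problems.append("utterance must be a non-empty string")

    style = record.get("style")
    if not isinstance(style, str) or style not in ALLOWED_STYLES:
        problems.append("style must be one of: informal, formal, asr_like")

    label = record.get("label")
    # bool is a subclass of int in Python, so True/False are rejected explicitly.
    if isinstance(label, bool) or not isinstance(label, int) or label not in ALLOWED_LABELS:
        problems.append("label must be the integer 0 or 1")

    return problems


example = {"utterance": "هل أقدر أحجز طاولة اليوم؟", "style": "formal", "label": 1}
print(validate_record(example))
```

Running a check like this over every row before training is a cheap way to catch export or parsing mistakes.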
### Data Splits
| Split | Samples | Percentage |
|-------|---------|------------|
| Train | 7,000 | 70% |
| Validation | 1,500 | 15% |
| Test | 1,500 | 15% |
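A sketch of how the splits might be loaded and compared against the table above. It assumes the Hugging Face `datasets` library is installed and that the Hub split names are `train`, `validation`, and `test`; the card does not state the exact split names, so treat them as an assumption. `check_split_sizes` and `load_and_check` are hypothetical helpers.

```python
# Expected sizes taken from the Data Splits table; split names are assumed.
EXPECTED_SIZES = {"train": 7000, "validation": 1500, "test": 1500}


def check_split_sizes(actual_sizes, expected=EXPECTED_SIZES):
    """Return {split: (expected, actual)} for every split whose size differs."""
    return {
        name: (n, actual_sizes.get(name))
        for name, n in expected.items()
        if actual_sizes.get(name) != n
    }


def load_and_check(repo_id="MrEzzat/arabic-eou-detection-10k"):
    """Download the dataset and report any split-size mismatches."""
    # Imported lazily so the size check above works without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset(repo_id)  # returns a DatasetDict keyed by split name
    sizes = {name: split.num_rows for name, split in ds.items()}
    return ds, check_split_sizes(sizes)
```

Calling `load_and_check()` needs network access; an empty mismatch dict would indicate the downloaded splits match the documented sizes.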
## Dataset Creation
### Source Data
This dataset was synthetically generated using large language models (LLMs) with carefully engineered prompts to ensure:
- Authentic Saudi Arabic (Najdi dialect) patterns
- Controlled label distribution (approximately 60% EOU, 40% non-EOU)
- Zero ellipsis bias (no punctuation crutches)
- High vocabulary diversity
- Realistic incomplete utterances
### Data Collection Process
1. **Prompt Engineering:** Designed expert-level system prompts with EOU-aware generation rules
2. **Multi-Style Generation:** Created three distinct styles:
   - **Informal:** Natural Saudi Najdi slang and Gulf phrasing
   - **Formal:** MSA-infused Saudi Arabic for professional contexts
   - **ASR-like:** Simulated ASR imperfections (vowel drops, character swaps, word merges)
3. **Quality Validation:** Rigorous quality checks for duplicates, bias patterns, and label distribution
4. **Stratified Splitting:** Train/val/test splits maintain label and style distributions
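The stratified splitting step can be sketched as follows. This is an illustrative implementation only: the card does not publish the actual splitting code, so the function name, the seed, and the choice to stratify on the `(label, style)` pair are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(records, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split records into train/validation/test, stratified by (label, style)."""
    # Group records so each (label, style) combination is split separately.
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["label"], rec["style"])].append(rec)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for key in sorted(groups):  # sorted keys keep the result reproducible
        bucket = groups[key]
        rng.shuffle(bucket)
        n_train = round(len(bucket) * fractions[0])
        n_val = round(len(bucket) * fractions[1])
        train.extend(bucket[:n_train])
        val.extend(bucket[n_train:n_train + n_val])
        test.extend(bucket[n_train + n_val:])  # remainder goes to test
    return train, val, test
```

Because each group is divided with the same fractions, every split keeps approximately the same label and style proportions as the full dataset.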
### Annotations
The dataset uses synthetic annotations generated by LLMs with the following labeling rules:
**EOU (label=1) - Complete Utterances:**
- Complete questions: "متى يوصل الطلب؟" (When will the order arrive?)
- Complete statements: "أنا موافق على الشروط" (I agree to the terms)
- Complete requests: "ممكن تساعدني؟" (Can you help me?)
**Non-EOU (label=0) - Incomplete Utterances:**
- Trailing phrases: "بس لازم نتفق أول" (But we need to agree first)
- Incomplete questions: "هل تقدر تشوف" (Can you see...)
- Mid-thought fillers: "يعني أنا أقصد" (I mean, what I mean is...)
- Trailing conjunctions: "خلاص فهمت، بس" (Okay I understood, but...)
## Dataset Statistics
### Label Distribution
- **EOU (label=1):** 6,055 samples (60.55%)
- **Non-EOU (label=0):** 3,945 samples (39.45%)
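Two quantities follow directly from these counts: the accuracy of a trivial classifier that always predicts EOU, and inverse-frequency class weights that could be passed to a weighted loss. The sketch below uses the common "balanced" heuristic `N / (K * n_c)`; whether to weight classes at all is a modeling choice the card does not prescribe, and the function names are hypothetical.

```python
# Label counts as stated in this card.
LABEL_COUNTS = {1: 6055, 0: 3945}


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (num_classes * class_count)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


def majority_baseline_accuracy(counts):
    """Accuracy of always predicting the most frequent label."""
    return max(counts.values()) / sum(counts.values())


weights = balanced_class_weights(LABEL_COUNTS)
baseline = majority_baseline_accuracy(LABEL_COUNTS)
print(f"class weights: {weights}")
print(f"majority-class baseline accuracy: {baseline:.4f}")
```

Any trained model should be compared against the majority-class baseline rather than against 50%.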
### Style Distribution
- **Formal:** 4,163 samples (41.63%)
- **Informal:** 3,674 samples (36.74%)
- **ASR-like:** 2,163 samples (21.63%)
### Quality Metrics
- **Duplicates:** 5.55% (555 samples)
- **Unique last words:** 3,885 (38.85% of dataset)
- **Ellipsis bias:** 0% (no punctuation crutches)
- **Average word count:** 5.77 words
- **Average character count:** 28.91 characters
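Metrics of this kind could be recomputed from a list of utterances roughly as follows. The exact definitions used for the card are not documented, so this sketch makes assumptions: duplicates are exact string matches counted beyond the first occurrence, words are whitespace-separated tokens (punctuation stays attached), and ellipsis means `...` or `…`. `quality_metrics` is a hypothetical helper.

```python
def quality_metrics(utterances):
    """Compute simple corpus-level quality metrics for a list of strings."""
    n = len(utterances)
    if n == 0:
        raise ValueError("utterances must not be empty")

    # Exact-match duplicates: every copy after the first counts once.
    duplicates = n - len(set(utterances))
    # Last whitespace-separated token of each non-empty utterance.
    last_words = {u.split()[-1] for u in utterances if u.split()}
    with_ellipsis = sum(1 for u in utterances if "..." in u or "…" in u)

    return {
        "duplicate_rate": duplicates / n,
        "unique_last_words": len(last_words),
        "ellipsis_rate": with_ellipsis / n,
        "avg_words": sum(len(u.split()) for u in utterances) / n,
        "avg_chars": sum(len(u) for u in utterances) / n,
    }
```

Applying the function to the `utterance` column of all three splits combined would give figures comparable to the ones listed above, subject to the definitional assumptions.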
### Domain Coverage
The dataset covers 8 conversation domains:
- Restaurant (reservations, food ordering)
- Banking (account inquiries, transactions)
- Hospitality (hotel bookings, travel)
- Healthcare (appointments, health inquiries)
- Social (friends, family conversations)
- Retail (shopping, negotiations)
- Transportation (car rental, rides)
- Professional (job interviews, work discussions)
## Intended Uses
### Primary Use Cases
1. **Fine-tuning EOU detection models** for Arabic voice agents
2. **Training real-time conversational AI** systems for the Saudi market
3. **Benchmarking Arabic NLP models** on the EOU detection task
4. **LiveKit agent integration** for production voice applications
### Out-of-Scope Uses
- General Arabic language modeling (dataset is specific to EOU detection)
- Non-Saudi Arabic dialects (optimized for Najdi dialect)
- Long-form text classification (utterances are short, 1-12 words)
## Limitations
1. **Synthetic Data:** Generated by LLMs, not human-annotated
2. **Duplicate Rate:** 5.55% duplicates (above ideal <1% threshold)
3. **Style-Label Imbalance:** Formal utterances are 93.7% EOU and ASR-like utterances are 84.5% non-EOU (intended to reflect realistic patterns; note that style is therefore strongly correlated with the label)
4. **Dialect Specificity:** Optimized for the Saudi Najdi dialect; may not generalize to other Arabic dialects
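To gauge how much the style-label imbalance matters, the per-style label counts can be estimated from the percentages and totals stated in this card. The sketch below is arithmetic only; because the published percentages are rounded, the derived counts are approximate, and the "style-only" classifier is a hypothetical baseline, not something the card evaluates.

```python
# Figures stated in this card.
STYLE_COUNTS = {"formal": 4163, "informal": 3674, "asr_like": 2163}
TOTAL_EOU = 6055
FORMAL_EOU_RATE = 0.937    # formal utterances labeled EOU
ASR_NON_EOU_RATE = 0.845   # ASR-like utterances labeled non-EOU

# Approximate EOU counts per style (rounded, since the rates are rounded).
formal_eou = round(STYLE_COUNTS["formal"] * FORMAL_EOU_RATE)
asr_eou = round(STYLE_COUNTS["asr_like"] * (1 - ASR_NON_EOU_RATE))
informal_eou = TOTAL_EOU - formal_eou - asr_eou
informal_eou_rate = informal_eou / STYLE_COUNTS["informal"]

# Hypothetical baseline: predict the majority label of each style, ignoring the text.
eou_by_style = {"formal": formal_eou, "informal": informal_eou, "asr_like": asr_eou}
correct = sum(
    max(eou, STYLE_COUNTS[style] - eou) for style, eou in eou_by_style.items()
)
style_only_accuracy = correct / sum(STYLE_COUNTS.values())

print(f"informal EOU rate: {informal_eou_rate:.3f}")
print(f"style-only baseline accuracy: {style_only_accuracy:.3f}")
```

Under these assumptions the informal style is close to evenly split between labels, while a classifier that looks only at `style` would score well above the majority-class baseline. Reporting accuracy per style helps reveal whether a model has learned completeness cues or only style cues.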
## Ethical Considerations
- **Synthetic Generation:** No personally identifiable information (PII)
- **Cultural Sensitivity:** Avoids real names, brands, or sensitive topics
- **Bias Mitigation:** Actively eliminates punctuation bias and word-based crutches
- **Transparency:** Full generation process documented
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{arabic_eou_detection_10k,
title={Arabic End-of-Utterance Detection Dataset (Saudi Dialect)},
author={MrEzzat},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/datasets/MrEzzat/arabic-eou-detection-10k}
}
```
## License
This dataset is released under the **MIT License**.
## Dataset Card Authors
- **Created by:** MrEzzat
- **Date:** December 2025
- **Version:** 1.0
## Additional Information
### Dataset Curators
This dataset was created as part of the HAMS (Arabic EOU Detection) project for LiveKit voice agent integration.
### Funding
Self-funded research project.
### Contact
For questions or feedback, please open an issue on the [GitHub repository](https://github.com/Ahmed-Ezzat20/hams_task).
---
**Quality Score:** 85.8/100 (GOOD - Ready for training)
**Status:** Production-ready for fine-tuning Arabic EOU detection models.
Provider: MrEzzat



