MrEzzat/arabic-eou-detection-10k

Name: MrEzzat/arabic-eou-detection-10k
Creator: MrEzzat
Published: 2025-12-11 01:18:51
License: MIT

Source: Hugging Face. Updated 2025-12-11; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/MrEzzat/arabic-eou-detection-10k

Description:

---
language:
- ar
license: mit
size_categories:
- 1K<n<10K
task_categories:
- text-classification
task_ids:
- sentiment-classification
tags:
- arabic
- saudi-dialect
- end-of-utterance
- eou-detection
- voice-agent
- conversational-ai
- najdi-dialect
pretty_name: Arabic End-of-Utterance Detection Dataset (Saudi Dialect)
---

# Arabic End-of-Utterance (EOU) Detection Dataset

## Dataset Description

This dataset contains **10,000 high-quality Arabic utterances** specifically designed for training End-of-Utterance (EOU) detection models for voice agents and conversational AI systems, with a focus on **Saudi Arabic (Najdi dialect)**.

### Dataset Summary

- **Language:** Arabic (Saudi Najdi dialect)
- **Task:** Binary text classification (EOU detection)
- **Size:** 10,000 samples
- **Splits:** Train (70%), Validation (15%), Test (15%)
- **Quality Score:** 85.8/100

### Supported Tasks

- **End-of-Utterance Detection:** Classify whether an utterance is complete (EOU) or incomplete (non-EOU)
- **Voice Agent Development:** Train models for real-time EOU detection in conversational AI
- **Saudi Arabic NLP:** Fine-tune models for Saudi dialect understanding

## Dataset Structure

### Data Instances

Each instance contains:

- `utterance`: The Arabic text utterance
- `style`: One of `informal`, `formal`, or `asr_like`
- `label`: Binary label (1 = EOU/complete, 0 = non-EOU/incomplete)

Example:

```json
{
  "utterance": "هل أقدر أحجز طاولة اليوم؟",
  "style": "formal",
  "label": 1
}
```

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `utterance` | string | Arabic text utterance (Saudi dialect) |
| `style` | string | Style of utterance: `informal` (Najdi slang), `formal` (MSA-infused), `asr_like` (simulated ASR imperfections) |
| `label` | int | Binary label: `1` = EOU (complete utterance), `0` = non-EOU (incomplete utterance) |

### Data Splits

| Split | Samples | Percentage |
|-------|---------|------------|
| Train | 7,000 | 70% |
| Validation | 1,500 | 15% |
| Test | 1,500 | 15% |

## Dataset Creation

### Source Data

This dataset was synthetically generated using large language models (LLMs) with carefully engineered prompts to ensure:

- Authentic Saudi Arabic (Najdi dialect) patterns
- Balanced label distribution (60% EOU, 40% non-EOU)
- Zero ellipsis bias (no punctuation crutches)
- High vocabulary diversity
- Realistic incomplete utterances

### Data Collection Process

1. **Prompt Engineering:** Designed expert-level system prompts with EOU-aware generation rules
2. **Multi-Style Generation:** Created three distinct styles:
   - **Informal:** Natural Saudi Najdi slang and Gulf phrasing
   - **Formal:** MSA-infused Saudi Arabic for professional contexts
   - **ASR-like:** Simulated ASR imperfections (vowel drops, character swaps, word merges)
3. **Quality Validation:** Rigorous quality checks for duplicates, bias patterns, and label distribution
4. **Stratified Splitting:** Train/val/test splits maintain label and style distributions

### Annotations

The dataset uses synthetic annotations generated by LLMs with the following labeling rules:

**EOU (label=1) - Complete Utterances:**

- Complete questions: "متى يوصل الطلب؟" (When will the order arrive?)
- Complete statements: "أنا موافق على الشروط" (I agree to the terms)
- Complete requests: "ممكن تساعدني؟" (Can you help me?)

**Non-EOU (label=0) - Incomplete Utterances:**

- Trailing phrases: "بس لازم نتفق أول" (But we need to agree first)
- Incomplete questions: "هل تقدر تشوف" (Can you see...)
- Mid-thought fillers: "يعني أنا أقصد" (I mean I mean...)
- Trailing conjunctions: "خلاص فهمت، بس" (Okay I understood, but...)

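
The stratified splitting step listed under Data Collection Process can be illustrated with a small standard-library sketch. This is not the author's actual splitting code, which the card does not show; the function name `stratified_split`, the seed, and the placeholder records are assumptions made for illustration. It groups records by `(label, style)` and cuts each group 70/15/15 so that every split keeps roughly the same label and style mix.

```python
import random
from collections import defaultdict


def stratified_split(records, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split records into train/validation/test, stratified on (label, style).

    Hypothetical helper: the card only states that splits preserve the
    label and style distributions, not how the split was implemented.
    """
    rng = random.Random(seed)

    # Bucket records by the two fields the card says are preserved.
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["label"], rec["style"])].append(rec)

    train, val, test = [], [], []
    for key in sorted(groups):
        bucket = groups[key]
        rng.shuffle(bucket)
        n_train = round(len(bucket) * fractions[0])
        n_val = round(len(bucket) * fractions[1])
        train.extend(bucket[:n_train])
        val.extend(bucket[n_train:n_train + n_val])
        # Whatever remains in the bucket goes to the test split.
        test.extend(bucket[n_train + n_val:])
    return train, val, test


# Placeholder records (not real dataset rows): 20 per (label, style) pair.
demo_records = [
    {"utterance": f"utt-{label}-{style}-{i}", "style": style, "label": label}
    for label in (0, 1)
    for style in ("informal", "formal", "asr_like")
    for i in range(20)
]

train, val, test = stratified_split(demo_records)
print(len(train), len(val), len(test))
```

Splitting inside each group, rather than shuffling the whole dataset once, matters here because the card reports a strong correlation between style and label, so a plain random split could shift that mix between train and test.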
## Dataset Statistics

### Label Distribution

- **EOU (label=1):** 6,055 samples (60.55%)
- **Non-EOU (label=0):** 3,945 samples (39.45%)

### Style Distribution

- **Formal:** 4,163 samples (41.63%)
- **Informal:** 3,674 samples (36.74%)
- **ASR-like:** 2,163 samples (21.63%)

### Quality Metrics

- **Duplicates:** 5.55% (555 samples)
- **Unique last words:** 3,885 (38.85% of dataset)
- **Ellipsis bias:** 0% (no punctuation crutches)
- **Average word count:** 5.77 words
- **Average character count:** 28.91 characters

### Domain Coverage

The dataset covers 8 conversation domains:

- Restaurant (reservations, food ordering)
- Banking (account inquiries, transactions)
- Hospitality (hotel bookings, travel)
- Healthcare (appointments, health inquiries)
- Social (friends, family conversations)
- Retail (shopping, negotiations)
- Transportation (car rental, rides)
- Professional (job interviews, work discussions)

## Intended Uses

### Primary Use Cases

1. **Fine-tuning EOU detection models** for Arabic voice agents
2. **Training real-time conversational AI** systems for Saudi market
3. **Benchmarking Arabic NLP models** on EOU detection task
4. **LiveKit agent integration** for production voice applications

### Out-of-Scope Uses

- General Arabic language modeling (dataset is specific to EOU detection)
- Non-Saudi Arabic dialects (optimized for Najdi dialect)
- Long-form text classification (utterances are short, 1-12 words)

## Limitations

1. **Synthetic Data:** Generated by LLMs, not human-annotated
2. **Duplicate Rate:** 5.55% duplicates (above ideal <1% threshold)
3. **Style-Label Imbalance:** Formal style is 93.7% EOU, ASR-like is 84.5% non-EOU (reflects realistic patterns)
4. **Dialect Specificity:** Optimized for Saudi Najdi dialect, may not generalize to other Arabic dialects

## Ethical Considerations

- **Synthetic Generation:** No personally identifiable information (PII)
- **Cultural Sensitivity:** Avoids real names, brands, or sensitive topics
- **Bias Mitigation:** Actively eliminates punctuation bias and word-based crutches
- **Transparency:** Full generation process documented

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{arabic_eou_detection_10k,
  title={Arabic End-of-Utterance Detection Dataset (Saudi Dialect)},
  author={MrEzzat},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/MrEzzat/arabic-eou-detection-10k}
}
```

## License

This dataset is released under the **MIT License**.

## Dataset Card Authors

- **Created by:** MrEzzat
- **Date:** December 2025
- **Version:** 1.0

## Additional Information

### Dataset Curators

This dataset was created as part of the HAMS (Arabic EOU Detection) project for LiveKit voice agent integration.

### Funding

Self-funded research project.

### Contact

For questions or feedback, please open an issue on the [GitHub repository](https://github.com/Ahmed-Ezzat20/hams_task).

---

**Quality Score:** 85.8/100 (GOOD - Ready for training)
**Status:** Production-ready for fine-tuning Arabic EOU detection models.
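
The figures under Quality Metrics (duplicate rate, unique last words, ellipsis rate, average length) could be recomputed along the following lines. This is a minimal sketch, not the scoring script behind the reported numbers: the function name `quality_metrics` is hypothetical, and the definitions used (a duplicate is any repeat of an exact utterance after stripping whitespace; an ellipsis is a trailing `...` or `…`) are assumptions, since the card does not define them. The sample rows reuse utterances quoted in the Annotations section, with one deliberately repeated; `style` is omitted because the card does not state it for those examples.

```python
from collections import Counter


def quality_metrics(records):
    """Recompute simple dataset-quality statistics from a list of records.

    Each record is a dict with at least `utterance` (str) and `label` (0/1).
    The metric definitions are assumptions, not taken from the card.
    """
    if not records:
        raise ValueError("records must not be empty")

    n = len(records)
    utterances = [r["utterance"].strip() for r in records]
    words = [u.split() for u in utterances]

    # Every occurrence beyond the first copy of an utterance is a duplicate.
    duplicates = n - len(Counter(utterances))
    last_words = {w[-1] for w in words if w}
    labels = Counter(r["label"] for r in records)

    return {
        "n": n,
        "duplicate_rate": duplicates / n,
        "unique_last_words": len(last_words),
        "eou_share": labels[1] / n,
        "avg_words": sum(len(w) for w in words) / n,
        "avg_chars": sum(len(u) for u in utterances) / n,
        "ellipsis_rate": sum(u.endswith(("...", "…")) for u in utterances) / n,
    }


# Utterances quoted in the card, plus one intentional repeat.
sample = [
    {"utterance": "متى يوصل الطلب؟", "label": 1},
    {"utterance": "أنا موافق على الشروط", "label": 1},
    {"utterance": "ممكن تساعدني؟", "label": 1},
    {"utterance": "هل تقدر تشوف", "label": 0},
    {"utterance": "خلاص فهمت، بس", "label": 0},
    {"utterance": "ممكن تساعدني؟", "label": 1},
]

metrics = quality_metrics(sample)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

To run this on the real data, the records would first have to be loaded, for example with `load_dataset("MrEzzat/arabic-eou-detection-10k")` from the Hugging Face `datasets` library; the split names that call exposes are not stated in the card and should be checked before relying on them.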

Provider:

MrEzzat
