
lihwak74/egyptian-dialogue

Source: Hugging Face. Updated 2026-03-18. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/lihwak74/egyptian-dialogue
Description:
---
language:
- ar
- en
license: cc-by-4.0
task_categories:
- translation
- text-generation
tags:
- egyptian-arabic
- dialect
- colloquial
- ar_EG
- translation
- dialogue
- subtitles
- domain-classification
pretty_name: Egyptian Arabic Dialogue Dataset
size_categories:
- 1K<n<10K
---

# Egyptian Arabic Dialogue Dataset

## Dataset Description

This dataset contains **4,322 parallel Egyptian Arabic-English dialogue pairs** with automatic domain classification. The data is extracted from TV series subtitles and features natural conversational Egyptian Arabic dialect (العامية المصرية).

### Languages

- **Source**: Egyptian Arabic (ar_EG) - Colloquial dialect
- **Target**: English (en)

## Dataset Summary

Egyptian Arabic is one of the most widely spoken Arabic dialects, used by over 100 million speakers. This dataset provides:

- Natural conversational dialogue
- Colloquial expressions and idioms
- Domain-classified content for specialized training
- Episode context for narrative understanding

## Dataset Structure

### Data Format

Each entry contains:

```json
{
  "id": "ep01_line0001",
  "arabic": "خلاويص؟",
  "english": "Ready or not?",
  "episode": 1,
  "dialect": "egyptian",
  "language": "ar",
  "language_variant": "ar_EG",
  "genre": "dialogue",
  "domain": "general"
}
```

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Unique identifier (format: epXX_lineYYYY) |
| `arabic` | string | Egyptian Arabic text |
| `english` | string | English translation |
| `episode` | int | Episode number (for context) |
| `dialect` | string | Dialect identifier (always "egyptian") |
| `language` | string | ISO language code (always "ar") |
| `language_variant` | string | Specific variant code (always "ar_EG") |
| `genre` | string | Content genre (dialogue/narration) |
| `domain` | string | Auto-detected content domain |

## Dataset Statistics

### Overview

- **Total Entries**: 4,322
- **Episodes**: 6
- **Unique Domains**: 18
- **Unique Genres**: 2
- **Average Arabic Length**: 25.9 characters
- **Average English Length**: 35.0 characters

### Domain Distribution

| Domain | Count | Percentage |
|--------|-------|------------|
| general | 2,143 | 49.6% |
| technology | 531 | 12.3% |
| family | 368 | 8.5% |
| horror | 281 | 6.5% |
| medical | 233 | 5.4% |
| romance | 136 | 3.1% |
| weather | 115 | 2.7% |
| food | 104 | 2.4% |
| paranormal | 86 | 2.0% |
| social | 55 | 1.3% |

The table lists the ten largest of the 18 domains (4,052 entries); the remaining eight domains account for the other 270 entries.

### Episode Distribution

| Episode | Entries |
|---------|---------|
| Episode 1 | 889 |
| Episode 2 | 782 |
| Episode 3 | 584 |
| Episode 4 | 907 |
| Episode 5 | 554 |
| Episode 6 | 606 |

### Genre Distribution

- **dialogue**: 4,301 (99.5%)
- **narration**: 21 (0.5%)

## Domains Explained

This dataset includes **automatic domain classification** using keyword-based detection:

- **general** - Everyday conversation without specific domain
- **family** - Family relationships, relatives, marriage
- **horror** - Scary themes, ghosts, supernatural fear
- **medical** - Healthcare, doctors, treatment
- **technology** - Computers, phones, internet, apps
- **romance** - Love, relationships, emotions
- **paranormal** - Mysterious, unexplained phenomena
- **weather** - Climate, meteorology, temperature
- **food** - Cooking, restaurants, meals
- **social** - Friends, gatherings, social life
- **crime** - Police, investigation, law enforcement
- **education** - Schools, universities, learning
- **sports** - Games, matches, tournaments
- **entertainment** - Movies, series, cinema
- **legal** - Law, court, legal matters
- **news** - Journalism, reports, media
- **business** - Companies, economy, trading
- **politics** - Government, elections, policy

## Use Cases

### ✅ Recommended Use Cases

- **Egyptian Arabic Translation**: Train translation models specifically for Egyptian dialect
- **Domain-Specific Models**: Train models for specific domains (medical, legal, etc.)
- **Dialect Studies**: Research on Egyptian Arabic characteristics
- **Conversational AI**: Build chatbots for Egyptian users
- **Language Modeling**: Pre-train or fine-tune on Egyptian dialect
- **Multi-Domain Learning**: Train models aware of content domains

### ⚠️ Limitations

- **Domain Scope**: Limited to entertainment/dialogue domain content
- **Register**: Conversational/informal language only
- **Size**: 4,322 entries (relatively small for large-scale pre-training)
- **Dialect Variation**: Egyptian Arabic has regional sub-dialects not captured
- **Context**: Individual dialogue lines may lack broader narrative context

## Loading the Dataset

### Using Hugging Face Datasets

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("fr3on/egyptian-dialogue")

# Access the data
print(dataset['train'][0])

# Filter by domain
medical_data = dataset['train'].filter(lambda x: x['domain'] == 'medical')

# Filter by episode
episode_1 = dataset['train'].filter(lambda x: x['episode'] == 1)
```

### Using Pandas

```python
import pandas as pd

# Load Parquet file directly
df = pd.read_parquet("data/train-00000-of-00001.parquet")

# Analyze domains
print(df['domain'].value_counts())

# Filter by domain
medical_df = df[df['domain'] == 'medical']
```

## Training Examples

### Translation Model

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Load dataset
dataset = load_dataset("fr3on/egyptian-dialogue")

# Only a train split is documented in this card, so hold out 10% for evaluation
splits = dataset["train"].train_test_split(test_size=0.1, seed=42)

# Load model for Arabic-English translation
model_name = "Helsinki-NLP/opus-mt-ar-en"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Tokenize source text and targets; text_target fills the "labels" field
def preprocess(examples):
    return tokenizer(
        examples["arabic"],
        text_target=examples["english"],
        truncation=True,
        max_length=128,
    )

tokenized = splits.map(
    preprocess, batched=True, remove_columns=splits["train"].column_names
)

# Train (output_dir is an arbitrary example path)
trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="opus-mt-ar-en-egyptian"),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

### Domain-Aware Training

```python
from datasets import load_dataset

dataset = load_dataset("fr3on/egyptian-dialogue")

# Train separate models per domain
for domain in ['medical', 'legal', 'technology']:
    domain_data = dataset['train'].filter(lambda x: x['domain'] == domain)
    # Train domain-specific model
    print(f"Training {domain} model with {len(domain_data)} examples")
```

## Data Collection & Processing

### Source

- **Origin**: Egyptian TV series subtitles
- **Language**: Professional subtitle translations
- **Quality**: Natural, conversational Egyptian Arabic

### Processing Pipeline

1. **Extraction**: Load from Excel subtitle files
2. **Cleaning**: Remove empty rows, very short entries
3. **Deduplication**: Hash-based duplicate removal (945 duplicates removed)
4. **Domain Detection**: Automatic classification using keyword matching
5. **Genre Classification**: Automatic dialogue vs. narration detection
6. **Validation**: Quality checks and statistics generation

### Data Quality

- ✅ Deduplicated using MD5 hash matching
- ✅ Filtered out entries shorter than 2 characters
- ✅ Removed rows with missing translations
- ✅ Normalized whitespace
- ✅ Validated Arabic and English text pairs

## Considerations for Using the Data

### Egyptian Arabic Characteristics

Egyptian Arabic differs significantly from Modern Standard Arabic (MSA):

- **Vocabulary**: Distinct colloquial words (e.g., إزيك vs. كيف حالك)
- **Grammar**: Simplified structures (e.g., no case endings)
- **Pronunciation**: Different phonetics (e.g., ج pronounced as "g")
- **Script**: Informal spelling conventions in spoken contexts

### Recommended Training Approaches

1. **Fine-tune multilingual models** rather than training from scratch
2. **Combine with MSA data** for better Arabic understanding
3. **Use domain filtering** for specialized applications
4. **Consider episode context** for narrative tasks
5. **Balance domain distribution** if training a general model

### Ethical Considerations

- **Dialect Representation**: Egyptian Arabic is one of many Arabic dialects
- **Cultural Context**: Translations maintain cultural nuances
- **Source Attribution**: Data from TV series subtitles
- **Privacy**: No personal information included

## License

This dataset is released under the **CC BY 4.0 License**.

## Citation

If you use this dataset in your research, please cite:

```bibtex
@dataset{egyptian_dialogue_2026,
  title={Egyptian Arabic Dialogue Dataset},
  author={fr3on},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/fr3on/egyptian-dialogue}
}
```

## Acknowledgments

- Source: Egyptian TV series subtitles
- Processing: Automatic domain detection and classification
- Format: Parquet for efficient loading and storage

## Version History

- **v1.0.0** (2025-12-17): Initial release
  - 4,322 entries
  - 18 domain categories
  - Automatic domain detection
  - Parquet format

---

**Keywords**: Egyptian Arabic, ar_EG, dialect, colloquial, translation, dialogue, domain classification, NLP, machine translation, Arabic dialects, conversational AI, parquet

**Dataset Size**: 4,322 examples | **Format**: Parquet | **License**: CC BY 4.0
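
The Processing Pipeline and Data Quality notes above name whitespace normalization, a minimum-length filter, MD5-based deduplication, and keyword-based domain detection, but the card does not publish the code behind them. The following is a minimal sketch of how such steps could fit together, not the author's implementation. The keyword lists, the choice to hash the Arabic-English pair, the choice to match keywords against the English side, and the function names (`normalize`, `pair_hash`, `detect_domain`, `process`) are all assumptions made for illustration. Only the first sample row comes from the card; the others are invented.

```python
import hashlib
import re

# Hypothetical keyword lists; the dataset card does not publish the real ones.
DOMAIN_KEYWORDS = {
    "medical": ["doctor", "hospital", "medicine"],
    "weather": ["rain", "storm", "temperature"],
    "food": ["restaurant", "cook", "dinner"],
}


def normalize(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def pair_hash(arabic: str, english: str) -> str:
    """MD5 digest of the pair, used only as a duplicate key (assumed scheme)."""
    return hashlib.md5(f"{arabic}\t{english}".encode("utf-8")).hexdigest()


def detect_domain(english: str) -> str:
    """Return the first domain with a whole-word keyword hit, else 'general'."""
    words = set(re.findall(r"[a-z']+", english.lower()))
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if words.intersection(keywords):
            return domain
    return "general"


def process(rows):
    """Clean, deduplicate, and label (arabic, english) pairs."""
    seen = set()
    cleaned = []
    for arabic, english in rows:
        # Drop rows with a missing side (no translation available).
        if arabic is None or english is None:
            continue
        arabic, english = normalize(arabic), normalize(english)
        # Drop entries shorter than 2 characters on either side.
        if len(arabic) < 2 or len(english) < 2:
            continue
        # Skip pairs already seen after normalization.
        key = pair_hash(arabic, english)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(
            {"arabic": arabic, "english": english, "domain": detect_domain(english)}
        )
    return cleaned


SAMPLE_ROWS = [
    ("خلاويص؟", "Ready or not?"),            # example row from the dataset card
    ("  خلاويص؟ ", "Ready  or   not?"),      # invented: duplicate after normalization
    ("فين الدكتور؟", "Where is the doctor?"),  # invented: keyword hit for "medical"
    ("آه", None),                             # invented: missing translation
    ("!", "!"),                               # invented: too short
]

if __name__ == "__main__":
    for entry in process(SAMPLE_ROWS):
        print(entry["domain"], "|", entry["english"])
```

With a first-match rule like this one, a line containing keywords from several domains gets whichever domain comes first in the dictionary, so the ordering of `DOMAIN_KEYWORDS` acts as a priority list. The large `general` share reported in the Domain Distribution table is what one would expect from any keyword approach, since most conversational lines contain no domain keyword at all.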
Provider:
lihwak74