lihwak74/egyptian-dialogue
Source: Hugging Face | Updated: 2026-03-18 | Indexed: 2026-03-29
Download link: https://hf-mirror.com/datasets/lihwak74/egyptian-dialogue
Resource description:
---
language:
- ar
- en
license: cc-by-4.0
task_categories:
- translation
- text-generation
tags:
- egyptian-arabic
- dialect
- colloquial
- ar_EG
- translation
- dialogue
- subtitles
- domain-classification
pretty_name: Egyptian Arabic Dialogue Dataset
size_categories:
- 1K<n<10K
---
# Egyptian Arabic Dialogue Dataset
## Dataset Description
This dataset contains **4,322 parallel Egyptian Arabic-English dialogue pairs** with automatic domain classification. The data is extracted from TV series subtitles and features natural conversational Egyptian Arabic dialect (العامية المصرية).
### Languages
- **Source**: Egyptian Arabic (ar_EG) - Colloquial dialect
- **Target**: English (en)
## Dataset Summary
Egyptian Arabic is one of the most widely spoken Arabic dialects, used by over 100 million speakers. This dataset provides:
- Natural conversational dialogue
- Colloquial expressions and idioms
- Domain-classified content for specialized training
- Episode context for narrative understanding
## Dataset Structure
### Data Format
Each entry contains:
```json
{
  "id": "ep01_line0001",
  "arabic": "خلاويص؟",
  "english": "Ready or not?",
  "episode": 1,
  "dialect": "egyptian",
  "language": "ar",
  "language_variant": "ar_EG",
  "genre": "dialogue",
  "domain": "general"
}
```
### Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Unique identifier (format: epXX_lineYYYY) |
| `arabic` | string | Egyptian Arabic text |
| `english` | string | English translation |
| `episode` | int | Episode number (for context) |
| `dialect` | string | Dialect identifier (always "egyptian") |
| `language` | string | ISO language code (always "ar") |
| `language_variant` | string | Specific variant code (always "ar_EG") |
| `genre` | string | Content genre (dialogue/narration) |
| `domain` | string | Auto-detected content domain |
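
The field table above can be turned into a light sanity check for individual records. The following is a minimal sketch, not part of the dataset tooling: the names `ID_PATTERN`, `FIELD_TYPES`, `parse_entry_id`, and `check_entry` are hypothetical, the digit widths in the id are inferred from the single sample id, and the rule that the episode in the id matches the `episode` field is an assumption based on that sample.

```python
import re
from typing import NamedTuple

# Inferred from the documented "epXX_lineYYYY" format; accepting any number
# of digits is an assumption based on the sample id "ep01_line0001".
ID_PATTERN = re.compile(r"^ep(\d+)_line(\d+)$")

# Field names and Python types as listed in the Data Fields table.
FIELD_TYPES = {
    "id": str, "arabic": str, "english": str, "episode": int,
    "dialect": str, "language": str, "language_variant": str,
    "genre": str, "domain": str,
}


class EntryId(NamedTuple):
    episode: int
    line: int


def parse_entry_id(entry_id: str) -> EntryId:
    """Split an id such as 'ep01_line0001' into episode and line numbers."""
    match = ID_PATTERN.match(entry_id)
    if match is None:
        raise ValueError(f"unexpected id format: {entry_id!r}")
    return EntryId(int(match.group(1)), int(match.group(2)))


def check_entry(entry: dict) -> list:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, expected in FIELD_TYPES.items():
        if name not in entry:
            problems.append(f"missing field: {name}")
        elif not isinstance(entry[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    if not problems:
        try:
            parsed = parse_entry_id(entry["id"])
        except ValueError as error:
            problems.append(str(error))
        else:
            # Assumption: the episode encoded in the id equals the episode field.
            if parsed.episode != entry["episode"]:
                problems.append("episode number in id does not match the episode field")
    return problems


sample = {
    "id": "ep01_line0001", "arabic": "خلاويص؟", "english": "Ready or not?",
    "episode": 1, "dialect": "egyptian", "language": "ar",
    "language_variant": "ar_EG", "genre": "dialogue", "domain": "general",
}
print(check_entry(sample))  # prints []
```

Running `check_entry` over every row before training is a cheap way to catch schema drift if the files are regenerated.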
## Dataset Statistics
### Overview
- **Total Entries**: 4,322
- **Episodes**: 6
- **Unique Domains**: 18
- **Unique Genres**: 2
- **Average Arabic Length**: 25.9 characters
- **Average English Length**: 35.0 characters
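### Domain Distribution
The ten largest of the 18 domains are listed below; the remaining eight together account for 270 entries (6.2%).
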
| Domain | Count | Percentage |
|--------|-------|------------|
| general | 2,143 | 49.6% |
| technology | 531 | 12.3% |
| family | 368 | 8.5% |
| horror | 281 | 6.5% |
| medical | 233 | 5.4% |
| romance | 136 | 3.1% |
| weather | 115 | 2.7% |
| food | 104 | 2.4% |
| paranormal | 86 | 2.0% |
| social | 55 | 1.3% |
### Episode Distribution
| Episode | Entries |
|---------|---------|
| Episode 1 | 889 |
| Episode 2 | 782 |
| Episode 3 | 584 |
| Episode 4 | 907 |
| Episode 5 | 554 |
| Episode 6 | 606 |
### Genre Distribution
- **dialogue**: 4,301 (99.5%)
- **narration**: 21 (0.5%)
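
The counts and percentages in this section can be recomputed from the records themselves. The sketch below uses only the standard library; `distribution` and `average_length` are hypothetical helper names, and measuring length with `len` (Unicode code points) is an assumption about how the card's character averages were computed. The demonstration rebuilds the genre counts reported above rather than loading the real data.

```python
from collections import Counter


def distribution(entries, field):
    """Return (value, count, percentage) tuples, largest count first."""
    counts = Counter(entry[field] for entry in entries)
    total = sum(counts.values())
    return [
        (value, count, round(100 * count / total, 1))
        for value, count in counts.most_common()
    ]


def average_length(entries, field):
    """Mean length of a text field, in Unicode code points (assumed unit)."""
    return round(sum(len(entry[field]) for entry in entries) / len(entries), 1)


# Stand-in rows that reproduce the genre counts reported in this card.
genres = [{"genre": "dialogue"}] * 4301 + [{"genre": "narration"}] * 21
print(distribution(genres, "genre"))
# [('dialogue', 4301, 99.5), ('narration', 21, 0.5)]
```

With the real dataset, passing the list of records and `"domain"` or `"episode"` as the field should give the corresponding tables.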
## Domains Explained
This dataset includes **automatic domain classification** using keyword-based detection:
- **general** - Everyday conversation without specific domain
- **family** - Family relationships, relatives, marriage
- **horror** - Scary themes, ghosts, supernatural fear
- **medical** - Healthcare, doctors, treatment
- **technology** - Computers, phones, internet, apps
- **romance** - Love, relationships, emotions
- **paranormal** - Mysterious, unexplained phenomena
- **weather** - Climate, meteorology, temperature
- **food** - Cooking, restaurants, meals
- **social** - Friends, gatherings, social life
- **crime** - Police, investigation, law enforcement
- **education** - Schools, universities, learning
- **sports** - Games, matches, tournaments
- **entertainment** - Movies, series, cinema
- **legal** - Law, court, legal matters
- **news** - Journalism, reports, media
- **business** - Companies, economy, trading
- **politics** - Government, elections, policy
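
The card does not publish the keyword lists or say which language they were matched against, so the following is only an illustrative sketch of keyword-based detection. The `DOMAIN_KEYWORDS` table, the choice to match on the English text, and the tie-breaking rule are all assumptions made for the example.

```python
import re

# Hypothetical keyword table; the real lists used for this dataset are not published.
DOMAIN_KEYWORDS = {
    "medical": {"doctor", "hospital", "medicine", "treatment"},
    "technology": {"computer", "phone", "internet", "app"},
    "weather": {"rain", "temperature", "storm"},
}


def detect_domain(text: str) -> str:
    """Pick the domain with the most keyword hits, or 'general' if none match."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    best_domain, best_hits = "general", 0
    for domain, keywords in DOMAIN_KEYWORDS.items():
        hits = len(tokens & keywords)
        # Strictly greater: on a tie, the domain listed first in the table wins.
        if hits > best_hits:
            best_domain, best_hits = domain, hits
    return best_domain


print(detect_domain("Call the doctor, he needs treatment"))  # medical
print(detect_domain("Ready or not?"))                        # general
```

Keyword matching of this kind is cheap but context-blind, so a line can be labeled by an incidental word. Spot-checking a sample of each domain before relying on the labels is advisable.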
## Use Cases
### ✅ Recommended Use Cases
- **Egyptian Arabic Translation**: Train translation models specifically for Egyptian dialect
- **Domain-Specific Models**: Train models for specific domains (medical, legal, etc.)
- **Dialect Studies**: Research on Egyptian Arabic characteristics
- **Conversational AI**: Build chatbots for Egyptian users
- **Language Modeling**: Pre-train or fine-tune on Egyptian dialect
- **Multi-Domain Learning**: Train models aware of content domains
### ⚠️ Limitations
- **Domain Scope**: Limited to entertainment/dialogue domain content
- **Register**: Conversational/informal language only
- **Size**: 4,322 entries (relatively small for large-scale pre-training)
- **Dialect Variation**: Egyptian Arabic has regional sub-dialects not captured
- **Context**: Individual dialogue lines may lack broader narrative context
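
One way to soften the context limitation is to attach the preceding lines from the same episode to each entry. The sketch below is an assumption-laden illustration: `add_context` is a hypothetical helper, it relies on the zero-padded `id` sorting in broadcast order, and it ignores the gaps left by deduplication, so "previous" lines are the previous surviving lines rather than guaranteed neighbors.

```python
from collections import defaultdict


def add_context(entries, window=2):
    """Attach up to `window` preceding English lines from the same episode."""
    by_episode = defaultdict(list)
    for entry in entries:
        by_episode[entry["episode"]].append(entry)

    result = []
    for episode in sorted(by_episode):
        # Assumption: zero-padded ids sort in the original line order.
        lines = sorted(by_episode[episode], key=lambda e: e["id"])
        for i, entry in enumerate(lines):
            previous = [p["english"] for p in lines[max(0, i - window):i]]
            result.append({**entry, "context": previous})
    return result


toy = [
    {"id": "ep01_line0002", "english": "B", "episode": 1},
    {"id": "ep02_line0001", "english": "D", "episode": 2},
    {"id": "ep01_line0001", "english": "A", "episode": 1},
    {"id": "ep01_line0005", "english": "C", "episode": 1},
]
for row in add_context(toy):
    print(row["id"], row["context"])
```

Context never crosses an episode boundary, so the first line of each episode gets an empty list.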
## Loading the Dataset
### Using Hugging Face Datasets
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("fr3on/egyptian-dialogue")
# Access the data
print(dataset['train'][0])
# Filter by domain
medical_data = dataset['train'].filter(lambda x: x['domain'] == 'medical')
# Filter by episode
episode_1 = dataset['train'].filter(lambda x: x['episode'] == 1)
```
### Using Pandas
```python
import pandas as pd
# Load Parquet file directly
df = pd.read_parquet("data/train-00000-of-00001.parquet")
# Analyze domains
print(df['domain'].value_counts())
# Filter to a single domain
medical_df = df[df['domain'] == 'medical']
```
## Training Examples
### Translation Model
```python
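from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Load dataset and hold out 10% for evaluation
# (only a train split is assumed to exist)
dataset = load_dataset("fr3on/egyptian-dialogue")
splits = dataset['train'].train_test_split(test_size=0.1, seed=42)

# Load model for Arabic-English translation
model_name = "Helsinki-NLP/opus-mt-ar-en"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Tokenize; text_target tokenizes the English side as the target language
def preprocess(examples):
    inputs = tokenizer(examples['arabic'], truncation=True, max_length=128)
    targets = tokenizer(text_target=examples['english'], truncation=True, max_length=128)
    inputs['labels'] = targets['input_ids']
    return inputs

tokenized = splits.map(preprocess, batched=True, remove_columns=splits['train'].column_names)

# Train
trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="opus-mt-ar-en-egyptian"),  # example output path
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),  # pads inputs and labels per batch
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['test'],
)
trainer.train()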
```
### Domain-Aware Training
```python
from datasets import load_dataset

dataset = load_dataset("fr3on/egyptian-dialogue")

# Train separate models per domain
for domain in ['medical', 'legal', 'technology']:
    domain_data = dataset['train'].filter(lambda x: x['domain'] == domain)
    # Train domain-specific model
    print(f"Training {domain} model with {len(domain_data)} examples")
```
## Data Collection & Processing
### Source
- **Origin**: Egyptian TV series subtitles
- **Translations**: Professional subtitle translations
- **Quality**: Natural, conversational Egyptian Arabic
### Processing Pipeline
1. **Extraction**: Load from Excel subtitle files
2. **Cleaning**: Remove empty rows, very short entries
3. **Deduplication**: Hash-based duplicate removal (945 duplicates removed)
4. **Domain Detection**: Automatic classification using keyword matching
5. **Genre Classification**: Automatic dialogue vs. narration detection
6. **Validation**: Quality checks and statistics generation
### Data Quality
- ✅ Deduplicated using MD5 hash matching
- ✅ Removed entries shorter than 2 characters
- ✅ Removed rows with missing translations
- ✅ Normalized whitespace
- ✅ Validated Arabic and English text pairs
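
The cleaning and deduplication steps listed above could look roughly like the sketch below. It is not the pipeline used to build this dataset: `clean_pairs` and `normalize` are hypothetical names, the order of the steps is assumed, and hashing the normalized Arabic and English text together is a guess, since the card does not say which fields the MD5 hash covers.

```python
import hashlib

MIN_CHARS = 2  # the card states that entries under 2 characters were filtered


def normalize(text: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


def clean_pairs(pairs):
    """Drop incomplete, too-short, and duplicate (arabic, english) pairs."""
    seen = set()
    kept = []
    for arabic, english in pairs:
        if not arabic or not english:
            continue  # missing translation
        arabic, english = normalize(arabic), normalize(english)
        if len(arabic) < MIN_CHARS or len(english) < MIN_CHARS:
            continue  # very short entry
        # Assumption: duplicates are detected on the normalized pair as a whole.
        digest = hashlib.md5(f"{arabic}\t{english}".encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append((arabic, english))
    return kept


raw = [
    ("خلاويص؟", "Ready  or not?"),
    ("خلاويص؟", "Ready or not?"),   # duplicate after whitespace normalization
    ("", "No source text"),
    ("ا", "a"),                      # too short
    (None, "Hello"),
]
print(clean_pairs(raw))
```

Normalizing before hashing matters here: the first two toy rows differ only in spacing and collapse to a single entry.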
## Considerations for Using the Data
### Egyptian Arabic Characteristics
Egyptian Arabic differs significantly from Modern Standard Arabic (MSA):
- **Vocabulary**: Distinct colloquial words (e.g., إزيك vs. كيف حالك)
- **Grammar**: Simplified structures (e.g., no case endings)
- **Pronunciation**: Different phonetics (e.g., ج pronounced as "g")
- **Script**: Informal, non-standardized spelling when the dialect is written down
### Recommended Training Approaches
1. **Fine-tune multilingual models** rather than training from scratch
2. **Combine with MSA data** for better Arabic understanding
3. **Use domain filtering** for specialized applications
4. **Consider episode context** for narrative tasks
5. **Balance domain distribution** if training a general model
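
Since roughly half of the entries are labeled `general`, one simple way to balance domains is to cap the number of examples taken from each. The following sketch assumes downsampling is acceptable for the task; `cap_per_domain` and its parameters are hypothetical, and other strategies such as upsampling or loss weighting may suit small domains better.

```python
import random
from collections import Counter, defaultdict


def cap_per_domain(entries, max_per_domain, seed=0):
    """Randomly keep at most `max_per_domain` entries from each domain."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_domain = defaultdict(list)
    for entry in entries:
        by_domain[entry["domain"]].append(entry)

    balanced = []
    for items in by_domain.values():
        if len(items) > max_per_domain:
            items = rng.sample(items, max_per_domain)
        balanced.extend(items)
    rng.shuffle(balanced)  # avoid domain-ordered batches
    return balanced


toy = [{"domain": "general"}] * 10 + [{"domain": "medical"}] * 3
balanced = cap_per_domain(toy, max_per_domain=4)
print(Counter(entry["domain"] for entry in balanced))
```

Domains already under the cap are kept whole, so the cap only trims the dominant classes.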
### Ethical Considerations
- **Dialect Representation**: Egyptian Arabic is one of many Arabic dialects
- **Cultural Context**: Translations maintain cultural nuances
- **Source Attribution**: Data from TV series subtitles
- **Privacy**: No personal information included
## License
This dataset is released under the **CC BY 4.0 License**.
## Citation
If you use this dataset in your research, please cite:
```bibtex
@dataset{egyptian_dialogue_2026,
  title={Egyptian Arabic Dialogue Dataset},
  author={fr3on},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/fr3on/egyptian-dialogue}
}
```
## Acknowledgments
- Source: Egyptian TV series subtitles
- Processing: Automatic domain detection and classification
- Format: Parquet for efficient loading and storage
## Version History
- **v1.0.0** (2025-12-17): Initial release
  - 4,322 entries
  - 18 domain categories
  - Automatic domain detection
  - Parquet format
---
**Keywords**: Egyptian Arabic, ar_EG, dialect, colloquial, translation, dialogue, domain classification, NLP, machine translation, Arabic dialects, conversational AI, parquet
**Dataset Size**: 4,322 examples | **Format**: Parquet | **License**: CC BY 4.0
Provided by: lihwak74



