fr3on/arabic-dialect-corpus

Name: fr3on/arabic-dialect-corpus
Creator: fr3on
Published: 2026-01-16 18:41:31
License: MIT

Source: Hugging Face | Updated: 2026-01-16 | Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/fr3on/arabic-dialect-corpus

Description:

---
language:
- ar
- arz
tags:
- arabic
- egyptian
- saudi
- dialect
- colloquial
- youtube
- comments
- nlp
- text-generation
- dialect-classification
license: mit
task_categories:
- text-generation
- text-classification
size_categories:
- 100K<n<1M
pretty_name: Arabic Dialect Corpus (Egyptian & Saudi)
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label
    dtype: string
  - name: score
    dtype: int64
  splits:
  - name: train
    num_bytes: 255365803
    num_examples: 1991193
  download_size: 123219948
  dataset_size: 255365803
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# 🇪🇬🇸🇦 Arabic Dialect Corpus (Egyptian & Saudi)

## Dataset Description

This dataset contains **150K+ natural, informal Arabic text samples** scraped from high-engagement YouTube discussions. It specifically targets **Egyptian (EG)** and **Saudi (SA)** dialects, filling a critical gap in resources for training LLMs on colloquial Arabic (*Ammiya*) rather than just Modern Standard Arabic (MSA).

### Languages

* **Primary Dialects**:
  - Egyptian Arabic (EG) - Cairene and regional Egyptian variants
  - Saudi Arabic (SA) - Najdi and Hijazi Gulf variants
  - General Arabic (AR) - Mixed or pan-dialectal colloquial Arabic
* **Script**: Arabic script with colloquial spelling conventions
* **Type**: Informal, conversational text

## Dataset Summary

Modern Arabic exists on a spectrum from formal Modern Standard Arabic (MSA) to highly localized dialects. While MSA dominates written content, colloquial dialects (*Ammiya*) dominate everyday communication, social media, and informal contexts.
This dataset provides:

* **Authentic dialect data**: Real conversations from native speakers
* **Regional coverage**: Two major Arabic dialect groups (Egyptian and Gulf)
* **Simple labeling**: Clean 3-field schema (text, label, score)
* **Quality filtering**: Community-validated content via engagement metrics
* **Training-ready format**: JSONL optimized for streaming workflows

## Dataset Structure

### Data Format

Each entry contains:

```json
{
  "text": "يا جدعان الفيديو ده تحفة بجد بس محتاج شوية تظبيط في الصوت",
  "label": "EG",
  "score": 45
}
```

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `text` | string | Cleaned Arabic comment text (colloquial dialect) |
| `label` | string | Dialect label: "EG" (Egyptian), "SA" (Saudi), or "AR" (General Arabic) |
| `score` | int64 | Community engagement score (like count) |

## Dataset Statistics

### Overview

* **Total Entries**: ~150,000+
* **Source Platform**: YouTube
* **Content Type**: User comments and discussions
* **Dialect Coverage**: Egyptian and Saudi Arabian variants
* **Average Text Length**: 15-80 words per entry
* **Quality Range**: Filtered for minimum engagement and coherence

### Label Distribution

| Label | Description | Percentage |
|-------|-------------|------------|
| `EG` | Egyptian Arabic (Cairene and regional variants) | ~60% |
| `SA` | Saudi Arabic (Najdi, Hijazi variants) | ~35% |
| `AR` | General colloquial Arabic (mixed or unidentified) | ~5% |

### Content Distribution

The dataset draws from multiple video categories to ensure diverse vocabulary and contexts:

* **Talk Shows & Podcasts**: 35%
* **Technology Reviews**: 25%
* **Entertainment & Comedy**: 20%
* **Social Commentary**: 15%
* **Other**: 5%

## Dialect Information

### Label Classification

The `label` field indicates the dialect type:

* **EG**: Egyptian Arabic markers detected (e.g., إزيك, يعني, عايز, كده, بتاع)
* **SA**: Saudi/Gulf Arabic markers detected (e.g., وش, كيف, عندك, ياخي, حق)
* **AR**: Mixed or unclear
dialectal markers, general colloquial Arabic

**Note**: Classification is automatic and based on dialectal keywords, video metadata, and linguistic patterns. Some entries may contain mixed dialects due to code-switching or regional overlap.

### Egyptian Arabic (EG)

Egyptian Arabic is the most widely understood Arabic dialect due to Egypt's large population (~100M speakers) and cultural influence through media.

**Characteristics**:

* Simplified verb conjugations (no dual forms in verbs)
* Distinct pronunciation (ج as "g", ق as glottal stop)
* Unique vocabulary (e.g., إزيك for "how are you")
* Heavy use of particles like يعني, بقى, كده

### Saudi Arabic (SA)

Includes Najdi (Central) and Hijazi (Western) variants spoken by ~30M people.

**Characteristics**:

* Pronunciation closer to classical for ج (as "j"), while ق is commonly realized as "g" in everyday speech
* Gulf-specific vocabulary and expressions
* Different question words (وش for "what")
* Distinct verb patterns and negation structures

## Use Cases

### ✅ Recommended Use Cases

* **Dialect Adaptation**: Fine-tune base LLMs (Llama, Mistral, Qwen) for Egyptian/Saudi understanding
* **Continued Pre-training**: Augment model knowledge with colloquial Arabic
* **Sentiment Analysis**: Build classifiers for social monitoring in Egypt and KSA
* **Dialect Identification**: Train discriminators to distinguish regional variants (EG vs SA vs AR)
* **Code-Switching Research**: Study Arabic-English language mixing patterns
* **Cultural NLP**: Analyze slang, humor, and regional expressions
* **Multi-Dialect Models**: Train models that understand multiple Arabic varieties

### ⚠️ Limitations

* **Platform Bias**: YouTube demographics skew younger and more tech-savvy
* **Topic Bias**: Over-representation of entertainment and tech content
* **Register**: Primarily informal; limited formal or professional language
* **Dialect Mixing**: Contains code-switching (Arabic-English) and occasional MSA
* **Size**: Moderate scale (150K) - suitable for fine-tuning but not pre-training
from scratch
* **Temporal**: Reflects 2023-2024 language usage and cultural references

## Loading the Dataset

### Using Hugging Face Datasets

```python
from datasets import load_dataset

# Load the entire dataset
dataset = load_dataset("fr3on/arabic-dialect-corpus")

# Access training data
print(f"Dataset size: {len(dataset['train'])} examples")
print(dataset['train'][0])
# Example output:
# {
#   'text': 'يا جدعان الفيديو ده تحفة بجد',
#   'label': 'EG',
#   'score': 45
# }

# Iterate through the first few examples
for example in dataset['train'].select(range(5)):
    print(example['text'])
    print(f"Dialect: {example['label']}")
    print(f"Quality score: {example['score']}")
```

### Streaming Mode (for large-scale training)

```python
from datasets import load_dataset

# Enable streaming for memory-efficient loading
dataset = load_dataset(
    "fr3on/arabic-dialect-corpus",
    split="train",
    streaming=True
)

# take(1000) yields the first 1,000 individual examples (not batches)
for example in dataset.take(1000):
    # Your training code here
    pass
```

### Filter by Dialect

```python
from datasets import load_dataset

# Load only Egyptian Arabic samples
dataset = load_dataset("fr3on/arabic-dialect-corpus")
egyptian_data = dataset['train'].filter(
    lambda x: x['label'] == 'EG'
)
print(f"Egyptian subset: {len(egyptian_data)} examples")

# Load only Saudi Arabic samples
saudi_data = dataset['train'].filter(
    lambda x: x['label'] == 'SA'
)
print(f"Saudi subset: {len(saudi_data)} examples")

# General Arabic only
general_data = dataset['train'].filter(
    lambda x: x['label'] == 'AR'
)
print(f"General Arabic subset: {len(general_data)} examples")
```

### Filter by Quality Score

```python
from datasets import load_dataset

# Load only high-engagement content
dataset = load_dataset("fr3on/arabic-dialect-corpus")
high_quality = dataset['train'].filter(
    lambda x: x['score'] >= 50
)
print(f"High-quality subset: {len(high_quality)} examples")
```

## Training Examples

### Continued Language Model Pre-training

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)

# Load dataset
dataset = load_dataset("fr3on/arabic-dialect-corpus")

# Load base model (e.g., Llama 3)
model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Llama tokenizers ship without a pad token; the collator needs one to pad batches
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Tokenize the data
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        max_length=512,
        padding=False
    )

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=['text', 'label', 'score']
)

# Data collator for CLM
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # CLM, not MLM
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./arabic-dialect-clm",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=10_000,
    save_total_limit=2,
    learning_rate=2e-5,
    warmup_steps=500,
    logging_steps=100,
    fp16=True,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset['train'],
)

# Train
trainer.train()
```

### Using with Axolotl

Create a config file `dialect-finetune.yml`:

```yaml
base_model: meta-llama/Meta-Llama-3-8B
model_type: LlamaForCausalLM

# Dataset configuration
datasets:
  - path: fr3on/arabic-dialect-corpus
    type: completion
    field: text

# Training parameters
sequence_len: 512
num_epochs: 3
micro_batch_size: 4
gradient_accumulation_steps: 4
learning_rate: 0.00002

# Output
output_dir: ./outputs/arabic-dialect

# Optimization
fp16: true
flash_attention: true
```

Then run:

```bash
axolotl train dialect-finetune.yml
```

### Dialect-Aware Sentiment Analysis

```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load dataset
dataset = load_dataset("fr3on/arabic-dialect-corpus")

# The corpus has no sentiment annotations (you would need to label these).
# For demonstration only, we bucket the like count as a stand-in label;
# like count measures engagement, not sentiment.
def add_sentiment_label(example):
    score = example['score']
    if score >= 100:
        sentiment = 2  # Positive
    elif score >= 20:
        sentiment = 1  # Neutral
    else:
        sentiment = 0  # Negative
    # Write to a new column so the dialect `label` is not overwritten
    return {'sentiment': sentiment}

labeled_dataset = dataset['train'].map(add_sentiment_label)

# Train sentiment classifier
model = AutoModelForSequenceClassification.from_pretrained(
    "CAMeL-Lab/bert-base-arabic-camelbert-msa",
    num_labels=3
)
```

### Country-Specific Model Training

```python
from datasets import load_dataset

dataset = load_dataset("fr3on/arabic-dialect-corpus")

# Train separate models for each dialect region
dialects = ['EG', 'SA']

for dialect in dialects:
    # Filter by dialect label
    dialect_data = dataset['train'].filter(
        lambda x: x['label'] == dialect
    )

    dialect_name = {'EG': 'Egyptian', 'SA': 'Saudi'}.get(dialect)
    print(f"Training {dialect_name} model with {len(dialect_data)} examples")

    # Your training code here
    # model = train_model(dialect_data)
    # model.save_pretrained(f"./models/arabic-{dialect.lower()}")

# Or train a dialect classifier
def add_dialect_label(example):
    label_map = {'EG': 0, 'SA': 1, 'AR': 2}
    example['label_id'] = label_map[example['label']]
    return example

classifier_data = dataset['train'].map(add_dialect_label)
# Train dialect identification model
```

### Comparative Dialect Analysis

```python
from datasets import load_dataset
from collections import Counter

dataset = load_dataset("fr3on/arabic-dialect-corpus")

# Analyze vocabulary differences
def get_top_words(label, n=100):
    dialect_data = dataset['train'].filter(
        lambda x: x['label'] == label
    )

    all_words = []
    for example in dialect_data:
        words = example['text'].split()
        all_words.extend(words)

    return Counter(all_words).most_common(n)

# Compare Egyptian vs Saudi vocabulary
egypt_words = get_top_words('EG')
saudi_words = get_top_words('SA')

print("Top Egyptian words:", egypt_words[:10])
print("Top Saudi words:", saudi_words[:10])
```

## Data Collection & Processing

### Source

* **Platform**: YouTube public comments
* **Selection Criteria**: Videos
with high engagement (>10K views)
* **Categories**: Talk shows, tech reviews, podcasts, entertainment
* **Date Range**: 2023-2024

### Processing Pipeline

The "Data Lab" processing pipeline has five stages:

1. **Ingestion**
   - API-based scraping of comment threads
   - Focus on high-traffic, organically popular videos
   - Collected ~300K raw comments
2. **Normalization**
   - Removed emojis, hashtags, and URLs
   - Stripped Tatweel/Kashida (مـــصـــر → مصر)
   - Collapsed repeated whitespace and newlines
   - Normalized Arabic punctuation
3. **Filtering**
   - **Length filter**: Removed comments with <3 words (spam/noise)
   - **Language detection**: Confirmed Arabic script majority
   - **Deduplication**: Hash-based removal of exact duplicates
   - **Quality threshold**: Minimum engagement score (like count ≥5)
   - **Bot detection**: Pattern-based removal of spam accounts
   - **Dialect classification**: Automatic labeling based on dialectal markers and video metadata
4. **Quality Validation**
   - Manual spot-checking of random samples (n=1000)
   - Automated profanity and toxic content filtering
   - Dialect verification for regional authenticity
5. **Export**
   - JSONL format for streaming compatibility
   - Metadata preservation for filtering/analysis

### Data Quality Metrics

* ✅ **Deduplication Rate**: ~45% duplicates removed
* ✅ **Bot Removal**: ~12% spam accounts filtered
* ✅ **Quality Score Range**: 5-5000+ likes
* ✅ **Manual Validation Accuracy**: 94% dialect correctness
* ✅ **Text Cleanliness**: <1% non-Arabic characters

## Considerations for Using the Data

### Dialectal Arabic Characteristics

Colloquial Arabic differs fundamentally from MSA:

* **Phonology**: Different pronunciation rules (e.g., ج, ق sounds vary)
* **Morphology**: Simplified verb conjugations and case systems
* **Lexicon**: Region-specific vocabulary and loanwords
* **Syntax**: More flexible word order and dropped pronouns
* **Orthography**: Inconsistent spelling conventions

### Recommended Training Approaches

1. **Fine-tune pretrained Arabic models** (e.g., AraGPT2, CAMeLBERT) rather than training from scratch
2. **Combine with MSA data** to maintain formal language understanding
3. **Use quality filtering** to focus on high-engagement content
4. **Consider domain adaptation** if targeting specific use cases (e.g., tech, entertainment)
5. **Augment with other dialect datasets** for broader coverage

### Code-Switching Handling

This dataset contains natural Arabic-English code-switching (e.g., "يعني basically كده"). If training a monolingual Arabic model, consider:

* Filtering or replacing English words
* Using bilingual tokenizers
* Training on code-switched data intentionally

### Ethical Considerations

* **Public Data**: All content sourced from publicly accessible YouTube comments
* **Privacy**: No personal information (names, emails, addresses) included
* **Anonymization**: Author usernames removed during processing
* **Bias Awareness**: Dataset reflects online youth culture and may not represent all demographics
* **Cultural Sensitivity**: Content filtered for extreme hate speech but may contain strong opinions
* **Intended Use**: Research and model training only; not for surveillance or profiling

## Citation

If you use this dataset in your research, please cite:

```bibtex
@dataset{arabic_dialect_corpus,
  title={Arabic Dialect Corpus (Egyptian & Saudi)},
  author={fr3on},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/fr3on/arabic-dialect-corpus},
  note={Natural colloquial Arabic from YouTube discussions}
}
```

## Contributing

We welcome contributions to expand this corpus!
You can help by:

### Data Contributions

- Submit PRs with data from other Arabic dialects (Levantine, Iraqi, Moroccan)
- Share preprocessing scripts for other platforms (Twitter, forums)
- Provide domain-specific corpora (medical, legal, technical Arabic)

### Quality Improvements

- Report mislabeled or low-quality examples
- Suggest improved filtering criteria
- Contribute manual dialect annotations

### How to Contribute

1. **Fork** the repository or dataset
2. **Process** your data following the existing JSONL schema:
   ```json
   {
     "text": "your_dialect_text",
     "label": "EG|SA|AR",
     "score": 0
   }
   ```
3. **Document** your data source and processing steps
4. **Submit** a pull request with a clear description

## Acknowledgments

* **Community**: YouTube creators and commenters for organic content
* **Tools**: Hugging Face Datasets, Python ecosystem
* **Inspiration**: CAMeL Lab, AraOpus, and other Arabic NLP initiatives

## Version History

* **v1.1.0** (2026-01-06): Expanded dataset
  * 350K+ entries
* **v1.0.0** (2026-01-05): Initial release
  * 150K+ entries
  * Egyptian and Saudi dialects

## License

This dataset is released under the **MIT License**. You are free to:

* ✅ Use for commercial and non-commercial purposes
* ✅ Modify and distribute
* ✅ Train models and publish results
* ✅ Sublicense

**Attribution**: Please cite this dataset in publications and model cards.

---

**Contact & Support**

* **Maintainer**: [fr3on](https://huggingface.co/fr3on)
* **Issues**: [Dataset Discussions](https://huggingface.co/datasets/fr3on/arabic-dialect-corpus/discussions)
* **Community**: Join us in the dataset community tab for questions and feedback

**Dataset Size**: 150K+ examples | **Format**: JSONL | **License**: MIT | **Labels**: EG (Egyptian), SA (Saudi), AR (General)
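
The Normalization and Filtering steps listed under Processing Pipeline can be approximated with the standard library alone. The sketch below is an illustration under stated assumptions, not the maintainer's actual pipeline: the function names (`normalize_comment`, `keep_comment`, `dedupe_exact`), the regular expressions, the emoji code-point ranges, and the 50% Arabic-letter threshold are all hypothetical. Only the 3-word minimum and the like count ≥5 threshold come from the card.

```python
import hashlib
import re

TATWEEL = "\u0640"  # Arabic elongation character (Kashida)
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
# Assumption: a rough set of emoji/pictograph ranges, not an exhaustive list
EMOJI_RE = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]")
ARABIC_RE = re.compile(r"[\u0600-\u06FF]")


def normalize_comment(text: str) -> str:
    """Strip URLs, hashtags, emojis and Tatweel, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = text.replace(TATWEEL, "")
    return re.sub(r"\s+", " ", text).strip()


def arabic_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if ARABIC_RE.match(ch))
    return arabic / len(letters)


def keep_comment(text: str, score: int,
                 min_words: int = 3, min_score: int = 5,
                 min_arabic: float = 0.5) -> bool:
    """Apply the length, engagement, and script-majority filters."""
    return (
        len(text.split()) >= min_words
        and score >= min_score
        and arabic_ratio(text) > min_arabic  # assumed meaning of "majority"
    )


def dedupe_exact(texts):
    """Hash-based removal of exact duplicates, keeping first occurrences."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

A contributor preparing data for another dialect could run `normalize_comment` first, then `keep_comment`, then `dedupe_exact`, so that duplicates are detected on the cleaned text rather than on the raw comment.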

Provider:

fr3on
