Prickly-Labs/1.9M-Egyptian-Corpus
收藏Hugging Face2026-04-16 更新2026-05-03 收录
下载链接:
https://hf-mirror.com/datasets/Prickly-Labs/1.9M-Egyptian-Corpus
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: text
dtype: string
splits:
- name: train
num_bytes: 1347332093
num_examples: 1922049
download_size: 647712671
dataset_size: 1347332093
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
task_categories:
- text-generation
language:
- ar
size_categories:
- 1M<n<10M
license: cc-by-sa-4.0
---
# 1.92M Egyptian Arabic Corpus 🇪🇬 — Prickly Labs
A corpus of **1.92 million Egyptian Arabic samples** combining synthetic generation and real web data, built for continued pretraining and dialect adaptation of Arabic language models. Designed to reflect how Egyptians actually speak — not textbook MSA.
> ⚠️ This dataset contains uncensored, informal Arabic including sarcasm, humor, slang, and profanity. Use with care for public-facing applications.
---
## 📌 Overview
| Property | Value |
|---|---|
| **Samples** | 1,922,049 |
| **Language** | Egyptian Arabic (عامية مصرية) |
| **Script** | Arabic (no tashkeel, normalized alef) |
| **Use Cases** | Continued pretraining (CPT), dialectal pre-SFT, informal Arabic modeling, ChatML finetuning |
| **Sources** | Synthetic (Gemini API, LearnLM) + Reddit scrape (Egyptian subreddits) |
| **License** | CC BY-SA 4.0 |
---
## 🗂️ Data Sources
### Synthetic — majority of corpus
Generated using a custom multi-worker pipeline built on the Gemini API. The pipeline randomized topics, tones, and prompt structures across parallel instances to maximize variety. A significant portion of the synthetic data was generated as structured multi-turn dialogues and converted to ChatML format via regex extraction before being flattened into plain text for this release.
### Reddit scrape — minority of corpus
Scraped from Egyptian Arabic subreddits over approximately one week of daily collection runs. Provides authentic, unscripted dialect and slang that synthetic generation cannot fully replicate.
---
## 🔧 Preprocessing
This corpus went through several rounds of filtering and cleaning:
- **Deduplication** applied per source batch and again after merging — Gemini outputs at scale produce significant repetition which was aggressively removed
- **Language filtering** — any sample with more than ~40% non-Arabic characters was removed
- **Character filtering** — samples containing non-Arabic scripts were dropped
- **Emoji filtering** — emoji-only or emoji-heavy samples removed
- **Tashkeel removal** — all Arabic diacritics stripped to reduce tokenizer vocabulary noise
- **Alef normalization** — أ، إ، آ all normalized to ا for tokenizer consistency
- **Manual fixes** — early batches required character-level corrections before scripted filtering was in place
- **Loop filtering** — samples with repeating chunks (a known Gemini API artifact at scale) were detected and removed; approximately 20,000 samples removed in this pass
Approximately 300,000 samples were lost across filtering passes. The final 1.92M represents what survived all stages.
---
## 📂 Format
**File type:** `.parquet`
**Structure:**
```json
{"text": "..."}
```
### Load with Hugging Face Datasets:
```python
from datasets import load_dataset
dataset = load_dataset("Prickly-Labs/1.9M-Egyptian-Corpus", split="train")
```
Supports streaming:
```python
dataset = load_dataset("Prickly-Labs/1.9M-Egyptian-Corpus", split="train", streaming=True)
```
---
## ⚠️ Known Limitations
- **Topic distribution** — synthetic portion skews toward informational and explanatory content due to prompt design; conversational and narrative registers are present but less dominant
- **Alef normalization** — أ / إ / آ collapsed to ا loses some orthographic disambiguation; models trained on this data will inherit that normalization
- **No tashkeel** — not suitable as-is for tasks requiring fully vocalized Arabic
- **Synthetic artifacts** — despite deduplication, synthetic data may carry subtle repetition patterns in phrasing or structure
- **Mixed register** — Reddit portion introduces some MSA and mixed Arabic-dialect text alongside pure Egyptian dialect
---
## 🛠️ Built By
**Ahmed Sherief** — Founder of Prickly Labs, designed and built the generation pipeline, scraping infrastructure, and all filtering and preprocessing scripts.
These contributors ran generation workers on their own machines and helped accelerate the data collection process:
- **Mohamed Hafez**
- **Moaaz Saad**
---
## 📜 License
**Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)**
You are free to use, share, and adapt this dataset for any purpose including commercial use, provided that:
- You give appropriate credit to Prickly Labs
- Any models or datasets derived from this work are released under the same CC BY-SA 4.0 license
→ https://creativecommons.org/licenses/by-sa/4.0/
---
## 🧪 About Prickly Labs
Prickly Labs builds emotionally grounded, culturally fluent Arabic AI — crafted by Arabs, for Arabs.
We believe language models should reflect how people truly speak, not just textbook MSA.
→ https://huggingface.co/Prickly-Labs
提供机构:
Prickly-Labs



