Prickly-Labs/1.9M-Egyptian-Corpus

Name: Prickly-Labs/1.9M-Egyptian-Corpus
Creator: Prickly-Labs
Published: 2026-04-16 18:59:57
License: 暂无描述

Hugging Face2026-04-16 更新2026-05-03 收录

下载链接：

https://hf-mirror.com/datasets/Prickly-Labs/1.9M-Egyptian-Corpus

下载链接

链接失效反馈

官方服务：

资源简介：

--- dataset_info: features: - name: text dtype: string splits: - name: train num_bytes: 1347332093 num_examples: 1922049 download_size: 647712671 dataset_size: 1347332093 configs: - config_name: default data_files: - split: train path: data/train-* task_categories: - text-generation language: - ar size_categories: - 1M<n<10M license: cc-by-sa-4.0 --- # 1.92M Egyptian Arabic Corpus 🇪🇬 — Prickly Labs A corpus of **1.92 million Egyptian Arabic samples** combining synthetic generation and real web data, built for continued pretraining and dialect adaptation of Arabic language models. Designed to reflect how Egyptians actually speak — not textbook MSA. > ⚠️ This dataset contains uncensored, informal Arabic including sarcasm, humor, slang, and profanity. Use with care for public-facing applications. --- ## 📌 Overview | Property | Value | |---|---| | **Samples** | 1,922,049 | | **Language** | Egyptian Arabic (عامية مصرية) | | **Script** | Arabic (no tashkeel, normalized alef) | | **Use Cases** | Continued pretraining (CPT), dialectal pre-SFT, informal Arabic modeling, ChatML finetuning | | **Sources** | Synthetic (Gemini API, LearnLM) + Reddit scrape (Egyptian subreddits) | | **License** | CC BY-SA 4.0 | --- ## 🗂️ Data Sources ### Synthetic — majority of corpus Generated using a custom multi-worker pipeline built on the Gemini API. The pipeline randomized topics, tones, and prompt structures across parallel instances to maximize variety. A significant portion of the synthetic data was generated as structured multi-turn dialogues and converted to ChatML format via regex extraction before being flattened into plain text for this release. ### Reddit scrape — minority of corpus Scraped from Egyptian Arabic subreddits over approximately one week of daily collection runs. Provides authentic, unscripted dialect and slang that synthetic generation cannot fully replicate. --- ## 🔧 Preprocessing This corpus went through several rounds of filtering and cleaning: - **Deduplication** applied per source batch and again after merging — Gemini outputs at scale produce significant repetition which was aggressively removed - **Language filtering** — any sample with more than ~40% non-Arabic characters was removed - **Character filtering** — samples containing non-Arabic scripts were dropped - **Emoji filtering** — emoji-only or emoji-heavy samples removed - **Tashkeel removal** — all Arabic diacritics stripped to reduce tokenizer vocabulary noise - **Alef normalization** — أ، إ، آ all normalized to ا for tokenizer consistency - **Manual fixes** — early batches required character-level corrections before scripted filtering was in place - **Loop filtering** — samples with repeating chunks (a known Gemini API artifact at scale) were detected and removed; approximately 20,000 samples removed in this pass Approximately 300,000 samples were lost across filtering passes. The final 1.92M represents what survived all stages. --- ## 📂 Format **File type:** `.parquet` **Structure:** ```json {"text": "..."} ``` ### Load with Hugging Face Datasets: ```python from datasets import load_dataset dataset = load_dataset("Prickly-Labs/1.9M-Egyptian-Corpus", split="train") ``` Supports streaming: ```python dataset = load_dataset("Prickly-Labs/1.9M-Egyptian-Corpus", split="train", streaming=True) ``` --- ## ⚠️ Known Limitations - **Topic distribution** — synthetic portion skews toward informational and explanatory content due to prompt design; conversational and narrative registers are present but less dominant - **Alef normalization** — أ / إ / آ collapsed to ا loses some orthographic disambiguation; models trained on this data will inherit that normalization - **No tashkeel** — not suitable as-is for tasks requiring fully vocalized Arabic - **Synthetic artifacts** — despite deduplication, synthetic data may carry subtle repetition patterns in phrasing or structure - **Mixed register** — Reddit portion introduces some MSA and mixed Arabic-dialect text alongside pure Egyptian dialect --- ## 🛠️ Built By **Ahmed Sherief** — Founder of Prickly Labs, designed and built the generation pipeline, scraping infrastructure, and all filtering and preprocessing scripts. These contributors ran generation workers on their own machines and helped accelerate the data collection process: - **Mohamed Hafez** - **Moaaz Saad** --- ## 📜 License **Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)** You are free to use, share, and adapt this dataset for any purpose including commercial use, provided that: - You give appropriate credit to Prickly Labs - Any models or datasets derived from this work are released under the same CC BY-SA 4.0 license → https://creativecommons.org/licenses/by-sa/4.0/ --- ## 🧪 About Prickly Labs Prickly Labs builds emotionally grounded, culturally fluent Arabic AI — crafted by Arabs, for Arabs. We believe language models should reflect how people truly speak, not just textbook MSA. → https://huggingface.co/Prickly-Labs

提供机构：

Prickly-Labs

5,000+

优质数据集

54 个

任务类型

进入经典数据集