faresrafat/arastudy-arabic-wikipedia-cleaned
Source: Hugging Face | Updated: 2026-03-17 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/faresrafat/arastudy-arabic-wikipedia-cleaned
Resource description:
---
language: ar
license: apache-2.0
tags:
- arabic
- wikipedia
- cleaned
- language-modeling
---
# AraStudy Arabic Wikipedia (Cleaned)
Cleaned Arabic Wikipedia corpus used in AraStudy.
## Stats
- Total lines: 1,390,451
- Total words: 84,025,327
- Total chars: 488,667,555
- Split: 90/5/5
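
The figures above imply some simple averages, which the sketch below derives. The card does not say whether the character count includes spaces, and reading 90/5/5 as train/validation/test is an assumption here; `split_sizes` is a hypothetical helper, not part of AraStudy.

```python
# Corpus totals as reported in the Stats list.
total_lines = 1_390_451
total_words = 84_025_327
total_chars = 488_667_555

# Derived averages.
words_per_line = total_words / total_lines
chars_per_line = total_chars / total_lines
chars_per_word = total_chars / total_words


def split_sizes(n, fractions=(0.90, 0.05, 0.05)):
    """Return integer split sizes that sum to n (hypothetical helper).

    The order train/validation/test is assumed, not stated by the card.
    The last split absorbs the rounding remainder.
    """
    sizes = [int(n * f) for f in fractions[:-1]]
    sizes.append(n - sum(sizes))
    return sizes


if __name__ == "__main__":
    print(f"words per line: {words_per_line:.2f}")
    print(f"chars per line: {chars_per_line:.2f}")
    print(f"chars per word: {chars_per_word:.2f}")
    print("approximate split sizes:", split_sizes(total_lines))
```

The actual split sizes in the published dataset may differ by a few lines depending on how rounding was handled.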
## Cleaning Pipeline
- Removed HTML, URLs, Tatweel
- Arabic normalization (أ/إ/آ→ا, ى→ي, ة→ه)
- Diacritics removed
- Hash-based deduplication
- Min 10 words per line
- 60% Arabic ratio filter
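
The steps above can be sketched with the Python standard library as follows. This is a minimal illustration, not the AraStudy implementation: the function names, the regular expressions, the exact diacritic range, the choice of SHA-256 for hashing, the order of the steps, and the definition of "Arabic ratio" (Arabic letters divided by non-whitespace characters) are all assumptions.

```python
import hashlib
import re

HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"(?:https?://|www\.)\S+")
TATWEEL = "\u0640"
# Assumed diacritic set: harakat U+064B-U+0652 plus superscript alef U+0670.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Letter normalization from the card: alef variants -> alef,
# alef maqsura -> yeh, teh marbuta -> heh.
NORMALIZE = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above
    "\u0625": "\u0627",  # alef with hamza below
    "\u0622": "\u0627",  # alef with madda
    "\u0649": "\u064A",  # alef maqsura -> yeh
    "\u0629": "\u0647",  # teh marbuta -> heh
})
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")


def clean_line(text):
    """Strip HTML, URLs, Tatweel and diacritics, then normalize letters."""
    text = HTML_TAG.sub(" ", text)
    text = URL.sub(" ", text)
    text = text.replace(TATWEEL, "")
    text = DIACRITICS.sub("", text)
    text = text.translate(NORMALIZE)
    return " ".join(text.split())  # collapse runs of whitespace


def arabic_ratio(text):
    """Share of Arabic letters among non-whitespace characters (assumed definition)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if ARABIC_LETTER.match(c)) / len(chars)


def clean_corpus(lines, min_words=10, min_ratio=0.6):
    """Yield cleaned, filtered lines, skipping exact duplicates by hash."""
    seen = set()
    for raw in lines:
        line = clean_line(raw)
        if len(line.split()) < min_words:
            continue
        if arabic_ratio(line) < min_ratio:
            continue
        digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield line
```

Deduplicating after normalization means that lines differing only in diacritics or alef forms collapse into one entry; whether the original pipeline hashes before or after normalization is not stated in the card.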
## Source
Arabic Wikipedia (`wikimedia/wikipedia`, config `20231101.ar`).
The first 300,000 articles were taken and filtered down to 1.39M clean lines.
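
One way to read the first 300,000 articles from the source dump is sketched below, assuming the Hugging Face `datasets` library is installed. The helper names are hypothetical, and splitting articles on newlines is an assumption about how "lines" were produced; the card does not describe that step.

```python
from itertools import islice


def split_lines(text):
    """Split article text into stripped, non-empty lines (assumed line unit)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def iter_article_lines(n_articles=300_000):
    """Yield raw lines from the first n_articles of the Arabic Wikipedia dump."""
    # Imported lazily so this module loads even without the dependency.
    from datasets import load_dataset

    # Streaming avoids downloading the full dump before iterating.
    stream = load_dataset(
        "wikimedia/wikipedia", "20231101.ar", split="train", streaming=True
    )
    for article in islice(stream, n_articles):
        yield from split_lines(article["text"])
```

The yielded lines are unfiltered; they would still need to pass through the cleaning steps listed under Cleaning Pipeline.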
## Part of AraStudy
- GitHub: https://github.com/faresrafat3/arastudy
Provider:
faresrafat



