ArabicText-Large
Source: ModelScope community. Updated: 2025-12-04. Indexed: 2025-11-03.
Download link: https://modelscope.cn/datasets/AI-ModelScope/ArabicText-Large
Description:
# ArabicText-Large: High-Quality Arabic Corpus for LLM Training







## Dataset Summary
**ArabicText-Large** is a comprehensive, high-quality Arabic text corpus comprising **743,288 articles** with over **244 million words**, specifically curated for Large Language Model (LLM) training and fine-tuning. This dataset represents one of the largest publicly available Arabic text collections for machine learning research.
This corpus addresses the critical shortage of high-quality Arabic NLP resources through rigorous preprocessing, quality filtering, and validation protocols.
*Built by [RightNow AI](https://www.rightnowai.co/), the first GPU-native AI code editor.*
**Dataset DOI**: [https://doi.org/10.57967/hf/6685](https://doi.org/10.57967/hf/6685)
## Key Features
- **Massive Scale**: 743,288 articles with 244 million words
- **High Quality**: Multi-stage cleaning and quality filtering (average quality score: 58.3%)
- **LLM-Ready**: Optimized JSONL format for direct use in training pipelines
- **Diverse Content**: 9 topic categories (History, Science, Geography, Biography, Arts, Politics, Religion, Sports, and Other)
- **Clean Text**: Professional removal of artifacts, references, and formatting noise
- **Modern Standard Arabic**: 94.2% Arabic content purity
- **Rich Vocabulary**: 1.5 million unique words
- **Open License**: Apache 2.0 for commercial and research use
- **Persistent DOI**: Permanently citable via DOI 10.57967/hf/6685
## Dataset Statistics
| Metric | Value |
|--------|-------|
| **Total Articles** | 743,288 |
| **Total Words** | 244,153,780 |
| **Total Sentences** | 12,392,064 |
| **Unique Words** | 1,529,064 |
| **Average Words/Article** | 328.5 |
| **Average Sentences/Article** | 16.7 |
| **Average Words/Sentence** | 19.7 |
| **Vocabulary Richness** | 0.0063 |
| **Dataset Size** | 2.8 GB (compressed) |
| **Arabic Content Purity** | 94.2% |
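
The derived rows in this table can be recomputed from the four raw counts. The sketch below assumes that "Vocabulary Richness" is the type-token ratio (unique words divided by total words). The card does not define the metric, but the reported value of 0.0063 is consistent with that reading.

```python
# Raw counts copied from the statistics table.
TOTAL_ARTICLES = 743_288
TOTAL_WORDS = 244_153_780
TOTAL_SENTENCES = 12_392_064
UNIQUE_WORDS = 1_529_064

words_per_article = TOTAL_WORDS / TOTAL_ARTICLES
sentences_per_article = TOTAL_SENTENCES / TOTAL_ARTICLES
words_per_sentence = TOTAL_WORDS / TOTAL_SENTENCES
# Assumption: "vocabulary richness" means type-token ratio (types / tokens).
type_token_ratio = UNIQUE_WORDS / TOTAL_WORDS

print(f"Words per article:     {words_per_article:.1f}")
print(f"Sentences per article: {sentences_per_article:.1f}")
print(f"Words per sentence:    {words_per_sentence:.1f}")
print(f"Type-token ratio:      {type_token_ratio:.4f}")
```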
## Content Distribution
| Topic Category | Articles | Percentage |
|----------------|----------|------------|
| History & Culture | 156,090 | 21.0% |
| Science & Technology | 148,657 | 20.0% |
| Geography & Places | 133,792 | 18.0% |
| Biography | 111,493 | 15.0% |
| Arts & Literature | 89,194 | 12.0% |
| Politics & Society | 74,329 | 10.0% |
| Religion | 66,863 | 9.0% |
| Sports | 51,830 | 7.0% |
| Other Topics | 22,298 | 3.0% |
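
Each percentage in this table is the category's article count divided by the 743,288 total articles. The sketch below recomputes the shares. It also shows that the counts add up to more than the article total, which may mean that an article can be counted under more than one category. That is an interpretation, not something the card states.

```python
TOTAL_ARTICLES = 743_288

# Article counts copied from the content distribution table.
category_counts = {
    "History & Culture": 156_090,
    "Science & Technology": 148_657,
    "Geography & Places": 133_792,
    "Biography": 111_493,
    "Arts & Literature": 89_194,
    "Politics & Society": 74_329,
    "Religion": 66_863,
    "Sports": 51_830,
    "Other Topics": 22_298,
}

# Share of each category relative to the total number of articles.
shares = {name: 100 * n / TOTAL_ARTICLES for name, n in category_counts.items()}
count_sum = sum(category_counts.values())

for name, share in shares.items():
    print(f"{name:<22} {share:5.1f}%")
print(f"Sum of category counts: {count_sum:,} "
      f"({100 * count_sum / TOTAL_ARTICLES:.1f}% of all articles)")
```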
## Quality Assessment
| Quality Tier | Articles | Percentage |
|--------------|----------|------------|
| **Excellent** (≥80%) | 130,373 | 17.5% |
| **Good** (60-80%) | 306,526 | 41.2% |
| **Fair** (40-60%) | 306,389 | 41.2% |
**Average Quality Score**: 58.3%
**High-Quality Articles (≥60%)**: 58.7%
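
As a minimal sketch, the tier boundaries can be expressed as a small helper. The function name `quality_tier` and the handling of boundary values (a score of exactly 80 counts as Excellent, exactly 60 as Good) are assumptions made for illustration. The tier counts are copied from the table.

```python
def quality_tier(score: float) -> str:
    """Map a quality score (0-100) to the tier labels used in the table."""
    if score >= 80:
        return "Excellent"
    if score >= 60:
        return "Good"
    if score >= 40:
        return "Fair"
    # Articles scoring below 40 are filtered out of the released corpus.
    return "Below threshold"


tier_counts = {"Excellent": 130_373, "Good": 306_526, "Fair": 306_389}
total = sum(tier_counts.values())
high_quality_share = 100 * (tier_counts["Excellent"] + tier_counts["Good"]) / total

print(quality_tier(75.5))
print(f"Total articles: {total:,}")
print(f"High-quality share (>=60): {high_quality_share:.1f}%")
```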
## Usage
### Loading with Hugging Face Datasets
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("Jr23xd23/ArabicText-Large")
# Access the training split
train_data = dataset["train"]
print(f"Total articles: {len(train_data)}")
# Access a single article
article = train_data[0]
print(f"Title: {article['title']}")
print(f"Text: {article['text'][:200]}...")
```
### Loading with Python
```python
import json

# Read the JSONL file line by line; each line holds one JSON article record
articles = []
with open('data.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        article = json.loads(line)
        articles.append(article)

print(f"Loaded {len(articles)} articles")
```
### Data Format
Each entry in the dataset follows this structure:
```json
{
  "id": "unique_article_identifier",
  "title": "Article Title in Arabic",
  "text": "Full cleaned Arabic text content...",
  "url": "source_url",
  "metadata": {
    "language": "ar",
    "source": "Curated Sources",
    "cleaned": true,
    "processing_date": "2025-01-23T00:00:00",
    "quality_score": 75.5
  }
}
```
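
Because every record carries `metadata.quality_score`, a reader can keep only articles above a chosen score while streaming the file. The following is a sketch under stated assumptions: the helper name `iter_articles` and the two sample records are made up for illustration, and an in-memory buffer stands in for `data.jsonl`.

```python
import io
import json
from typing import Iterable, Iterator


def iter_articles(lines: Iterable[str], min_quality: float = 0.0) -> Iterator[dict]:
    """Yield records whose metadata.quality_score is at least min_quality."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        score = record.get("metadata", {}).get("quality_score", 0.0)
        if score >= min_quality:
            yield record


# Two made-up records in the documented shape, standing in for data.jsonl.
sample_records = [
    {"id": "a1", "title": "عنوان", "text": "نص", "url": "source_url",
     "metadata": {"language": "ar", "quality_score": 75.5}},
    {"id": "a2", "title": "عنوان", "text": "نص", "url": "source_url",
     "metadata": {"language": "ar", "quality_score": 42.0}},
]
sample_file = io.StringIO(
    "\n".join(json.dumps(r, ensure_ascii=False) for r in sample_records) + "\n"
)

kept = list(iter_articles(sample_file, min_quality=60.0))
print([record["id"] for record in kept])
```

With a real file, the same helper can be applied to an open file handle, for example `iter_articles(open('data.jsonl', encoding='utf-8'), min_quality=60.0)`, so the whole corpus never has to be held in memory.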
## Use Cases
### Language Model Pre-training
- **BERT-style models**: Masked language modeling, text understanding
- **GPT-style models**: Causal language modeling, text generation
- **T5-style models**: Encoder-decoder architectures, sequence-to-sequence tasks
- **Fine-tuning**: Domain adaptation for Arabic-specific applications
### Downstream NLP Tasks
- **Text Classification**: Sentiment analysis, topic classification, intent detection
- **Named Entity Recognition**: Entity extraction and tagging
- **Question Answering**: Reading comprehension, information retrieval
- **Text Summarization**: Abstractive and extractive summarization
- **Machine Translation**: Arabic-English, Arabic-French, multilingual translation
- **Information Extraction**: Relationship extraction, knowledge graph construction
### Research Applications
- Arabic linguistics and computational morphology
- Cross-lingual transfer learning
- Multilingual model development
- Low-resource language processing research
- Comparative studies of Semitic languages
## Data Processing Pipeline
The corpus is produced by a seven-stage processing pipeline:
1. **Source Collection**: Curated from reliable, peer-reviewed sources
2. **Artifact Removal**: Eliminated references, citations, and navigation elements
3. **Text Normalization**: Arabic-specific normalization (diacritics, punctuation, whitespace)
4. **Quality Filtering**: Minimum 70% Arabic content, length constraints
5. **Quality Scoring**: Multi-dimensional assessment (structure, linguistics, coherence)
6. **Deduplication**: Hash-based exact matching + MinHash LSH for near-duplicate removal
7. **Validation**: Format verification, encoding checks, statistical validation
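
The deduplication step (stage 6) can be illustrated with a small pure-Python sketch. This is not the project's actual implementation: the shingle size, the number of hash functions, the 0.8 threshold, and all function names are assumptions. It compares MinHash signatures pairwise, whereas LSH banding would be used at corpus scale to avoid comparing every pair.

```python
import hashlib


def exact_key(text: str) -> str:
    """Hash of the whitespace-normalized text, used for exact-duplicate checks."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word shingles (the whole text if it has at most k words)."""
    words = text.split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text: str, num_hashes: int = 64, k: int = 3) -> list:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    items = shingles(text, k)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode("utf-8"), digest_size=8).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def deduplicate(texts: list, threshold: float = 0.8) -> list:
    """Return indices of texts kept after exact and near-duplicate removal."""
    seen_keys = set()
    kept_signatures = []
    kept_indices = []
    for index, text in enumerate(texts):
        key = exact_key(text)
        if key in seen_keys:
            continue  # exact duplicate
        signature = minhash_signature(text)
        if any(estimated_jaccard(signature, s) >= threshold for s in kept_signatures):
            continue  # near duplicate of a text already kept
        seen_keys.add(key)
        kept_signatures.append(signature)
        kept_indices.append(index)
    return kept_indices


base = " ".join(f"word{i}" for i in range(40))
docs = [
    base,                                      # original
    base.replace(" ", "  "),                   # exact duplicate up to whitespace
    base + " extra",                           # near duplicate (one word appended)
    " ".join(f"other{i}" for i in range(40)),  # unrelated text
]
print(deduplicate(docs))
```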
### Quality Criteria
Articles are retained only if they meet all criteria:
- Minimum 100 characters, maximum 50,000 characters
- At least 70% Arabic characters
- Minimum 3 sentences for substantive content
- Quality score ≥40% on multi-dimensional assessment
- No stub indicators (e.g., "بحاجة للتوسيع")
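
The criteria above can be read as a single predicate. The sketch below is one possible interpretation, not the authors' code: it assumes the Arabic share is measured over non-whitespace characters in the Unicode block U+0600 to U+06FF, and that sentences are split on `.`, `!`, `?`, and `؟`. The names `arabic_ratio`, `count_sentences`, and `passes_criteria` are hypothetical.

```python
import re

ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")   # assumption: basic Arabic block only
SENTENCE_END = re.compile(r"[.!?؟]+")          # assumption: simple punctuation split
STUB_MARKERS = ("بحاجة للتوسيع",)              # stub indicator quoted in the card


def arabic_ratio(text: str) -> float:
    """Share of non-whitespace characters that fall in the Arabic block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if ARABIC_CHAR.match(c)) / len(chars)


def count_sentences(text: str) -> int:
    """Number of non-empty segments between sentence-ending punctuation."""
    return len([s for s in SENTENCE_END.split(text) if s.strip()])


def passes_criteria(text: str, quality_score: float) -> bool:
    """Apply all five retention criteria; an article must meet every one."""
    return (
        100 <= len(text) <= 50_000
        and arabic_ratio(text) >= 0.70
        and count_sentences(text) >= 3
        and quality_score >= 40.0
        and not any(marker in text for marker in STUB_MARKERS)
    )


sentence = "هذه جملة عربية قصيرة للاختبار"
article = ". ".join([sentence] * 5) + "."

print(passes_criteria(article, quality_score=75.5))
print(passes_criteria(article, quality_score=30.0))
print(passes_criteria(article + " بحاجة للتوسيع", quality_score=75.5))
```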
## Dataset Metrics
### Length Distributions
**Article Lengths:**
- Minimum: 50 words
- Maximum: 20,757 words
- Median: 106 words
- Mean: 328.5 words
- Standard Deviation: 584.2 words
**Sentence Lengths:**
- Minimum: 1 word
- Maximum: 247 words
- Median: 16 words
- Mean: 19.7 words
- Standard Deviation: 12.3 words
**Word Lengths:**
- Minimum: 1 character
- Maximum: 42 characters
- Median: 4 characters
- Mean: 4.9 characters
- Standard Deviation: 2.8 characters
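
For article lengths, the reported median (106 words) is far below the mean (328.5 words), which indicates a right-skewed distribution: many short articles and a small number of very long ones. The sketch below shows how such summary statistics can be computed with the standard library. The sample lengths are made up to illustrate the skew, and the use of the sample standard deviation is an assumption, since the card does not say which form it reports.

```python
import statistics

# Hypothetical article lengths in words, chosen only to illustrate the skew.
article_lengths = [50, 80, 106, 150, 2000]

summary = {
    "min": min(article_lengths),
    "max": max(article_lengths),
    "median": statistics.median(article_lengths),
    "mean": statistics.mean(article_lengths),
    # Assumption: sample standard deviation (n - 1 in the denominator).
    "stdev": statistics.stdev(article_lengths),
}

for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```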
### Vocabulary Statistics
- **Total Unique Words**: 1,529,064
- **Vocabulary Richness**: 0.0063
- **Follows Zipf's Law**: Yes (natural language distribution)
**Most Frequent Words:**
| Rank | Word (Arabic) | Translation | Frequency | Percentage |
|------|---------------|-------------|-----------|------------|
| 1 | في | in | 9,778,012 | 4.01% |
| 2 | من | from | 7,346,952 | 3.01% |
| 3 | على | on | 3,324,220 | 1.36% |
| 4 | إلى | to | 2,453,720 | 1.01% |
| 5 | أن | that | 1,595,356 | 0.65% |
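
A frequency table like this one can be produced with a simple counter. The sketch below assumes whitespace tokenization, which may differ from the tokenizer used to build the published statistics. The function name `word_frequencies` and the sample sentence are illustrative only.

```python
from collections import Counter


def word_frequencies(texts, top_n=5):
    """Return the top_n words as (word, count, percent) tuples, plus the token total."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())  # assumption: whitespace tokenization
    total = sum(counts.values())
    top = [(word, count, 100 * count / total)
           for word, count in counts.most_common(top_n)]
    return top, total


sample_text = "ذهب الولد في الصباح إلى المدرسة في المدينة وعاد من المدرسة في المساء"
top, total = word_frequencies([sample_text], top_n=3)

for rank, (word, count, percent) in enumerate(top, start=1):
    print(f"{rank} | {word} | {count} | {percent:.2f}%")
print(f"Total tokens: {total}")
```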
## Technical Specifications
- **Format**: JSONL (JSON Lines)
- **Encoding**: UTF-8
- **Language**: Modern Standard Arabic (ar)
- **Total Size**: 2.8 GB (compressed)
- **Processing Date**: January 2025
- **License**: Apache 2.0
- **Python Compatibility**: 3.7+
- **DOI**: 10.57967/hf/6685
## Comparison with Other Arabic Datasets
| Dataset | Words | Articles | Domain | Quality | Year | License |
|---------|-------|----------|--------|---------|------|---------|
| Arabic Gigaword | 848M | N/A | News | Moderate | 2011 | LDC |
| AraBERT Corpus | 70M | N/A | Mixed | Good | 2020 | MIT |
| OSCAR-Arabic | 22B | N/A | Web | Variable | 2019 | CC0 |
| mC4-Arabic | 42B | N/A | Web | Variable | 2021 | ODC-BY |
| **ArabicText-Large** | **244M** | **743K** | **Encyclopedia** | **High** | **2025** | **Apache 2.0** |
## Limitations
- **Dialectal Coverage**: Primarily Modern Standard Arabic (MSA); limited dialectal variations
- **Domain Bias**: Encyclopedic content may not represent colloquial or conversational Arabic
- **Temporal Coverage**: Content reflects knowledge up to dataset collection date (January 2025)
- **Size Trade-off**: Smaller than billion-word web corpora but prioritizes quality over quantity
## Future Enhancements
Planned improvements include:
- Dialectal Arabic expansion (Egyptian, Levantine, Gulf, Maghrebi)
- Domain diversification (literature, technical documents, news, social media)
- Parallel corpus creation (Arabic-English alignments)
- Linguistic annotations (POS tags, NER, dependency parsing)
- Regular updates with new content and quality improvements
## License
This dataset is released under the **Apache License 2.0**.
```
Copyright 2025 Jaber Jaber
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```
## Citation
If you use this dataset in your research, please cite:
```bibtex
@misc{jaber_2025,
  author    = {Jaber, Jaber},
  title     = {ArabicText-Large: A High-Quality 244-Million-Word Corpus for Arabic Language Model Training},
  year      = 2025,
  url       = {https://huggingface.co/datasets/Jr23xd23/ArabicText-Large},
  doi       = {10.57967/hf/6685},
  publisher = {Hugging Face}
}
```
**Research Paper:**
```bibtex
@article{jaber2025arabictext,
  title={ArabicText-Large: A High-Quality 244-Million-Word Corpus for Arabic Language Model Training},
  author={Jaber, Jaber},
  journal={Journal of Open Humanities Data},
  year={2025},
  doi={10.57967/hf/6685},
  url={https://huggingface.co/datasets/Jr23xd23/ArabicText-Large}
}
```
## Contributing
We welcome community contributions:
- **Bug Reports**: Report data quality issues or inconsistencies
- **Feature Requests**: Suggest dataset improvements or extensions
- **Pull Requests**: Contribute preprocessing enhancements or tools
- **Feedback**: Share your usage experience and research outcomes
## Contact
For questions, collaborations, or research inquiries:
- **Author**: Jaber Jaber
- **Organization**: RightNow AI
- **Email**: jaber@rightnowai.co
- **Website**: https://www.rightnowai.co
## Acknowledgments
We extend our gratitude to:
- The Arabic NLP research community for valuable feedback and insights
- Open-source contributors for tools and frameworks that made this work possible
- Researchers and practitioners using this dataset to advance Arabic language technologies
---
- **Dataset Homepage**: [ArabicText-Large on Hugging Face](https://huggingface.co/datasets/Jr23xd23/ArabicText-Large)
- **DOI**: [https://doi.org/10.57967/hf/6685](https://doi.org/10.57967/hf/6685)
- **License**: Apache 2.0
- **Author**: Jaber Jaber
- **Year**: 2025
*Advancing Arabic NLP research and development*
Provider: maas
Created: 2025-10-09



