mudd
ModelScope community · Updated 2025-12-05 · Indexed 2025-07-12
Download link:
https://modelscope.cn/datasets/Misraj/mudd
Resource description:
# 🚀 Misraj Unstructured Data Dump (MUDD)
*A large-scale Arabic text dataset translated from SlimPajama-627B for pretraining Arabic language models*
## 📚 Dataset Summary
MUDD is a substantial dataset comprising *4,758,338 rows* of unstructured, plain Arabic text. Each entry includes Arabic text and retains its UUID from the original source. This dataset provides high-quality Arabic content specifically designed for pretraining large language models (LLMs) and advancing Arabic natural language processing (NLP) research.
## 🌟 Key Features
* 📏 **Size**: 4,758,338 rows of Arabic text
* 🗣️ **Language**: Arabic (translated from English)
* 📌 **Source**: Selected subset of [SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B)
* 🤖 **Translation Model**: [Mutarjim](https://arxiv.org/abs/2505.17894)
* 📝 **Format**: Plain text with UUID identifiers
## 🗃️ Dataset Details
### 🌐 Source Data
The foundation of MUDD is [SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B), a high-quality English text dataset containing:
* 📈 627 billion tokens
* ♻️ Deduplicated content
* 🌍 Diverse sources including web pages, Wikipedia, GitHub, and books
*Note*: MUDD is derived from a carefully selected subset of SlimPajama-627B, not the complete dataset.
### 🔄 Translation Process
The Arabic content was generated using *Mutarjim*, an Arabic-English translation model built on the [Kuwain-1.5B](https://arxiv.org/abs/2504.15120) architecture. Training the translation model involved two stages:
1. 🛠️ *Pre-training*: On extensive monolingual Arabic and English corpora
2. 🎯 *Fine-tuning*: Using high-quality, human-curated parallel sentence pairs for accurate Arabic translations
## 📂 Dataset Structure
```json
{
"uuid": {
"dtype": "string",
"_type": "Value"
},
"plain_text": {
"dtype": "string",
"_type": "Value"
}
}
```
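As a minimal sketch of what this schema implies for a single row, the check below verifies that a record has exactly the two string fields listed above. The helper name `is_valid_record` and the sample values are illustrative assumptions; they are not part of the dataset or of the `datasets` library.

```python
# Hypothetical helper: checks one row against the two-field schema above.
EXPECTED_FIELDS = {"uuid": str, "plain_text": str}


def is_valid_record(record):
    """Return True if `record` has exactly the schema's fields, each a string."""
    if set(record) != set(EXPECTED_FIELDS):
        return False
    return all(
        isinstance(record[name], kind) for name, kind in EXPECTED_FIELDS.items()
    )


# Made-up sample row for illustration only (not taken from the dataset).
sample = {"uuid": "00000000-0000-0000-0000-000000000000", "plain_text": "نص عربي"}
print(is_valid_record(sample))
```

A check like this can be useful before feeding rows into a tokenization pipeline, where a missing or non-string field would otherwise fail later and less clearly.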
## 💡 Usage
### 📥 Loading the Dataset
```python
from datasets import load_dataset
dataset = load_dataset("Misraj/mudd")
```
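With roughly 4.76 million rows, downloading the whole dataset just to inspect a few examples may be unnecessary. The sketch below assumes the `streaming=True` option of `load_dataset` and the `train` split that appears in the usage example; `stream_mudd` and `take` are hypothetical helper names introduced here for illustration.

```python
from itertools import islice


def stream_mudd(split="train"):
    """Lazily iterate over MUDD rows instead of downloading everything first."""
    # Imported inside the function so `take` stays usable without `datasets`.
    from datasets import load_dataset

    return load_dataset("Misraj/mudd", split=split, streaming=True)


def take(rows, n):
    """Return the first `n` items of any iterable as a list."""
    return list(islice(rows, n))


# Usage (needs network access and the `datasets` package):
#   for row in take(stream_mudd(), 3):
#       print(row["uuid"], row["plain_text"][:80])
```

Streaming trades random access for a much smaller up-front download, so it suits quick inspection more than repeated indexed lookups.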
### 📋 Example Usage
```python
# Access the first example
example = dataset['train'][0]
print(f"UUID: {example['uuid']}")
print(f"Arabic Text: {example['plain_text']}")
```
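Because the text is machine-translated from sources that include web pages and GitHub, some rows may still contain non-Arabic fragments. The following is a hedged sanity check, not part of the dataset's own tooling: `arabic_ratio` is a hypothetical helper that measures the share of letters in the basic Arabic Unicode block (U+0600 to U+06FF).

```python
def arabic_ratio(text):
    """Share of alphabetic characters that lie in the basic Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


# Made-up strings for illustration only (not taken from the dataset).
print(arabic_ratio("مرحبا"))
print(arabic_ratio("ab مر"))
```

Applied to `example['plain_text']`, a low ratio could flag rows worth inspecting by hand; any threshold would be a judgment call and is not specified by the dataset authors.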
## 🎯 Intended Use Cases
* 🤗 **Pretraining Arabic LLMs**: Large-scale, high-quality Arabic text corpus for training new language models
* 🔍 **Arabic NLP Research**: Supporting research initiatives focused on Arabic language processing
* 🚦 **Downstream Applications**: Reliable source for projects requiring extensive and diverse Arabic text data
## 📊 Dataset Statistics
| 📏 Metric | 📌 Value |
| ----------------- | ------------------------ |
| Total Rows | 4,758,338 |
| Language | Arabic |
| Source | SlimPajama-627B (subset) |
| Translation Model | Mutarjim |
| Format | Plain text |
## 📖 Citations
If you use this dataset, please cite:
```bibtex
@misc{misraj2025mudd,
title = {Misraj Unstructured Data Dump (MUDD)},
author = {Khalil Hennara and Muhammad Hreden and Mohamed Motaism Hamed and Zeina Aldallal and Sara Chrouf and Safwan AlModhayan and Ahmed Bustati},
year = {2025},
publisher = {MisrajAI},
howpublished = {\url{https://huggingface.co/datasets/Misraj/mudd}}
}
```
Provider: maas
Created: 2025-07-07



