mudd

Name: mudd
Creator: maas
Published: 2025-12-05 16:41:09
License: No description available

ModelScope community. Updated: 2025-12-05. Indexed: 2025-07-12.

Download link:

https://modelscope.cn/datasets/Misraj/mudd

Resource description:

# 🚀 Misraj Unstructured Data Dump (MUDD)

*A large-scale Arabic text dataset translated from SlimPajama-627B for pretraining Arabic language models*

## 📚 Dataset Summary

MUDD is a dataset of *4,758,338 rows* of unstructured, plain Arabic text. Each entry contains the Arabic text and retains the UUID of its original source document. The dataset is intended to provide high-quality Arabic content for pretraining large language models (LLMs) and for Arabic natural language processing (NLP) research.

## 🌟 Key Features

* 📏 **Size**: 4,758,338 rows of Arabic text
* 🗣️ **Language**: Arabic (translated from English)
* 📌 **Source**: Selected subset of [SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B)
* 🤖 **Translation Model**: [Mutarjim](https://arxiv.org/abs/2505.17894)
* 📝 **Format**: Plain text with UUID identifiers

## 🗃️ Dataset Details

### 🌐 Source Data

The foundation of MUDD is [SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B), a high-quality English text dataset containing:

* 📈 627 billion tokens
* ♻️ Deduplicated content
* 🌍 Diverse sources including web pages, Wikipedia, GitHub, and books

*Note*: MUDD is derived from a carefully selected subset of SlimPajama-627B, not the complete dataset.

### 🔄 Translation Process

The Arabic content was generated using *Mutarjim*, a high-performance Arabic-English translation model built on the [Kuwain-1.5B](https://arxiv.org/abs/2504.15120) architecture. The translation model was trained in two stages:

1. 🛠️ *Pre-training*: On extensive monolingual Arabic and English corpora
2. 🎯 *Fine-tuning*: Using high-quality, human-curated parallel sentence pairs for accurate Arabic translations

## 📂 Dataset Structure

```json
{
  "uuid": {
    "dtype": "string",
    "_type": "Value"
  },
  "plain_text": {
    "dtype": "string",
    "_type": "Value"
  }
}
```

## 💡 Usage

### 📥 Loading the Dataset

```python
from datasets import load_dataset

# Download (or load from cache) the dataset from the Hub
dataset = load_dataset("Misraj/mudd")
```

### 📋 Example Usage

```python
# Access the first example of the train split
example = dataset['train'][0]
print(f"UUID: {example['uuid']}")
print(f"Arabic Text: {example['plain_text']}")
```

## 🎯 Intended Use Cases

* 🤗 **Pretraining Arabic LLMs**: Large-scale, high-quality Arabic text corpus for training new language models
* 🔍 **Arabic NLP Research**: Supporting research initiatives focused on Arabic language processing
* 🚦 **Downstream Applications**: Reliable source for projects requiring extensive and diverse Arabic text data

## 📊 Dataset Statistics

| 📏 Metric         | 📌 Value                 |
| ----------------- | ------------------------ |
| Total Rows        | 4,758,338                |
| Language          | Arabic                   |
| Source            | SlimPajama-627B (subset) |
| Translation Model | Mutarjim                 |
| Format            | Plain text               |

## 📖 Citations

If you use this dataset, please cite:

```bibtex
@misc{misraj2025mudd,
  title        = {Misraj Unstructured Data Dump (MUDD)},
  author       = {Khalil Hennara and Muhammad Hreden and Mohamed Motaism Hamed and Zeina Aldallal and Sara Chrouf and Safwan AlModhayan and Ahmed Bustati},
  year         = {2025},
  publisher    = {MisrajAI},
  howpublished = {\url{https://huggingface.co/datasets/Misraj/mudd}}
}
```
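Because every row is described as a `uuid` string paired with a `plain_text` string, a light sanity pass before pretraining can be useful. The following is a minimal sketch, not part of the dataset's own tooling: the helper names (`is_valid_record`, `arabic_ratio`, `filter_records`), the 0.5 threshold, and the choice of the basic Arabic Unicode block as the definition of "Arabic" are all assumptions made for illustration.

```python
import re
from typing import Any, Dict, Iterable, Iterator

# Assumption: "Arabic" means the basic Arabic Unicode block (U+0600 to U+06FF).
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")


def is_valid_record(record: Dict[str, Any]) -> bool:
    """Check that a record matches the documented schema (two string fields)."""
    return isinstance(record.get("uuid"), str) and isinstance(
        record.get("plain_text"), str
    )


def arabic_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Arabic block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    arabic = sum(1 for ch in chars if ARABIC_CHAR.match(ch))
    return arabic / len(chars)


def filter_records(
    records: Iterable[Dict[str, Any]], min_ratio: float = 0.5
) -> Iterator[Dict[str, Any]]:
    """Yield schema-valid records whose text is at least min_ratio Arabic.

    min_ratio=0.5 is an illustrative default, not a recommended value.
    """
    for record in records:
        if is_valid_record(record) and arabic_ratio(record["plain_text"]) >= min_ratio:
            yield record
```

`filter_records` accepts any iterable of dictionaries, so it works on a loaded split such as `dataset['train']`. With the Hugging Face `datasets` library, passing `streaming=True` to `load_dataset` returns an iterable split that can be fed to the same function without downloading all rows first.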


Provider: maas

Created: 2025-07-07
