mudd

ModelScope community · Updated 2025-12-05 · Indexed 2025-07-12
Download link:
https://modelscope.cn/datasets/Misraj/mudd
Resource description:
# 🚀 Misraj Unstructured Data Dump (MUDD)

*A large-scale Arabic text dataset translated from SlimPajama-627B for pretraining Arabic language models*

## 📚 Dataset Summary

MUDD is a dataset of *4,758,338 rows* of unstructured, plain Arabic text. Each entry contains the Arabic text and retains the UUID of its original source document. The dataset is intended for pretraining large language models (LLMs) and for Arabic natural language processing (NLP) research.

## 🌟 Key Features

* 📏 **Size**: 4,758,338 rows of Arabic text
* 🗣️ **Language**: Arabic (translated from English)
* 📌 **Source**: Selected subset of [SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B)
* 🤖 **Translation Model**: [Mutarjim](https://arxiv.org/abs/2505.17894)
* 📝 **Format**: Plain text with UUID identifiers

## 🗃️ Dataset Details

### 🌐 Source Data

The foundation of MUDD is [SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B), a high-quality English text dataset containing:

* 📈 627 billion tokens
* ♻️ Deduplicated content
* 🌍 Diverse sources including web pages, Wikipedia, GitHub, and books

*Note*: MUDD is derived from a selected subset of SlimPajama-627B, not the complete dataset.

### 🔄 Translation Process

The Arabic content was generated using *Mutarjim*, an Arabic-English translation model built on the [Kuwain-1.5B](https://arxiv.org/abs/2504.15120) architecture. Training the translation model involved two stages:

1. 🛠️ *Pre-training*: On extensive monolingual Arabic and English corpora
2. 🎯 *Fine-tuning*: On high-quality, human-curated parallel sentence pairs for accurate Arabic translations

## 📂 Dataset Structure

```json
{
  "uuid": {
    "dtype": "string",
    "_type": "Value"
  },
  "plain_text": {
    "dtype": "string",
    "_type": "Value"
  }
}
```

## 💡 Usage

### 📥 Loading the Dataset

```python
from datasets import load_dataset

# Downloads the dataset and returns a DatasetDict keyed by split name
dataset = load_dataset("Misraj/mudd")
```

### 📋 Example Usage

```python
# Access the first example of the train split
example = dataset['train'][0]
print(f"UUID: {example['uuid']}")
print(f"Arabic Text: {example['plain_text']}")
```

## 🎯 Intended Use Cases

* 🤗 **Pretraining Arabic LLMs**: Large-scale Arabic text corpus for training new language models
* 🔍 **Arabic NLP Research**: Supporting research on Arabic language processing
* 🚦 **Downstream Applications**: Source for projects requiring extensive and diverse Arabic text data

## 📊 Dataset Statistics

| 📏 Metric         | 📌 Value                 |
| ----------------- | ------------------------ |
| Total Rows        | 4,758,338                |
| Language          | Arabic                   |
| Source            | SlimPajama-627B (subset) |
| Translation Model | Mutarjim                 |
| Format            | Plain text               |

## 📖 Citations

If you use this dataset, please cite:

```bibtex
@misc{misraj2025mudd,
  title        = {Misraj Unstructured Data Dump (MUDD)},
  author       = {Khalil Hennara and Muhammad Hreden and Mohamed Motaism Hamed and Zeina Aldallal and Sara Chrouf and Safwan AlModhayan and Ahmed Bustati},
  year         = {2025},
  publisher    = {MisrajAI},
  howpublished = {\url{https://huggingface.co/datasets/Misraj/mudd}}
}
```
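
Since MUDD has several million rows of machine-translated text, it can be convenient to stream it and sanity-check records rather than download everything first. The following is a minimal sketch, not part of the official dataset card: the helpers `is_valid_record`, `arabic_ratio`, `filter_records`, and `preview` are hypothetical names, the 0.5 threshold is an arbitrary assumption, and the Arabic check only covers the basic Arabic Unicode block (U+0600 to U+06FF). It relies on the two-field schema (`uuid`, `plain_text`) and the `train` split shown in the Usage section.

```python
from typing import Iterable, Iterator


def is_valid_record(record: dict) -> bool:
    """Check that a record has string-valued `uuid` and `plain_text` fields."""
    return (
        isinstance(record.get("uuid"), str)
        and isinstance(record.get("plain_text"), str)
    )


def arabic_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in U+0600..U+06FF."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


def filter_records(records: Iterable[dict], min_ratio: float = 0.5) -> Iterator[dict]:
    """Yield schema-valid records whose text is mostly Arabic script.

    `min_ratio` is an illustrative threshold, not a recommended value.
    """
    for record in records:
        if is_valid_record(record) and arabic_ratio(record["plain_text"]) >= min_ratio:
            yield record


def preview(n: int = 3) -> None:
    """Stream the train split and print the first `n` records that pass the filter.

    Requires network access and the `datasets` package.
    """
    from datasets import load_dataset

    # streaming=True returns an iterable dataset instead of downloading all rows
    stream = load_dataset("Misraj/mudd", split="train", streaming=True)
    for i, record in enumerate(filter_records(stream)):
        if i >= n:
            break
        print(record["uuid"], record["plain_text"][:80])
```

Calling `preview()` prints a few records without materializing the full dataset. The filter functions accept any iterable of dictionaries, so they can also be applied to a locally loaded split.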

Provider:
maas
Created:
2025-07-07