msdd
ModelScope community · Updated 2025-12-05 · Indexed 2025-07-12
Download link:
https://modelscope.cn/datasets/Misraj/msdd
Resource description:
# **📚 Misraj Structured Data Dump (MSDD)**
Misraj Structured Data Dump (MSDD) is a large-scale Arabic multimodal dataset created using our **WASM pipeline**. It is extracted and filtered from the [Common Crawl](https://commoncrawl.org/) dumps and uniquely preserves the structural integrity of web content by providing markdown output. This dataset aims to address the lack of high-quality, structured multimodal data for Arabic and accelerate research in large language and multimodal models.
## **📌 Dataset Summary**
- **Source:** Subset from multiple Common Crawl dumps, processed with the WASM pipeline.
- **Documents:** 23 million documents.
- **Timeframe:**
  - 2024 Dump: Dump 10
  - 2025 Dump: Dump 13
- **Languages:** Primarily Arabic (MSA and dialects).
- **Format:** Multimodal format with interleaved text and images in Markdown.
- **Domain Variety:** General web content.
## **💡 Usage**
### **📥 Loading the Dataset**
```python
from datasets import load_dataset
dataset = load_dataset("Misraj/msdd")
```
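With roughly 23 million documents, downloading everything before looking at a single record may be impractical. The following is a minimal sketch, assuming the repository can be read with the streaming mode of the Hugging Face `datasets` library (the card does not confirm this). The helper names `open_msdd_stream` and `head` are hypothetical and used only for illustration.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def open_msdd_stream() -> Iterable[dict[str, Any]]:
    """Open the train split lazily instead of downloading it in full.

    Assumption: the repository supports streaming mode.
    """
    # Imported here so that `head` below stays usable without `datasets`.
    from datasets import load_dataset

    return load_dataset("Misraj/msdd", split="train", streaming=True)


def head(records: Iterable[dict[str, Any]], n: int) -> Iterator[dict[str, Any]]:
    """Yield at most the first `n` records of any iterable of records."""
    return islice(records, n)
```

For example, `for example in head(open_msdd_stream(), 3): print(example["text"][:200])` would print the start of the first three documents without materializing the rest.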
### **📋 Example Usage**
```python
# Access the first example
example = dataset['train'][0]
print(f"Text: {example['text']}")
print(f"Images: {example['images']}")
print(f"Captions: {example['image_caption']}")
```
## **⚙️ The WASM Processing Pipeline**
The performance of large language models (LLMs) and large multimodal models (LMMs) depends heavily on the quality and scale of their pre-training datasets. For Arabic, the lack of high-quality multimodal datasets that preserve document structure has limited progress. Our **WASM pipeline** was developed to address this gap by processing Common Crawl and generating a structured, markdown-based multimodal dataset. The pipeline is designed to preserve the structural integrity of web content while maintaining flexibility for both text-only and multimodal pre-training scenarios.
The core of the WASM pipeline involves careful filtering at both the paragraph and document level.
### **✅ _Paragraph-Level Filtering_**
Each paragraph in the corpus undergoes the following checks:
- **Character Deduplication:** Removal of repeated characters beyond a threshold.
- **Word Repetition Ratio:** Filtering paragraphs with excessive word repetitions.
- **Special Character Ratio:** Filtering based on the proportion of non-standard characters.
- **Language Identification:** Only Arabic paragraphs are retained.
- **Perplexity Scoring:** Content scored using an in-house KenLM-based model trained on Wikipedia-like pages, Arabic Twitter data, and dialectal text (e.g., Lahjawi), to remove low-quality text.
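
The heuristic checks in this list can be sketched in a few lines. This is an illustration only, not the WASM pipeline's code: the function names and every threshold are assumptions, since the card does not publish the actual values. Language identification and perplexity scoring are left out because they depend on external models.

```python
import re
import unicodedata
from collections import Counter

# All thresholds are illustrative assumptions, not the pipeline's values.
MAX_CHAR_RUN = 3           # keep at most this many identical characters in a row
MAX_WORD_REPETITION = 0.5  # reject if more than half of the words are repeats
MAX_SPECIAL_RATIO = 0.3    # reject if over 30% of characters are "special"


def collapse_repeated_chars(text: str, max_run: int = MAX_CHAR_RUN) -> str:
    """Shorten any run of one character longer than `max_run` to `max_run`."""
    pattern = re.compile(r"(.)\1{%d,}" % max_run, flags=re.DOTALL)
    return pattern.sub(lambda m: m.group(1) * max_run, text)


def word_repetition_ratio(text: str) -> float:
    """Fraction of words that repeat an earlier word in the same text."""
    words = text.split()
    if not words:
        return 0.0
    repeated = sum(count - 1 for count in Counter(words).values())
    return repeated / len(words)


def special_char_ratio(text: str) -> float:
    """Fraction of non-space characters that are not letters, digits, or marks.

    Combining marks count as regular characters so that Arabic diacritics
    are not treated as noise.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    special = sum(1 for ch in chars if unicodedata.category(ch)[0] not in "LNM")
    return special / len(chars)


def keep_paragraph(text: str) -> bool:
    """Apply the three heuristic checks to one paragraph."""
    cleaned = collapse_repeated_chars(text)
    return (
        bool(cleaned.strip())
        and word_repetition_ratio(cleaned) <= MAX_WORD_REPETITION
        and special_char_ratio(cleaned) <= MAX_SPECIAL_RATIO
    )
```

Collapsing character runs before computing the ratios means elongated words (common in informal Arabic text) are normalized rather than discarded outright; whether the actual pipeline orders the steps this way is not stated.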
### **✅ _Document-Level Filtering_**
Each full document must pass:
- **Word Repetition Ratio:** Similar to paragraph level, but with different thresholds for full documents.
- **Special Character Ratio:** Ensures no document is dominated by symbols, code snippets, or garbage text.
- **Language Identification:** Verifies the document is primarily Arabic.
- **Perplexity Score:** Documents are filtered based on perplexity thresholds to maintain fluent, natural text.
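
A document-level gate for the language and perplexity checks might look like the sketch below. It is an assumption-laden illustration: the Arabic-script share is a crude stand-in for a real language identifier, the perplexity scorer is passed in as a callable because the in-house KenLM model is not public, and both thresholds and all function names are invented for the example.

```python
from typing import Callable

# Illustrative assumptions; the card does not publish the real thresholds.
MIN_ARABIC_SHARE = 0.6
MAX_PERPLEXITY = 1000.0


def arabic_letter_share(text: str) -> float:
    """Share of letters that fall in the main Arabic Unicode block.

    A crude stand-in for a real language identifier.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


def keep_document(
    text: str,
    perplexity: Callable[[str], float],
    min_arabic_share: float = MIN_ARABIC_SHARE,
    max_perplexity: float = MAX_PERPLEXITY,
) -> bool:
    """Keep a document only if it looks Arabic and scores as fluent."""
    if arabic_letter_share(text) < min_arabic_share:
        return False
    # Lower perplexity means the language model finds the text more natural.
    return perplexity(text) <= max_perplexity
```

Any function mapping a string to a perplexity value can be plugged in, for instance the `perplexity` method of a `kenlm.Model` if the `kenlm` Python bindings and a suitable Arabic model are available. Running the cheap script check first avoids scoring documents that would be rejected anyway.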
## **📂 Dataset Structure**
The dataset is structured with three main columns to support multimodal tasks. The text is interleaved with image placeholders, allowing for rich text-and-image documents.
- **text**: A string containing the textual content. The special token `<image>` is used to denote the position where an image should be inserted.
- **images**: A list of image URLs (strings). These images correspond sequentially to the `<image>` tokens in the text field.
- **image_caption**: A list of strings, where each string is a caption for the corresponding image in the `images` list. If an image does not have a caption, the list will contain an empty string `''` at that position.
The dataset has the following features:
```json
{
  "text": {
    "dtype": "string",
    "_type": "Value"
  },
  "images": {
    "feature": {
      "dtype": "string",
      "_type": "Value"
    },
    "_type": "Sequence"
  },
  "image_caption": {
    "feature": {
      "dtype": "string",
      "_type": "Value"
    },
    "_type": "Sequence"
  }
}
```
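Because the `<image>` tokens, the `images` list, and the `image_caption` list are aligned by position, a record can be turned back into ordinary Markdown by substituting each token in order. The sketch below assumes the three counts always match, as the column descriptions imply, and raises an error otherwise. `render_markdown` is a hypothetical helper, and the sample record is made up.

```python
def render_markdown(example: dict) -> str:
    """Replace each `<image>` token with a Markdown image link, in order."""
    parts = example["text"].split("<image>")
    images = example["images"]
    captions = example["image_caption"]
    # Splitting on the token yields one more text part than there are tokens.
    if len(parts) - 1 != len(images) or len(images) != len(captions):
        raise ValueError("image tokens, URLs, and captions are not aligned")
    out = [parts[0]]
    for url, caption, tail in zip(images, captions, parts[1:]):
        out.append(f"![{caption}]({url})")
        out.append(tail)
    return "".join(out)


# Made-up record for illustration; the second image has no caption.
sample = {
    "text": "Intro <image> middle <image>",
    "images": ["https://example.com/1.png", "https://example.com/2.png"],
    "image_caption": ["first", ""],
}
print(render_markdown(sample))
```

An empty caption simply produces an image link with empty alt text, which matches the card's convention of storing `''` for images without captions.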
## **🚦 Quality Checks**
The dataset quality was validated using:
- In-house KenLM-based Arabic models for perplexity checks.
- Manual inspection of samples.
- A pipeline inspired by [OBELICS](https://github.com/huggingface/OBELICS), with custom enhancements.
- Comparative analysis against major existing dataset processing pipelines to validate design choices.
## **🔍 Intended Use**
This dataset is intended for:
- Training large-scale multimodal Arabic language models.
- Research on Arabic NLP, including dialect modeling and low-resource language studies.
## **🌐 Availability & Reproducibility**
To support future research and ensure reproducibility, we are publicly releasing this representative dataset dump. The WASM processing pipeline for Arabic will also be made available to the community.
## **📝 Citation**
If you use this dataset, please cite:
```bibtex
@misc{misraj2025msdd,
  title        = {Misraj Structured Data Dump (MSDD)},
  author       = {Khalil Hennara and Muhammad Hreden and Mohamed Motasim Hamed and Zeina Aldallal and Sara Chrouf and Safwan AlModhayan and Ahmed Bustati},
  year         = {2025},
  publisher    = {MisrajAI},
  howpublished = {\url{https://huggingface.co/datasets/Misraj/msdd}}
}
```
Provider:
maas
Created:
2025-07-07



