ar5iv-no-problem-markdown
Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link: https://modelscope.cn/datasets/marin-community/ar5iv-no-problem-markdown
# Marin Markdownified Ar5iv
Markdownified Ar5iv converts academic papers from arXiv into clean, structured Markdown, totaling **2.74B tokens** across two splits. The dataset preserves the papers' content while making it accessible for language model training on academic text.
| Property | Value |
|---------------------|-------|
| Tokens | 2,742,463,924 |
| Primary source | https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/ |
| File format | JSONL |
| License | C-UDA-1.0 (mirrors upstream Ar5iv licenses) |
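Because the files are JSONL, each line holds one JSON object. The following is a minimal sketch of reading such records, assuming only the `text` field (the one the usage example relies on); the `id` field, the sample records, and the helper name `iter_records` are hypothetical and used purely for illustration.

```python
import io
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical two-record sample standing in for a real file;
# the actual metadata fields may differ.
sample = io.StringIO(
    '{"id": "paper-1", "text": "# Abstract\\nWe study $x^2$."}\n'
    '{"id": "paper-2", "text": "# Introduction\\nHello."}\n'
)
records = list(iter_records(sample))
print(len(records), records[0]["text"].splitlines()[0])
```

For a real file, the same helper could be fed an open file handle instead of the in-memory sample.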
## Processing and Cleaning Pipeline
Our conversion pipeline combines several techniques to transform raw Ar5iv HTML into high-quality Markdown:
1. **HTML Preprocessing:** We start with the Ar5iv dump in Extended DOLMA format, which provides HTML representations of academic papers with metadata.
2. **Structural Cleanup:**
- The abstract is transformed into a proper section heading for consistent document structure
- LaTeX equations are carefully preserved using inline (`$...$`) and display (`$$...$$`) notation
- Code blocks and listings maintain proper formatting with appropriate line breaks
3. **Noise Reduction:**
- Author information is removed
- Title page elements are streamlined to avoid redundancy
- The Ar5iv footer is removed to eliminate conversion metadata
- Figure captions are removed to focus on the main content
- Bibliography sections, footnotes, and citation links are removed
4. **Formatting Cleanup:**
- List items are cleaned to prevent duplicate numbering patterns (e.g., "1. 1.")
- Content before the first main section (typically metadata) is removed
- Equation tables are converted to inline elements for better rendering
5. **DOM Simplification:** We employ a [custom-enhanced version of Resiliparse](https://github.com/stanford-crfm/chatnoir-resiliparse) that preserves semantic HTML structure. Rather than flattening to plain text, we retain important elements such as headings, paragraphs, and lists while removing scripts, tracking code, and boilerplate.
6. **Markdown Conversion:** Our [custom Markdownify](https://github.com/marin-community/marin/blob/main/marin/markdown/markdown.py#L145-L650) implementation transforms the simplified DOM into clean Markdown. The final output stores each article as a JSON object containing the Markdown text and essential metadata.
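Two of the formatting cleanup steps above (collapsing duplicate list numbering and dropping content before the first main section) can be approximated at the text level. The sketch below is not the Marin implementation, which works on the HTML DOM; it is an illustrative regex-based approximation, and the function names and sample string are hypothetical.

```python
import re

# A duplicated list marker such as "1. 1. " at the start of a line.
_DUP_MARKER = re.compile(r"^([ \t]*)(\d+)\.[ \t]+\2\.[ \t]+", flags=re.MULTILINE)
# The start of any Markdown heading line.
_FIRST_HEADING = re.compile(r"^#{1,6} ", flags=re.MULTILINE)


def dedupe_list_markers(md: str) -> str:
    """Collapse "1. 1. item" into "1. item", keeping indentation."""
    return _DUP_MARKER.sub(r"\1\2. ", md)


def drop_front_matter(md: str) -> str:
    """Remove everything before the first heading, if one exists."""
    match = _FIRST_HEADING.search(md)
    return md[match.start():] if match else md


# Hypothetical input with leading metadata and duplicated numbering.
raw = "some metadata line\n# Abstract\n1. 1. First point\n2. 2. Second point\n"
cleaned = dedupe_list_markers(drop_front_matter(raw))
print(cleaned)
```

The backreference `\2` ensures that only a repeated number is collapsed, so a line such as `1. 2. item` is left untouched.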
## Dataset Variants
The Markdownified Ar5iv dataset comes in two variants:
1. **Ar5iv No Problem (2.74B tokens):** Papers that were converted without significant issues or warnings during the HTML generation process. This subset represents the cleanest and most reliable papers.
2. **Ar5iv Warning (19.6B tokens):** Papers that generated warnings during conversion from LaTeX to HTML. While still valuable, these may contain occasional formatting artifacts.
## Usage Example
```python
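from datasets import load_dataset

# Stream the dataset so nothing has to be downloaded up front.
ds = load_dataset(
    "marin-community/ar5iv-no-problem-markdown",
    split="train",
    streaming=True,
)

# Print the Markdown text of the first three articles.
for article in ds.take(3):
    print(article["text"])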
```
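Since equations are kept in `$...$` and `$$...$$` notation, a downstream consumer might want to pull them out of an article's `text`. The following is a rough sketch under simplifying assumptions: it does not handle escaped dollar signs (`\$`) or currency amounts, and the function name `extract_equations` and the sample string are hypothetical.

```python
import re

# Display math is matched first so "$$...$$" is not misread as inline spans.
_DISPLAY = re.compile(r"\$\$(.+?)\$\$", flags=re.DOTALL)
# Inline math: a single "$" on each side, not adjacent to another "$".
_INLINE = re.compile(r"(?<!\$)\$(?!\$)(.+?)(?<!\$)\$(?!\$)")


def extract_equations(md: str) -> dict:
    """Return the inline and display equations found in a Markdown string."""
    display = [m.strip() for m in _DISPLAY.findall(md)]
    # Blank out display blocks before searching for inline math.
    without_display = _DISPLAY.sub(" ", md)
    inline = [m.strip() for m in _INLINE.findall(without_display)]
    return {"inline": inline, "display": display}


sample = "Energy is $E = mc^2$.\n\n$$\na^2 + b^2 = c^2\n$$\n"
equations = extract_equations(sample)
print(equations)
```

The same function could be applied to `article["text"]` inside the loop from the usage example.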
## Citation
If you use this dataset in your research, please cite both the original Ar5iv dataset and our work:
```
@misc{markdownified_ar5iv_2024,
  title = {Markdownified Ar5iv},
  author = {The Marin Community},
  year = {2024},
  url = {https://huggingface.co/datasets/marin-community/ar5iv-no-problem-markdown}
}
```
## License
All content inherits Ar5iv's licensing: C-UDA-1.0. Our conversion tools and pipeline are released under Apache 2.0.
## Acknowledgement
We extend our gratitude to:
- arXiv Labs and KWARC for their work on the Ar5iv dataset
- Janek Bevendorff for the [Resiliparse project](https://github.com/chatnoir-eu/chatnoir-resiliparse)
- Matthew Dapena-Tretter for [Markdownify](https://github.com/matthewwithanm/python-markdownify)
Provider: maas
Created: 2025-10-30