
ar5iv-warning-markdown

ModelScope community · Updated 2025-12-05 · Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/marin-community/ar5iv-warning-markdown
Resource description:
# Marin Markdownified Ar5iv

Markdownified Ar5iv transforms academic papers from arXiv into clean, structured Markdown, totaling **22.34B tokens** across two splits. This dataset preserves the content while making it accessible for language model training on academic text.

| | Value |
|---------------------|-------|
| Tokens | 19 552 307 274 |
| Primary source | https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/ |
| File format | JSONL |
| License | C-UDA-1.0 (mirrors upstream Ar5iv licenses) |

## Processing and Cleaning Pipeline

Our conversion pipeline combines several techniques to transform raw Ar5iv HTML into high-quality Markdown:

1. **HTML Preprocessing:** We start with the Ar5iv dump in Extended DOLMA format, which provides HTML representations of academic papers with metadata.

2. **Structural Cleanup:**
   - The abstract is transformed into a proper section heading for consistent document structure
   - LaTeX equations are preserved using inline ($...$) and display ($$...$$) notation
   - Code blocks and listings maintain proper formatting with appropriate line breaks

3. **Noise Reduction:**
   - Author information is removed
   - Title page elements are streamlined to avoid redundancy
   - The Ar5iv footer is removed to eliminate conversion metadata
   - Figure captions are removed to focus on the main content
   - Bibliography sections, footnotes, and citation links are removed

4. **Formatting Cleanup:**
   - List items are cleaned to prevent duplicate numbering patterns (e.g., "1. 1.")
   - Content before the first main section (typically metadata) is removed
   - Equation tables are converted to inline elements for better rendering

5. **DOM Simplification:** We employ a [custom-enhanced version of Resiliparse](https://github.com/stanford-crfm/chatnoir-resiliparse) that preserves semantic HTML structure. Rather than flattening to plain text, we retain important elements like headings, paragraphs, and lists while removing scripts, tracking code, and boilerplate.

6. **Markdown Conversion:** Our [custom Markdownify](https://github.com/marin-community/marin/blob/main/marin/markdown/markdown.py#L145-L650) implementation transforms the simplified DOM into clean Markdown.

The final output stores each article as a JSON object containing the Markdown text and essential metadata.

## Dataset Variants

The Markdownified Ar5iv dataset comes in two variants:

1. **Ar5iv No Problem (2.74B tokens):** Papers that were converted without significant issues or warnings during the HTML generation process. This subset represents the cleanest and most reliable papers.

2. **Ar5iv Warning (19.6B tokens):** Papers that generated warnings during conversion from LaTeX to HTML. While still valuable, these may contain occasional formatting artifacts.

## Usage Example

```python
from datasets import load_dataset

# Stream the split so the full dataset is not downloaded up front.
ds = load_dataset(
    "marin-community/ar5iv-warning-markdown",
    split="train",
    streaming=True
)

# Print the Markdown text of the first three articles.
for article in ds.take(3):
    print(article["text"])
```

## Citation

If you use this dataset in your research, please cite both the original Ar5iv dataset and our work:

```
@misc{markdownified_ar5iv_2024,
  title = {Markdownified Ar5iv},
  author = {The Marin Community},
  year = {2024},
  url = {https://huggingface.co/datasets/marin-community/ar5iv-warning-markdown}
}
```

## License

All content inherits Ar5iv's licensing: C-UDA-1.0. Our conversion tools and pipeline are released under Apache 2.0.

## Acknowledgement

We extend our gratitude to:

- arXiv Labs and KWARC for their work on the Ar5iv dataset
- Janek Bevendorff for the [Resiliparse project](https://github.com/chatnoir-eu/chatnoir-resiliparse)
- Matthew Dapena-Tretter for [Markdownify](https://github.com/matthewwithanm/python-markdownify)
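
## Illustrative Sketch: Formatting Cleanup

As a rough illustration of two of the formatting cleanup steps described in the pipeline (collapsing duplicate list numbering such as "1. 1." and dropping content before the first main section), the following is a minimal sketch. It is not the Marin implementation: the helper names `dedupe_list_numbers` and `drop_front_matter`, the regular expressions, and the sample text are all assumptions made for illustration, and the real pipeline operates on the HTML DOM rather than on finished Markdown.

```python
import re

# Hypothetical pattern: a list number repeated twice at the start of a line,
# e.g. "1. 1. text". The backreference \2 requires both numbers to be equal.
_DUPLICATE_NUMBER = re.compile(r"^([ \t]*)(\d+)\.[ \t]+\2\.[ \t]+", flags=re.MULTILINE)


def dedupe_list_numbers(markdown: str) -> str:
    """Collapse a repeated list marker such as '1. 1. item' into '1. item'."""
    return _DUPLICATE_NUMBER.sub(r"\1\2. ", markdown)


def drop_front_matter(markdown: str) -> str:
    """Drop everything before the first Markdown heading, if one exists."""
    match = re.search(r"^#{1,6} ", markdown, flags=re.MULTILINE)
    return markdown[match.start():] if match else markdown


# Made-up sample text, not taken from the dataset.
sample = "leftover metadata line\n# Abstract\n1. 1. First point\n2. 2. Second point\n"
cleaned = dedupe_list_numbers(drop_front_matter(sample))
print(cleaned)
```

Because of the backreference, a line such as "1. 2. x" is left untouched; only an exactly repeated number is collapsed. This is a deliberately conservative choice for the sketch.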

Provider:
maas
Created:
2025-10-30