arxiv-markdown
Hosted on ModelScope (updated 2025-12-05, indexed 2025-05-03).
Download link: https://modelscope.cn/datasets/AI-ModelScope/arxiv-markdown
# Dataset card for arxiv-markdown
## Dataset Description
* **GitHub:** [https://github.com/marcodsn/academic-chains](https://github.com/marcodsn/academic-chains/tree/main/large_scale)
* **Dataset:** [https://huggingface.co/datasets/marcodsn/arxiv-markdown](https://huggingface.co/datasets/marcodsn/arxiv-markdown) (this page)
> [!Note]
> **[25/04/2025]** Images are now hosted on Cloudflare R2 and referenced in the markdown files as external URLs rather than embedded as base64. This significantly reduces storage requirements while maintaining full image access. Images are referenced in the dataset as `![Image](url)`. Community mirrors are welcome!
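
Since images are now plain markdown image references, a consumer can list the external URLs used by a paper with a small regular expression. The following is a minimal sketch, assuming the `![Image](url)` form described above; the function name `extract_image_urls` and the sample URLs are hypothetical, and the pattern does not handle image titles or parentheses inside URLs.

```python
import re

# Matches markdown image references such as ![Image](https://example.com/a.png).
# Group 1 is the alt text, group 2 is the URL (no whitespace or ")" allowed).
IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def extract_image_urls(markdown: str) -> list[str]:
    """Return the URLs of all image references, in order of appearance."""
    return [match.group(2) for match in IMAGE_REF.finditer(markdown)]


if __name__ == "__main__":
    sample = (
        "Some text.\n"
        "![Image](https://example.com/fig1.png)\n"
        "A [normal link](https://example.com/page) is ignored.\n"
    )
    print(extract_image_urls(sample))
```

The collected URLs could then be downloaded or rewritten to point at a mirror.
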
This dataset contains open-access papers retrieved from [arXiv](https://arxiv.org) and converted to markdown format using [docling](https://github.com/docling-project/docling); specifically, we use the following docling pipeline:
```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

IMAGE_RESOLUTION_SCALE = 2.0

pipeline_options = PdfPipelineOptions()
pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE  # 2.0
pipeline_options.generate_page_images = True  # Necessary for the extraction of figures (as far as I understand)
pipeline_options.generate_picture_images = True  # For the extraction of figures
pipeline_options.do_code_enrichment = True  # For obtaining code blocks
pipeline_options.do_formula_enrichment = True  # For converting formulas to LaTeX

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
```
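
Once the converter is configured, turning a PDF into a markdown file takes only a couple of calls. The following is a minimal sketch rather than the actual extraction script used for this dataset: `convert_pdf_to_markdown` and the output layout are hypothetical, and it assumes the object passed in behaves like docling's `DocumentConverter`, that is, `convert(source)` returns a result whose `document` offers `export_to_markdown()`. How the images end up as external URLs is not covered here.

```python
from pathlib import Path


def convert_pdf_to_markdown(converter, pdf_path: str, out_dir: str) -> Path:
    """Convert one PDF and write the resulting markdown into out_dir.

    `converter` is expected to behave like docling's DocumentConverter
    (an assumption of this sketch, not a statement about the dataset's code).
    """
    result = converter.convert(pdf_path)
    markdown = result.document.export_to_markdown()

    # Name the output after the PDF, e.g. paper.pdf -> paper.md.
    out_path = Path(out_dir) / (Path(pdf_path).stem + ".md")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(markdown, encoding="utf-8")
    return out_path
```

With the `converter` built above, `convert_pdf_to_markdown(converter, "paper.pdf", "out")` would write `out/paper.md`; the file name `paper.pdf` is a placeholder.
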
Currently, the dataset contains entries extracted from `August 2014`, `August 2019` and `August 2024`; we are dedicating an RTX 3090 full-time to extracting this data and will continue to upload new entries as they are processed.
You can find the code we are using for the data generation on [GitHub](https://github.com/marcodsn/arxiv-markdown) (optimizations and general suggestions are very welcome!).
## Processed Papers
The current distribution of papers by month is:
- `2014-08`: **1097** papers (and growing)
- `2019-08`: **994** papers
- `2024-08`: **1178** papers
More papers will be added, both for the months already present and for new ones!
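
For a quick sense of scale, the counts above can be totalled and turned into per-month shares. The snippet below is a small sketch that uses only the numbers listed in this section; they are a snapshot and will drift as more papers are uploaded. The names `PAPER_COUNTS` and `month_shares` are illustrative.

```python
# Snapshot of the per-month counts listed in this section.
PAPER_COUNTS = {"2014-08": 1097, "2019-08": 994, "2024-08": 1178}


def month_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each month's fraction of the total number of papers."""
    total = sum(counts.values())
    return {month: n / total for month, n in counts.items()}


if __name__ == "__main__":
    total = sum(PAPER_COUNTS.values())
    print(f"total papers: {total}")  # 1097 + 994 + 1178 = 3269
    for month, share in month_shares(PAPER_COUNTS).items():
        print(f"{month}: {share:.1%}")
```
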
## Dataset Scope
This dataset is being built as the first step toward expanding the [academic-chains](https://huggingface.co/datasets/marcodsn/academic-chains) dataset, but other creative uses of this dataset by the community are welcome and encouraged! The license is given below in this same file.
## Limitations and Biases
* **Extraction Fidelity:** While docling is amazing, it is not perfect, and extraction glitches (especially in tables) may still be present.
* **Slow Data Generation:** Document extraction with formula enrichment, code enrichment, and picture extraction enabled is SLOW on our 3090. We did not use these options when extracting markdown for the academic-chains dataset because they were not strictly necessary, but we would like to do things right this time, even though it slows us down (relatively).
**Note:** We see there is ongoing work on supporting batched inference in docling, as well as updates on using VLMs, so we will try to keep our pipeline up to date and efficient!
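
Because tables are the most likely place for extraction glitches, downstream users may want a cheap sanity check before trusting a table. The sketch below is one possible heuristic, not part of the dataset's pipeline: it assumes tables are rendered as pipe-delimited markdown rows and flags any run of consecutive table rows whose cell counts disagree. The function names are hypothetical, and escaped pipes (`\|`) inside cells are not handled.

```python
def count_cells(row: str) -> int:
    """Count the cells of a pipe-delimited markdown table row."""
    inner = row.strip().removeprefix("|").removesuffix("|")
    return len(inner.split("|"))


def find_ragged_tables(markdown: str) -> list[int]:
    """Return the 1-based starting line of each table with uneven rows."""
    ragged = []
    start = None    # line number where the current table began
    widths = set()  # distinct cell counts seen in the current table

    # A trailing empty line guarantees the last table is flushed.
    for lineno, line in enumerate(markdown.splitlines() + [""], start=1):
        if line.strip().startswith("|"):
            if start is None:
                start = lineno
            widths.add(count_cells(line))
        elif start is not None:
            if len(widths) > 1:
                ragged.append(start)
            start, widths = None, set()
    return ragged
```

A flagged table is not necessarily wrong, and a consistent one is not necessarily right; the check only narrows down where to look by hand.
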
## Acknowledgements
A big-big-BIG thank you to arXiv and to all the authors of the open-access papers present in our dataset (and to all the others too!). And thank you to whoever supported my original academic-chains dataset too, you are the reason why I started working on this so soon and with a smile on my face!
## Licensing Information
This dataset is licensed under the [CC-BY-4.0 License](https://creativecommons.org/licenses/by/4.0/).
## Citation Information
```
@misc{marcodsn_2025_arxivmarkdown,
  title = {arxiv-markdown},
  author = {Marco De Santis},
  month = {April},
  year = {2025},
  url = {https://huggingface.co/datasets/marcodsn/arxiv-markdown}
}
```
Provider: maas
Created: 2025-04-28



