Smith42/minty-astro-ph
Source: Hugging Face. Updated 2026-04-23. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Smith42/minty-astro-ph
Resource description:
MINT-1T ArXiv Astro-ph is an astronomy-focused subset of the MINT-1T-ArXiv dataset, restricted to papers in the `astro-ph` arXiv category (including cross-listed papers). It contains ~845k papers totalling ~804 GB, stored as WebDataset tar shards (287 shards, ~3 GB each). Each paper consists of a paired JSON file and TIFF file: the JSON file holds the text segments (interleaved with images), the image paths, and the image captions; the TIFF file holds the rendered figure/page images. Papers are selected by arXiv ID prefix for pre-2007 papers and by a metadata snapshot for post-2007 papers, so that all `astro-ph` category papers are included.
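Because each paper is stored as a JSON/TIFF pair inside a tar shard, a reader has to group tar members back into per-paper samples. The sketch below shows one way to do that with only the Python standard library, assuming the usual WebDataset convention that members of one sample are adjacent and share the basename up to the first dot. The member names (`paper_0001.json`, `paper_0001.tiff`), the JSON field `texts`, and the helpers `iter_samples` and `make_demo_shard` are hypothetical and chosen for illustration; the real shards may use different keys and fields.

```python
import io
import json
import tarfile
from typing import Dict, Iterator, Tuple


def iter_samples(fileobj) -> Iterator[Tuple[str, Dict[str, bytes]]]:
    """Yield (key, {extension: bytes}) for each sample in a tar stream.

    Assumes members of one sample are adjacent and share the same key,
    where the key is the member path up to the first dot of the basename.
    """
    current_key = None
    files: Dict[str, bytes] = {}
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            dirname, _, basename = member.name.rpartition("/")
            stem, _, ext = basename.partition(".")
            key = f"{dirname}/{stem}" if dirname else stem
            # A new key means the previous sample is complete.
            if key != current_key and files:
                yield current_key, files
                files = {}
            current_key = key
            files[ext] = tar.extractfile(member).read()
    if files:
        yield current_key, files


def make_demo_shard() -> io.BytesIO:
    """Build a tiny in-memory shard with hypothetical member names."""
    members = [
        # "texts" is a made-up field name; None marks an image slot.
        ("paper_0001.json", json.dumps({"texts": ["Abstract text", None]}).encode()),
        ("paper_0001.tiff", b"II*\x00"),  # placeholder bytes, not a real image
        ("paper_0002.json", json.dumps({"texts": ["Another paper"]}).encode()),
    ]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in members:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for key, files in iter_samples(make_demo_shard()):
        print(key, sorted(files))
```

To read a downloaded shard, the same function could be called with an open file, for example `iter_samples(open(path, "rb"))`, where `path` points at one of the tar files. Reading in streaming mode avoids loading a ~3 GB shard into memory at once.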
Provider:
Smith42



