sanniukin/Sci-Base
Source: Hugging Face. Updated 2026-04-26; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/sanniukin/Sci-Base
Description:
Sci-Base is a large-scale, objective scientific knowledge base dataset and a core component of the Sciverse scientific data foundation. It comprises over 25 million deeply cleaned and parsed Open Access scientific documents (papers and books), digitally reconstructed at the "pixel level" by the MinerU intelligent document parsing engine. This process converts fragmented academic documents containing complex mathematical equations, chemical formulas, and high-precision charts into over 600 billion AI-ready tokens. According to the provider, it preserves the logical chains and original typographical structure of the scientific literature. It covers 10 core scientific disciplines: Mathematics and Computational Science, Physics, Chemistry, Life Sciences, Earth and Atmospheric Sciences, Astronomy and Space Sciences, Medicine and Health Sciences, Materials Science and Engineering, Energy and Power Science, and Engineering and Manufacturing Science. With a knowledge cutoff of March 2026, it is described as the largest dataset of its kind currently available and is intended to provide high-quality data infrastructure for the AI for Science (AI4S) community.
Provider:
sanniukin



