Infi-MM/InfiMM-WebMath-40B
Source: Hugging Face. Updated 2025-07-26; indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/Infi-MM/InfiMM-WebMath-40B
Description:
InfiMM-WebMath-40B is a large-scale, open-source multimodal dataset designed for mathematical reasoning. It combines text and images extracted from web documents and is intended for pre-training Multimodal Large Language Models (MLLMs). It targets reasoning tasks that require understanding both text and visual elements such as diagrams, figures, and geometric plots. The dataset comprises 24 million web documents, 85 million image URLs, and 40 billion text tokens, sourced from Common Crawl snapshots from 2019 to 2023 and passed through a multi-stage filtering and extraction process to ensure high quality.
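To put the headline figures in proportion, the short sketch below derives rough per-document averages from the three counts quoted in the description. The inputs are the rounded numbers stated above, and the function name `per_document_averages` is only illustrative, so the results are order-of-magnitude estimates rather than official dataset statistics.

```python
# Headline counts as quoted in the description (rounded figures).
NUM_DOCUMENTS = 24_000_000
NUM_IMAGE_URLS = 85_000_000
NUM_TEXT_TOKENS = 40_000_000_000


def per_document_averages(docs, image_urls, tokens):
    """Return (average tokens per document, average image URLs per document)."""
    return tokens / docs, image_urls / docs


avg_tokens, avg_images = per_document_averages(
    NUM_DOCUMENTS, NUM_IMAGE_URLS, NUM_TEXT_TOKENS
)
print(f"~{avg_tokens:,.0f} text tokens per document")
print(f"~{avg_images:.1f} image URLs per document")
```

Under these rounded inputs, an average document carries on the order of 1,700 text tokens and between three and four image URLs.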
Provider:
Infi-MM



