
OpenDataArena/MathLake

Source: Hugging Face. Updated: 2026-04-27. Indexed: 2025-12-20.
Download link:
https://hf-mirror.com/datasets/OpenDataArena/MathLake
Description:
MathLake is a large-scale collection of 8.3 million mathematical problems aggregated from more than 50 open-source datasets. Unlike datasets that immediately filter for the highest-quality solutions, MathLake prioritizes query comprehensiveness and serves as a universal "raw ore" that researchers can further curate, distill, or annotate. It provides annotations for Difficulty, Format, and Subject, and covers mathematical fields ranging from basic arithmetic to advanced topics. Construction involved rigorous source-dataset selection, query deduplication, cleaning, and answer extraction. Each record is standardized to include fields such as question, source, original response, extracted answer, subject, format, and difficulty. The metadata annotations (subject, difficulty level, and format classification) were generated by a specialized LLM-based pipeline.
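
As a rough illustration of the record structure and the query-deduplication step described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the field names and types (including an integer `difficulty` scale), the normalization rule (lowercasing and collapsing whitespace), and the toy records. It is not the dataset's confirmed schema or the pipeline OpenDataArena actually used.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass
class MathRecord:
    # Hypothetical field names modeled on the description; the real
    # column names and types in the dataset may differ.
    question: str
    source: str
    original_response: str
    extracted_answer: Optional[str]
    subject: str
    format: str
    difficulty: int  # assumed integer scale, for illustration only


def normalize_query(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(records: Iterable[MathRecord]) -> Iterator[MathRecord]:
    """Yield the first record seen for each normalized question."""
    seen = set()
    for record in records:
        key = normalize_query(record.question)
        if key not in seen:
            seen.add(key)
            yield record


def select(records: Iterable[MathRecord], subject: str,
           max_difficulty: int) -> List[MathRecord]:
    """Keep records of one subject at or below a difficulty threshold."""
    return [r for r in records
            if r.subject == subject and r.difficulty <= max_difficulty]


# Toy records invented for this example; they are not taken from MathLake.
SAMPLE = [
    MathRecord("What is 2 + 3?", "toy-a", "2 + 3 = 5", "5",
               "Arithmetic", "Problem-Solving", 1),
    MathRecord("what is  2 + 3?", "toy-b", "The sum is 5.", "5",
               "Arithmetic", "Problem-Solving", 1),
    MathRecord("Prove that sqrt(2) is irrational.", "toy-a",
               "Suppose sqrt(2) = p/q in lowest terms; then p and q are both even.",
               None, "Number Theory", "Proof", 6),
]

unique = list(deduplicate(SAMPLE))
easy_arithmetic = select(unique, "Arithmetic", max_difficulty=3)
```

Exact-match deduplication on a normalized string is the simplest possible choice; a real pipeline at this scale would likely need fuzzier matching, which this sketch does not attempt.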
Provider:
OpenDataArena