Open-Omega-Explora-2.5M

Name: Open-Omega-Explora-2.5M
Creator: maas
Published: 2025-12-03 17:17:25
License: no description provided

ModelScope community: updated 2025-12-03, indexed 2025-07-19

Download link:

https://modelscope.cn/datasets/prithivMLmods/Open-Omega-Explora-2.5M

Description:

![1.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/OcYNWi_UuaHKEu7cUWObK.png)

# **Open-Omega-Explora-2.5M**

> Open-Omega-Explora-2.5M is a high-quality, large-scale reasoning dataset blending the strengths of both **Open-Omega-Forge-1M** and **Open-Omega-Atom-1.5M**. This unified dataset is crafted for advanced tasks in mathematics, coding, and science reasoning, with a majority of math-centric examples. It is built for broad coverage across training, evaluation, and benchmarking in AI research, STEM education, and scientific toolchains.

> Mixture of Mathematics, Coding, and Science.

---

## Overview

- **Dataset Name:** Open-Omega-Explora-2.5M
- **Curated by:** prithivMLmods
- **Size:** ~2.63 million entries (2.18GB in first 5GB train split)
- **Formats:** `.arrow`, Parquet
- **Languages:** English
- **License:** Apache-2.0

## Highlights

- **Comprehensive Coverage:** Merges and enhances the content of both the Forge and Atom datasets for greater STEM reasoning diversity and accuracy.
- **Reasoning-Focused:** Prioritizes multi-step, logic-driven problems and thorough explanations.
- **Optimization:** Integrates curated, modular, and open-source contributions for dataset coherence and utility.
- **Emphasis on Math & Science:** High proportion of mathematical reasoning, algorithmic problem-solving, and scientific analysis examples.

---

## Quick Start with Hugging Face Datasets 🤗

```sh
pip install -U datasets
```

```py
from datasets import load_dataset

dataset = load_dataset("prithivMLmods/Open-Omega-Explora-2.5M", split="train")
```

## Dataset Structure

Each entry includes:

- **problem:** The task or question, primarily in mathematics, science, or code.
- **solution:** A clear, stepwise, reasoned answer.
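Since the full train split is large, it can be convenient to peek at a few entries before downloading everything. The following is a minimal sketch, not part of the original card: `preview_records` and `stream_train_split` are hypothetical helper names, and the sample record is made up for illustration. It assumes only the `problem` and `solution` fields listed above and the `streaming=True` option of `load_dataset`.

```python
from itertools import islice
from typing import Dict, Iterable, List


def preview_records(records: Iterable[Dict[str, str]], n: int = 3, width: int = 80) -> List[Dict[str, str]]:
    """Return the first n records with both fields truncated to `width` characters."""
    preview = []
    for record in islice(records, n):
        preview.append({key: str(record[key])[:width] for key in ("problem", "solution")})
    return preview


def stream_train_split():
    """Open the train split lazily (needs network access and the `datasets` package)."""
    # Imported here so preview_records stays usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "prithivMLmods/Open-Omega-Explora-2.5M", split="train", streaming=True
    )


# Offline demonstration with a made-up record (not taken from the dataset).
sample = [{"problem": "What is 2 + 2?", "solution": "2 + 2 = 4, so the answer is 4."}]
for row in preview_records(sample):
    print(row["problem"], "->", row["solution"])
```

With network access, `preview_records(stream_train_split())` would apply the same truncation to real entries without materializing the whole split.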
### Schema Example

| Column   | Type   | Description                    |
|----------|--------|--------------------------------|
| problem  | string | Problem statement/task         |
| solution | string | Stepwise, thorough explanation |

## Response format:

```
<think>
-- reasoning trace --
</think>
-- answer --
```

---

## Data Sources

Open-Omega-Explora-2.5M is a curated, optimized combination of the following core datasets:

- **Open-Omega-Forge-1M**
  - Sourced from:
    - XenArcAI/MathX-5M
    - nvidia/OpenCodeReasoning
    - nvidia/OpenScience
    - OpenMathReasoning & more
- **Open-Omega-Atom-1.5M**
  - Sourced from:
    - nvidia/OpenScience
    - nvidia/AceReason-1.1-SFT
    - nvidia/Nemotron-PrismMath & more
- Custom modular contributions by prithivMLmods

All foundations have been refined and unified for enhanced reasoning and STEM task coverage.

## Applications

- Training and evaluation of large language models (LLMs) on STEM, logic, and coding tasks
- Benchmark creation for advanced reasoning and problem-solving evaluation
- Research into mathematical, scientific, and code-based intelligence
- Supporting education tools in math, science, and programming

---

## Citation

If you use this dataset, please cite:

```
Open-Omega-Explora-2.5M by prithivMLmods
Derived and curated from:
- Custom modular contributions by prithivMLmods
- Open-Omega-Forge-1M (XenArcAI/MathX-5M, nvidia/OpenCodeReasoning, nvidia/OpenScience, OpenMathReasoning & more)
- Open-Omega-Atom-1.5M (nvidia/OpenScience, nvidia/AceReason-1.1-SFT, nvidia/Nemotron-PrismMath & more)
```

## License

This dataset is provided under the Apache-2.0 License. Ensure compliance with the license terms of all underlying referenced datasets.
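As a usage note on the response format described above, a `solution` string can be split into its reasoning trace and final answer. This is a minimal sketch under the assumption that each solution contains at most one `<think>...</think>` block followed by the answer. `split_solution` and `ParsedSolution` are hypothetical names, and the example string is made up rather than taken from the dataset.

```python
import re
from typing import NamedTuple, Optional

# Non-greedy match so only the first <think>...</think> block is captured;
# DOTALL lets the reasoning trace span multiple lines.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


class ParsedSolution(NamedTuple):
    reasoning: Optional[str]  # None when no <think> block is present
    answer: str


def split_solution(solution: str) -> ParsedSolution:
    """Separate the reasoning trace from the text that follows it."""
    match = THINK_PATTERN.search(solution)
    if match is None:
        # Assumed fallback: treat the whole string as the answer.
        return ParsedSolution(None, solution.strip())
    return ParsedSolution(match.group(1).strip(), solution[match.end():].strip())


example = "<think>\n2 + 2 = 4\n</think>\nThe answer is 4."
parsed = split_solution(example)
print(parsed.reasoning)  # 2 + 2 = 4
print(parsed.answer)     # The answer is 4.
```

Returning `None` for a missing trace, rather than raising, is a design choice that keeps the helper usable if some entries deviate from the documented format.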

Provider: maas

Created: 2025-07-16
