Open-Omega-Explora-2.5M
Source: ModelScope community. Updated 2025-12-03; indexed 2025-07-19.
Download link:
https://modelscope.cn/datasets/prithivMLmods/Open-Omega-Explora-2.5M
Resource description:

# **Open-Omega-Explora-2.5M**
> Open-Omega-Explora-2.5M is a high-quality, large-scale reasoning dataset that combines **Open-Omega-Forge-1M** and **Open-Omega-Atom-1.5M**. The unified dataset targets advanced tasks in mathematics, coding, and science reasoning, with math-centric examples making up the majority. It is intended for training, evaluation, and benchmarking in AI research, STEM education, and scientific toolchains.
> Mixture of Mathematics, Coding, and Science.
---
## Overview
- **Dataset Name:** Open-Omega-Explora-2.5M
- **Curated by:** prithivMLmods
- **Size:** ~2.63 million entries (2.18 GB reported for the first 5 GB of the train split)
- **Formats:** `.arrow`, Parquet
- **Languages:** English
- **License:** Apache-2.0
## Highlights
- **Comprehensive Coverage:** Merges and enhances the content of both Forge and Atom datasets for superior STEM reasoning diversity and accuracy.
- **Reasoning-Focused:** Prioritizes multi-step, logic-driven problems and thorough explanations.
- **Optimization:** Integrates curated, modular, and open-source contributions for maximum dataset coherence and utility.
- **Emphasis on Math & Science:** High proportion of mathematical reasoning, algorithmic problem-solving, and scientific analysis examples.
---
## Quick Start with Hugging Face Datasets 🤗
```bash
pip install -U datasets
```
```py
from datasets import load_dataset
dataset = load_dataset("prithivMLmods/Open-Omega-Explora-2.5M", split="train")
```
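The overview lists roughly 2.63 million entries, so it can help to inspect a few records before downloading the whole split. The sketch below assumes the `streaming=True` option of `load_dataset`, which iterates lazily instead of materializing the split. The helper names `take_first` and `stream_preview` are hypothetical and not part of the dataset card.

```py
from itertools import islice


def take_first(records, n=3):
    """Return the first n items of any iterable of records as a list."""
    return list(islice(records, n))


def stream_preview(n=3):
    """Fetch the first n records without downloading the whole split.

    Assumes network access and an installed `datasets` package; the import is
    kept inside the function so that `take_first` works without it.
    """
    from datasets import load_dataset

    stream = load_dataset(
        "prithivMLmods/Open-Omega-Explora-2.5M",
        split="train",
        streaming=True,  # iterate lazily instead of materializing the split
    )
    return take_first(stream, n)
```

Calling `stream_preview(2)` should return two dictionaries with the `problem` and `solution` fields, provided the dataset is reachable.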
## Dataset Structure
Each entry includes:
- **problem:** The task or question, primarily in mathematics, science, or code.
- **solution:** A clear, stepwise, reasoned answer.
### Schema Example
| Column | Type | Description |
|----------|--------|--------------------------------|
| problem | string | Problem statement/task |
| solution | string | Stepwise, thorough explanation |
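A quick check against this schema can catch malformed rows before training. The sketch below is illustrative only: `EXPECTED_COLUMNS` and `is_valid_record` are hypothetical names, and treating empty strings as invalid is an assumption rather than a rule stated by the dataset card.

```py
EXPECTED_COLUMNS = ("problem", "solution")  # column names from the schema table


def is_valid_record(record):
    """Check that a record has non-empty string values for both columns.

    Rejecting empty or whitespace-only strings is an assumption made for
    this sketch, not a documented guarantee of the dataset.
    """
    for column in EXPECTED_COLUMNS:
        value = record.get(column)
        if not isinstance(value, str) or not value.strip():
            return False
    return True
```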
## Response Format
```text
<think>
-- reasoning trace --
</think>
-- answer --
```
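To separate the reasoning trace from the final answer, the `<think>` block can be split off with a regular expression. This is a minimal sketch that assumes each solution contains at most one `<think>...</think>` block at the start. `split_response` is a hypothetical helper, and rows that deviate from the format fall back to an empty trace.

```py
import re

# Matches a leading <think>...</think> block, then captures the remainder.
_THINK_RE = re.compile(r"^\s*<think>(.*?)</think>(.*)$", re.DOTALL)


def split_response(solution):
    """Return (reasoning_trace, answer) for a solution string.

    If no leading <think> block is found, the trace is empty and the whole
    string is treated as the answer.
    """
    match = _THINK_RE.match(solution)
    if match is None:
        return "", solution.strip()
    return match.group(1).strip(), match.group(2).strip()
```

Keeping the two parts separate makes it possible, for example, to train on the full response while scoring only the answer.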
---
## Data Sources
Open-Omega-Explora-2.5M is a curated, optimized combination of the following core datasets:
- **Open-Omega-Forge-1M**
  - Sourced from:
    - XenArcAI/MathX-5M
    - nvidia/OpenCodeReasoning
    - nvidia/OpenScience
    - OpenMathReasoning & more
- **Open-Omega-Atom-1.5M**
  - Sourced from:
    - nvidia/OpenScience
    - nvidia/AceReason-1.1-SFT
    - nvidia/Nemotron-PrismMath & more
- Custom modular contributions by prithivMLmods

All foundations have been meticulously refined and unified for enhanced reasoning and STEM task coverage.
## Applications
- Training and evaluation of large language models (LLMs) on STEM, logic, and coding tasks
- Benchmark creation for advanced reasoning and problem-solving evaluation
- Research into mathematical, scientific, and code-based intelligence
- Supporting education tools in math, science, and programming
---
## Citation
If you use this dataset, please cite:
```
Open-Omega-Explora-2.5M by prithivMLmods
Derived and curated from:
- Custom modular contributions by prithivMLmods
- Open-Omega-Forge-1M (XenArcAI/MathX-5M, nvidia/OpenCodeReasoning, nvidia/OpenScience, OpenMathReasoning & more)
- Open-Omega-Atom-1.5M (nvidia/OpenScience, nvidia/AceReason-1.1-SFT, nvidia/Nemotron-PrismMath & more)
```
## License
This dataset is provided under the Apache-2.0 License. Ensure compliance with the license terms of all underlying referenced datasets.

Provider: maas
Created: 2025-07-16



