Open-Omega-Forge-1M
收藏魔搭社区2026-01-06 更新2025-07-19 收录
下载链接:
https://modelscope.cn/datasets/prithivMLmods/Open-Omega-Forge-1M
下载链接
链接失效反馈官方服务:
资源简介:

# **Open-Omega-Forge-1M**
> Open-Omega-Forge-1M is a carefully curated and optimized collection derived from multiple high-quality datasets, specifically designed to enhance reasoning capabilities across mathematical, scientific, and coding domains. This dataset represents a focused subset that maintains the quality and diversity of reasoning patterns while providing a more manageable size for training and evaluation. A high-quality, compact reasoning dataset designed for mathematics, code, and science applications, with mathematics playing a major role in the dataset composition.
> Mixture of Mathematics, Coding, and Science.
---
## Overview
- **Dataset Name:** Open-Omega-Forge-1M
- **Shaped & Curated by:** PrithivMLMods
- **Size:** 1M examples (~2.17GB in train split [partial])
- **Formats:** `.arrow`, Parquet
- **Languages:** English
- **License:** Apache-2.0
## Key Features
- **Compact and Clean:** Focuses on concise, high-signal problems with clear solutions.
- **Interdisciplinary:** Integrates math, programming, and scientific reasoning tasks.
- **Optimization:** Derived from several leading open datasets, further filtered and enhanced for quality and diversity.
- **Cutoff Emphasis:** Majority of the dataset is math-centric, supporting broader STEM modeling needs.
---
## Quick Start with Hugging Face Datasets🤗
```py
pip install -U datasets
```
```py
from datasets import load_dataset
dataset = load_dataset("prithivMLmods/Open-Omega-Forge-1M", split="train")
```
## Dataset Structure
Each entry contains:
- **problem:** The problem statement (text)
- **solution:** The step-by-step, reasoning-focused solution
Example schema:
| Column | Type | Description |
|----------|--------|----------------------------|
| problem | string | Description of the problem |
| solution | string | Reasoning-based solution |
## Response format:
```py
<think>
-- reasoning trace --
</think>
-- answer --
```
---
## Data Sources
Open-Omega-Forge-1M is a carefully curated and optimized derivative collection, sourced from the following open datasets:
- Curated and blended modular dataset from [PrithivMLmods](https://huggingface.co/prithivMLmods). [others]
- XenArcAI/MathX-5M
- nvidia/OpenCodeReasoning
- nvidia/OpenScience
- OpenMathReasoning
> [!note]
The collection also includes custom modular data from prithivMLmods, meticulously blended to ensure quality, balance, and diversity.
## Applications
Open-Omega-Forge-1M is ideal for:
- Training and evaluation of large language models (LLMs) in STEM fields
- Benchmarking math and multi-step reasoning models
- Research in mathematical problem solving, algorithmic code understanding, and scientific reasoning
---
## Citation
If you use this dataset, please cite:
```bitex
Open-Omega-Forge-1M by prithivMLmods
Derived and curated from:
- Curated and blended modular dataset from prithivMLmods. [others]
- XenArcAI/MathX-5M
- nvidia/OpenCodeReasoning
- nvidia/OpenScience
- OpenMathReasoning
```
## License
This dataset is distributed under the Apache-2.0 License. Please review the license terms of all referenced underlying datasets before use.
# **Open-Omega-Forge-1M**

> Open-Omega-Forge-1M 是一款经精心甄选与优化的数据集集合,源自多个高质量开源数据集,专为提升数学、科学与编码领域的推理能力而打造。本数据集为精选子集,既保留了推理模式的质量与多样性,又将规模控制在便于训练与评估的范围内。这是一款高质量、轻量化的推理数据集,面向数学、代码与科学应用,其中数学内容占主要组成部分。
> 涵盖数学、编码与科学领域
---
## 数据集概览
- **数据集名称:** Open-Omega-Forge-1M
- **整理与甄选方:** PrithivMLMods
- **规模:** 100万条样本(训练拆分集[部分]约2.17GB)
- **存储格式:** `.arrow`、Parquet
- **语言:** 英语
- **许可证:** Apache-2.0
## 核心特性
- **轻量化且简洁:** 聚焦于表述精炼、高价值且附带清晰解答的问题
- **跨学科性:** 整合数学、编程与科学推理任务
- **优化处理:** 源自多个主流开源数据集,经进一步筛选以提升质量与多样性
- **侧重数学占比:** 数据集主体以数学内容为核心,可满足更广泛的STEM(科学、技术、工程、数学)建模需求
---
## Hugging Face Datasets 🤗 快速上手
py
pip install -U datasets
py
from datasets import load_dataset
dataset = load_dataset("prithivMLmods/Open-Omega-Forge-1M", split="train")
## 数据集结构
每条样本包含以下字段:
- **problem:** 问题描述(文本格式)
- **solution:** 基于推理的分步解答
示例字段结构:
| 字段名 | 类型 | 说明 |
|----------|--------|----------------------------|
| problem | 字符串 | 问题描述 |
| solution | 字符串 | 基于推理的解答 |
## 响应格式:
py
<think>
-- 推理过程 --
</think>
-- 回答内容 --
---
## 数据来源
Open-Omega-Forge-1M 是一款经精心筛选与优化的衍生数据集,其数据源自以下开源数据集:
- 由[PrithivMLMods](https://huggingface.co/prithivMLmods)构建的甄选混合模块化数据集。[其他来源]
- XenArcAI/MathX-5M
- nvidia/OpenCodeReasoning
- nvidia/OpenScience
- OpenMathReasoning
> 【注】本数据集还包含PrithivMLMods定制的模块化数据,经精心混合以确保数据质量、平衡性与多样性。
## 应用场景
Open-Omega-Forge-1M 适用于以下场景:
- 科学、技术、工程、数学(STEM)领域大语言模型(LLM)的训练与评估
- 数学与多步推理模型的基准测试
- 数学问题求解、算法代码理解与科学推理相关研究
---
## 引用方式
若使用本数据集,请按以下方式引用:
bitex
Open-Omega-Forge-1M by prithivMLmods
Derived and curated from:
- Curated and blended modular dataset from prithivMLmods. [others]
- XenArcAI/MathX-5M
- nvidia/OpenCodeReasoning
- nvidia/OpenScience
- OpenMathReasoning
## 许可证
本数据集采用Apache-2.0许可证分发。使用前请查阅所有引用的底层数据集的许可证条款。
提供机构:
maas
创建时间:
2025-07-16



