Open-Omega-Forge-1M

Source: ModelScope community. Updated: 2026-01-06. Indexed: 2025-07-19.
Download link:
https://modelscope.cn/datasets/prithivMLmods/Open-Omega-Forge-1M
Resource description:
![1.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/Te8Q3FEhkWugGLNVts7X_.png)

# **Open-Omega-Forge-1M**

> Open-Omega-Forge-1M is a curated and optimized collection derived from multiple high-quality datasets, designed to enhance reasoning capabilities across mathematical, scientific, and coding domains. It is a focused subset that maintains the quality and diversity of reasoning patterns while keeping a more manageable size for training and evaluation. It is a compact reasoning dataset for mathematics, code, and science applications, with mathematics making up the largest share of the composition.

> Mixture of Mathematics, Coding, and Science.

---

## Overview

- **Dataset Name:** Open-Omega-Forge-1M
- **Shaped & Curated by:** PrithivMLMods
- **Size:** 1M examples (~2.17GB in train split [partial])
- **Formats:** `.arrow`, Parquet
- **Languages:** English
- **License:** Apache-2.0

## Key Features

- **Compact and Clean:** Focuses on concise, high-signal problems with clear solutions.
- **Interdisciplinary:** Integrates math, programming, and scientific reasoning tasks.
- **Optimization:** Derived from several leading open datasets, further filtered and enhanced for quality and diversity.
- **Cutoff Emphasis:** The majority of the dataset is math-centric, while still supporting broader STEM modeling needs.

---

## Quick Start with Hugging Face Datasets🤗

```shell
pip install -U datasets
```

```py
from datasets import load_dataset

# Load the train split of the dataset from the Hugging Face Hub
dataset = load_dataset("prithivMLmods/Open-Omega-Forge-1M", split="train")
```

## Dataset Structure

Each entry contains:

- **problem:** The problem statement (text)
- **solution:** The step-by-step, reasoning-focused solution

Example schema:

| Column   | Type   | Description                |
|----------|--------|----------------------------|
| problem  | string | Description of the problem |
| solution | string | Reasoning-based solution   |

## Response format:

```text
<think>
-- reasoning trace --
</think>
-- answer --
```

---

## Data Sources

Open-Omega-Forge-1M is a curated and optimized derivative collection, sourced from the following open datasets:

- Curated and blended modular dataset from [PrithivMLmods](https://huggingface.co/prithivMLmods). [others]
- XenArcAI/MathX-5M
- nvidia/OpenCodeReasoning
- nvidia/OpenScience
- OpenMathReasoning

> [!note]
> The collection also includes custom modular data from prithivMLmods, blended to ensure quality, balance, and diversity.

## Applications

Open-Omega-Forge-1M is suited for:

- Training and evaluation of large language models (LLMs) in STEM fields
- Benchmarking math and multi-step reasoning models
- Research in mathematical problem solving, algorithmic code understanding, and scientific reasoning

---

## Citation

If you use this dataset, please cite:

```text
Open-Omega-Forge-1M by prithivMLmods

Derived and curated from:
- Curated and blended modular dataset from prithivMLmods. [others]
- XenArcAI/MathX-5M
- nvidia/OpenCodeReasoning
- nvidia/OpenScience
- OpenMathReasoning
```

## License

This dataset is distributed under the Apache-2.0 License. Please review the license terms of all referenced underlying datasets before use.
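
The response format described in the "Response format" section wraps the reasoning trace in `<think>` tags and places the answer after the closing tag. As a minimal sketch, assuming the `solution` column follows that layout literally, the two parts could be separated as shown below. The helper name `split_solution` and the `ParsedSolution` type are hypothetical and not part of the dataset or of the `datasets` library, and the sample string is made up for illustration.

```python
import re
from typing import NamedTuple, Optional

# Matches the first <think>...</think> block; DOTALL lets the trace span lines.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


class ParsedSolution(NamedTuple):
    reasoning: Optional[str]  # None when no <think> block is found
    answer: str


def split_solution(solution: str) -> ParsedSolution:
    """Split a solution string into its reasoning trace and final answer.

    Hypothetical helper: assumes the layout shown in the dataset card,
    i.e. a <think> block followed by the answer text.
    """
    match = _THINK_RE.search(solution)
    if match is None:
        # No reasoning block: treat the whole string as the answer.
        return ParsedSolution(None, solution.strip())
    reasoning = match.group(1).strip()
    answer = solution[match.end():].strip()
    return ParsedSolution(reasoning, answer)


# Made-up sample in the documented format (not taken from the dataset).
example = "<think>\n2 + 2 equals 4.\n</think>\nThe answer is 4."
parsed = split_solution(example)
print(parsed.reasoning)
print(parsed.answer)
```

Applied to a loaded row, this would be called as `split_solution(row["solution"])`. Rows that do not contain a `<think>` block fall back to returning the full text as the answer, which makes it easy to count how many entries deviate from the documented format.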

Provider: maas
Created: 2025-07-16