
Open-Omega-Explora-2.5M

ModelScope community. Updated: 2025-12-03. Indexed: 2025-07-19.
Download link:
https://modelscope.cn/datasets/prithivMLmods/Open-Omega-Explora-2.5M
Resource description:
![1.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/OcYNWi_UuaHKEu7cUWObK.png)

# **Open-Omega-Explora-2.5M**

> Open-Omega-Explora-2.5M is a high-quality, large-scale reasoning dataset blending the strengths of both **Open-Omega-Forge-1M** and **Open-Omega-Atom-1.5M**. This unified dataset is crafted for advanced tasks in mathematics, coding, and science reasoning, featuring a robust majority of math-centric examples. Its construction ensures comprehensive coverage and balanced optimization for training, evaluation, and benchmarking in AI research, STEM education, and scientific toolchains.

> Mixture of Mathematics, Coding, and Science.

---

## Overview

- **Dataset Name:** Open-Omega-Explora-2.5M
- **Curated by:** prithivMLmods
- **Size:** ~2.63 million entries (2.18GB in first 5GB train split)
- **Formats:** `.arrow`, Parquet
- **Languages:** English
- **License:** Apache-2.0

## Highlights

- **Comprehensive Coverage:** Merges and enhances the content of both Forge and Atom datasets for superior STEM reasoning diversity and accuracy.
- **Reasoning-Focused:** Prioritizes multi-step, logic-driven problems and thorough explanations.
- **Optimization:** Integrates curated, modular, and open-source contributions for maximum dataset coherence and utility.
- **Emphasis on Math & Science:** High proportion of mathematical reasoning, algorithmic problem-solving, and scientific analysis examples.

---

## Quick Start with Hugging Face Datasets🤗

```bash
pip install -U datasets
```

```py
from datasets import load_dataset

dataset = load_dataset("prithivMLmods/Open-Omega-Explora-2.5M", split="train")
```

## Dataset Structure

Each entry includes:

- **problem:** The task or question, primarily in mathematics, science, or code.
- **solution:** A clear, stepwise, reasoned answer.
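
The two-field entry layout described above can be checked with a small helper before records are fed into a training or evaluation pipeline. This is a minimal sketch, not part of the dataset's tooling: the names `REQUIRED_FIELDS`, `is_valid_record`, and `take_valid` are hypothetical, and the only assumption about the data is the documented one, that each entry is a mapping with string `problem` and `solution` fields.

```python
from itertools import islice
from typing import Any, Iterable, Iterator

# Field names taken from the dataset card; everything else here is illustrative.
REQUIRED_FIELDS = ("problem", "solution")


def is_valid_record(record: dict[str, Any]) -> bool:
    """Return True if both documented fields are present as non-empty strings."""
    return all(
        isinstance(record.get(field), str) and bool(record[field].strip())
        for field in REQUIRED_FIELDS
    )


def take_valid(records: Iterable[dict[str, Any]], limit: int) -> Iterator[dict[str, Any]]:
    """Lazily yield at most `limit` records that pass `is_valid_record`."""
    return islice((r for r in records if is_valid_record(r)), limit)


if __name__ == "__main__":
    # Stand-in rows (hypothetical); real rows would come from load_dataset.
    sample = [
        {"problem": "What is 2 + 2?", "solution": "<think>2 + 2 = 4</think>4"},
        {"problem": "", "solution": "missing problem text"},
        {"problem": "No solution field"},
    ]
    for row in take_valid(sample, limit=10):
        print(row["problem"])
```

Because `take_valid` accepts any iterable of dictionaries, it should also work with a dataset opened via `load_dataset(..., split="train", streaming=True)`, which avoids downloading the full split just to inspect a few rows.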
### Schema Example

| Column   | Type   | Description                    |
|----------|--------|--------------------------------|
| problem  | string | Problem statement/task         |
| solution | string | Stepwise, thorough explanation |

## Response format:

```py
<think>
-- reasoning trace --
</think>
-- answer --
```

---

## Data Sources

Open-Omega-Explora-2.5M is a curated, optimized combination of the following core datasets:

- **Open-Omega-Forge-1M**
  - Sourced from:
    - XenArcAI/MathX-5M
    - nvidia/OpenCodeReasoning
    - nvidia/OpenScience
    - OpenMathReasoning & more
- **Open-Omega-Atom-1.5M**
  - Sourced from:
    - nvidia/OpenScience
    - nvidia/AceReason-1.1-SFT
    - nvidia/Nemotron-PrismMath & more
- Custom modular contributions by prithivMLmods

All foundations have been meticulously refined and unified for enhanced reasoning and STEM task coverage.

## Applications

- Training and evaluation of large language models (LLMs) on STEM, logic, and coding tasks
- Benchmark creation for advanced reasoning and problem-solving evaluation
- Research into mathematical, scientific, and code-based intelligence
- Supporting education tools in math, science, and programming

---

## Citation

If you use this dataset, please cite:

```
Open-Omega-Explora-2.5M by prithivMLmods

Derived and curated from:
- Custom modular contributions by prithivMLmods
- Open-Omega-Forge-1M (XenArcAI/MathX-5M, nvidia/OpenCodeReasoning, nvidia/OpenScience, OpenMathReasoning & more)
- Open-Omega-Atom-1.5M (nvidia/OpenScience, nvidia/AceReason-1.1-SFT, nvidia/Nemotron-PrismMath & more)
```

## License

This dataset is provided under the Apache-2.0 License. Ensure compliance with the license terms of all underlying referenced datasets.
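
The response format shown earlier in this card (a `<think>` reasoning trace followed by the answer) can be separated into its two parts with a short parser. The following is a sketch under stated assumptions: `ParsedSolution` and `split_solution` are hypothetical names, and the code assumes each `solution` string contains at most one `<think>...</think>` block with the answer following it. Entries that deviate from that layout are returned with an empty reasoning trace rather than raising an error.

```python
import re
from typing import NamedTuple


class ParsedSolution(NamedTuple):
    """Hypothetical container for the two parts of a `solution` string."""
    reasoning: str
    answer: str


# Non-greedy match so only the first <think>...</think> block is captured;
# DOTALL lets the reasoning trace span multiple lines.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_solution(text: str) -> ParsedSolution:
    """Split a solution into (reasoning trace, final answer).

    If no <think> block is found, the whole text is treated as the answer.
    """
    match = _THINK_RE.search(text)
    if match is None:
        return ParsedSolution(reasoning="", answer=text.strip())
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return ParsedSolution(reasoning=reasoning, answer=answer)


if __name__ == "__main__":
    # Made-up example string in the documented layout, not a real dataset row.
    example = "<think>\n2 + 2 = 4\n</think>\nThe answer is 4."
    parsed = split_solution(example)
    print(parsed.reasoning)
    print(parsed.answer)
```

Keeping the fallback path explicit makes it easy to count how many entries lack a reasoning trace before deciding whether to filter them out.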

Provider: maas
Created: 2025-07-16