Poseidon-Reasoning-5M
收藏魔搭社区2025-12-04 更新2025-07-26 收录
下载链接:
https://modelscope.cn/datasets/prithivMLmods/Poseidon-Reasoning-5M
下载链接
链接失效反馈官方服务:
资源简介:

# **Poseidon-Reasoning-5M**
> Poseidon-Reasoning-5M is a high-quality, compact reasoning dataset curated for advanced applications in **mathematics**, **coding**, and **science**. The dataset distinctly emphasizes mathematical and general reasoning challenges, ensuring its suitability for large language model (LLM) research, benchmarking, and STEM-focused educational tools.
---
## Quick Start with Hugging Face Datasets🤗
```py
pip install -U datasets
```
```py
from datasets import load_dataset
dataset = load_dataset("prithivMLmods/Poseidon-Reasoning-5M", split="data")
```
---
## Overview
- **Dataset Name:** Poseidon-Reasoning-5M
- **Curated by:** prithivMLmods
- **Size:** ~5 million entries (approx. 2.3GB in first 5GB split)
- **Formats:** `.arrow`, Parquet (70GB)
- **Languages:** English
- **License:** Apache-2.0
## Key Features
- **High-Quality & Compact:** Carefully selected and concise, focusing on clear, multistep problems with rigorous solutions.
- **Multi-domain:** Integrates mathematics, coding, and science, with a strong bias toward mathematical reasoning and general step-by-step thought.
- **Optimized Sampling:** Includes both custom, modular problems and rigorously filtered slices from state-of-the-art external datasets.
- **Reasoning Depth:** Contains stepwise, logic-driven solutions suitable for model training, evaluation, and academic exploration.
## Dataset Structure
Each record consists of:
- **problem:** The reasoning or problem statement, typically in STEM formats.
- **solution:** A detailed, step-by-step answer or explanation.
**Schema Example:**
| Column | Type | Description |
|----------|--------|---------------------------|
| problem | string | Problem/task statement |
| solution | string | Reasoned, stepwise answer |
---
## Data Sources
Poseidon-Reasoning-5M is an expertly curated and optimized blend of the following major sources:
- glaiveai/reasoning-v1-20m
- prithivMLmods/Open-Omega-Explora-2.5M
- Additional custom modular problems contributed by prithivMLmods
All sources were selected for quality, reasoning rigor, and task diversity. The dataset was further refined to maximize clarity, difficulty balance, and utility for diverse AI applications.
## Applications
Poseidon-Reasoning-5M is ideal for:
- Training and evaluating LLMs on complex, multi-step STEM reasoning
- Benchmarking mathematical, coding, and scientific reasoning capacity
- Research into step-by-step problem solving, algorithmic logic, and analytical skill assessment
- Supporting next-generation STEM education tools and challenge platforms
---
## Citation
If you use this dataset, please cite:
```
Poseidon-Reasoning-5M by prithivMLmods
Derived and curated from:
- Custom modular contributions by prithivMLmods
- glaiveai/reasoning-v1-20m
- prithivMLmods/Open-Omega-Explora-2.5M
```
## License
This dataset is licensed under Apache-2.0. Please consult the license terms of all referenced datasets for additional requirements or attributions.

# **Poseidon-Reasoning-5M**
> Poseidon-Reasoning-5M 是一款高质量、轻量化的推理数据集,专为数学、编程与科学领域的高级应用精心打造。本数据集着重聚焦数学与通用推理挑战,适配大语言模型(LLM)研究、基准测试以及面向STEM(科学、技术、工程、数学)的教育工具开发。
---
## 快速上手(Hugging Face Datasets🤗)
py
pip install -U datasets
py
from datasets import load_dataset
dataset = load_dataset("prithivMLmods/Poseidon-Reasoning-5M", split="data")
---
## 数据集概览
- **数据集名称:** Poseidon-Reasoning-5M
- **整理制作方:** prithivMLmods
- **数据规模:** 约500万条数据(前5GB分块中约占2.3GB)
- **数据格式:** .arrow、Parquet(总大小约70GB)
- **使用语言:** 英语
- **授权协议:** Apache-2.0
## 核心特性
- **高质量且轻量化:** 经过精心筛选与精简,聚焦逻辑清晰的多步问题及严谨的解题方案。
- **多领域覆盖:** 涵盖数学、编程与科学领域,重点倾斜于数学推理与通用分步思考类任务。
- **优化采样策略:** 既包含自定义模块化问题,也从前沿外部数据集中经严格筛选抽取子集。
- **推理深度充足:** 包含分步式、逻辑驱动的解题方案,适配模型训练、评估与学术研究探索。
## 数据集结构
每条数据记录包含以下字段:
- **problem(问题):** 推理或任务描述,通常采用STEM领域标准格式。
- **solution(解答):** 详细的分步式答案或解释说明。
**Schema示例:**
| 字段名 | 数据类型 | 说明 |
|----------|--------|---------------------------|
| problem | 字符串 | 问题/任务描述 |
| solution | 字符串 | 逻辑严谨的分步式解答 |
---
## 数据来源
Poseidon-Reasoning-5M 是由以下主流数据集经专业整理与优化融合而成:
- glaiveai/reasoning-v1-20m
- prithivMLmods/Open-Omega-Explora-2.5M
- 额外由prithivMLmods贡献的自定义模块化问题
所有源数据集均以质量、推理严谨性与任务多样性为筛选标准,本数据集还经过进一步优化,以最大化内容清晰度、难度平衡性以及对各类AI应用的实用价值。
## 应用场景
Poseidon-Reasoning-5M 适用于以下场景:
- 针对复杂多步STEM推理任务的大语言模型训练与评估
- 数学、编程与科学推理能力的基准测试
- 分步式问题求解、算法逻辑与分析能力评估相关研究
- 支撑下一代STEM教育工具与挑战类平台开发
---
## 引用说明
若您使用本数据集,请引用如下内容:
Poseidon-Reasoning-5M by prithivMLmods
Derived and curated from:
- Custom modular contributions by prithivMLmods
- glaiveai/reasoning-v1-20m
- prithivMLmods/Open-Omega-Explora-2.5M
## 授权协议
本数据集采用Apache-2.0开源协议。若需额外使用要求或署名说明,请查阅所有引用数据集的许可条款。
提供机构:
maas
创建时间:
2025-07-18



