Atlas-Think-Cot-12M
ModelScope community · updated 2026-01-06 · indexed 2025-08-02
Download link:
https://modelscope.cn/datasets/prithivMLmods/Atlas-Think-Cot-12M

# **Atlas-Think-Cot-12M**
> Atlas-Think-Cot-12M is a large-scale, high-quality reasoning dataset curated for mathematical problem-solving, code generation, and scientific thinking. This dataset emphasizes step-by-step solutions and detailed reasoning, with a major share of mathematical problems guiding its structure and composition.
> Mixture of Mathematics, Coding, and Science. [ <:think>/cot ]
---
# Quick Start with Hugging Face Datasets🤗
```bash
pip install -U datasets
```
## 1. Load only the `Think` portion (approximately the first 6,186,198 rows)
```python
from datasets import load_dataset
# Load the full dataset
dataset = load_dataset("prithivMLmods/Atlas-Think-Cot-12M", split="train")
# Select first 6,186,198 rows for 'think'
think_dataset = dataset.select(range(6_186_198))
print(f"Think dataset size: {len(think_dataset)}")
```
## 2. Load only the `Cot` portion (approximately the remaining 6,164,630 rows)
```python
from datasets import load_dataset
# Load the full dataset
dataset = load_dataset("prithivMLmods/Atlas-Think-Cot-12M", split="train")
# Select the remaining rows for 'cot', starting at index 6_186_198
# (the 'think' rows occupy indices 0 through 6_186_197)
cot_dataset = dataset.select(range(6_186_198, len(dataset)))
print(f"Cot dataset size: {len(cot_dataset)}")
```
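Both snippets above materialize the full `train` split before selecting rows. For a quick look at a few rows, streaming may be lighter. The following is a minimal sketch, not the author's method: `peek_rows` and `stream_train_split` are hypothetical helper names, and it assumes the installed `datasets` version supports `streaming=True` for this repository.

```python
from itertools import islice

DATASET_ID = "prithivMLmods/Atlas-Think-Cot-12M"
THINK_ROWS = 6_186_198  # approximate Think/Cot boundary quoted in this card


def peek_rows(rows, start=0, count=3):
    """Return `count` rows beginning at index `start` from any iterable of rows."""
    return list(islice(rows, start, start + count))


def stream_train_split():
    """Open the train split lazily; needs `datasets` installed and network access."""
    # Imported here so that peek_rows stays usable without the library.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split="train", streaming=True)
```

Calling `peek_rows(stream_train_split(), count=3)` would yield the first three `Think` rows. Passing `start=THINK_ROWS` reaches the `Cot` portion, but a stream still has to read past every earlier row, so `dataset.select` is likely the better fit for that segment.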
---
## Overview
* **Total Samples**: \~12,350,828
* **Split**: `train` only
* **Languages**: English
* **Format**: Apache Arrow (auto-converted to Parquet)
* **License**: Apache-2.0
* **Tags**: `math`, `code`, `science`, `thinking`, `biology`, `chemistry`, `cot`
## Highlights
* Blend of **math**, **science**, and **code** problems, with math as the dominant component.
* Solutions follow a structured "Think step-by-step" format, enabling enhanced chain-of-thought learning.
* Suitable for training reasoning-capable large language models in STEM domains.
## Dataset Structure
Each row contains:
* **`problem`** (string): A question or task drawn from mathematics, computer science, natural sciences, or logical reasoning.
* **`solution`** (string): A thoughtful, explanatory solution beginning with `<think>`, modeling human-like reasoning chains.
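To work with the reasoning and the final answer separately, a `solution` string can be split on its `<think>` block. This is a minimal sketch under a loud assumption: the card only states that solutions begin with `<think>`, so the closing `</think>` tag and the helper name `split_solution` are hypothetical. If no closing tag is present, the whole remainder is treated as reasoning.

```python
import re

# Assumption: reasoning is wrapped as <think>...</think>; only the opening
# tag is confirmed by the dataset card.
THINK_BLOCK = re.compile(r"^\s*<think>(.*?)(?:</think>|\Z)", re.DOTALL)


def split_solution(solution):
    """Return (reasoning, answer) extracted from one `solution` string."""
    match = THINK_BLOCK.match(solution)
    if match is None:
        # No leading <think> tag: treat the whole text as the answer.
        return "", solution.strip()
    reasoning = match.group(1).strip()
    answer = solution[match.end():].strip()
    return reasoning, answer
```

For example, `split_solution(row["solution"])` on a loaded row returns a pair of strings, which makes it easy to measure reasoning length or to train on answers alone.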
## Data Composition
The dataset is conceptually divided into two segments:
* **Think**: approximately the first 6,186,198 rows.
* **Cot** (Chain of Thought): approximately the remaining 6,164,630 rows.
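The two segment sizes add up to the stated total (6,186,198 + 6,164,630 = 12,350,828), so the segments can be expressed as two adjacent index ranges. The sketch below shows one way to derive them so that every row lands in exactly one segment. The helper names are hypothetical, and the boundary is the approximate figure quoted in this card, not a value read from the dataset.

```python
TOTAL_ROWS = 12_350_828  # approximate total quoted in this card
THINK_ROWS = 6_186_198   # approximate size of the Think segment


def think_cot_ranges(total_rows=TOTAL_ROWS, think_rows=THINK_ROWS):
    """Return (think_range, cot_range) covering every row index exactly once."""
    if not 0 <= think_rows <= total_rows:
        raise ValueError("think_rows must lie between 0 and total_rows")
    return range(think_rows), range(think_rows, total_rows)


def split_think_cot(dataset, think_rows=THINK_ROWS):
    """Split a loaded `datasets.Dataset` into (think, cot) by row index."""
    think_idx, cot_idx = think_cot_ranges(len(dataset), think_rows)
    return dataset.select(think_idx), dataset.select(cot_idx)
```

Because the second range starts at `think_rows` itself, no row is skipped between the two segments.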
---
## Source & Derivation
This dataset is a curated and optimized blend of the following datasets and internal modular contributions:
* [Poseidon-Reasoning-5M](https://huggingface.co/datasets/prithivMLmods/Poseidon-Reasoning-5M)
* [Helios-R-6M](https://huggingface.co/datasets/prithivMLmods/Helios-R-6M)
* Other curated and blended modular datasets from prithivMLmods.
This version has been refined to improve multi-domain reasoning, especially in mathematical and scientific contexts.
## License
Apache License 2.0
Provider: maas
Created: 2025-07-26



