ansulev/GLM-5.1-Reasoning-1M-Cleaned
Source: Hugging Face. Updated 2026-04-19; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/ansulev/GLM-5.1-Reasoning-1M-Cleaned
---
license: apache-2.0
language:
- en
- zh
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- question-answering
tags:
- reasoning
- chain-of-thought
- instruction-tuning
- sft
- distillation
- glm
- glm-5.1
- cleaned
configs:
- config_name: main
default: true
data_files:
- split: train
path: "main.jsonl"
- config_name: PHD-Science
data_files:
- split: train
path: "PHD-Science.jsonl"
- config_name: Multilingual-STEM
data_files:
- split: train
path: "Multilingual-STEM.jsonl"
- config_name: Math
data_files:
- split: train
path: "Math.jsonl"
---
# GLM-5.1-Reasoning-1M-Cleaned

**GLM-5.1-Reasoning-1M-Cleaned** is a cleaned and reformatted derivative of [Kassadin88/GLM-5.1-1000000x](https://huggingface.co/datasets/Kassadin88/GLM-5.1-1000000x). It preserves the original four-subset layout (`main`, `PHD-Science`, `Multilingual-STEM`, `Math`) while converting every example into a unified SFT-ready schema with explicit `conversations`, `input`, `output`, `domain`, and `meta` fields.
This release was prepared from the original dataset published by **Kassadin88**.
## Summary

- Teacher model in the data: **GLM-5.1**
- Total processed records: **766,535**
- Total kept records: **746,321**
- Total removed records: **20,214**
- Subset layout preserved exactly as in the source dataset: `main`, `PHD-Science`, `Multilingual-STEM`, `Math`
## Included Content
- `main`: general reasoning and instruction-following data.
- `PHD-Science`: graduate-level physics, chemistry, and biology reasoning traces.
- `Multilingual-STEM`: multilingual STEM reasoning data, including Chinese, English, and other languages present in the source release.
- `Math`: mathematics-heavy reasoning and proof-style responses.
## Cleaning and Reformatting
The raw source dataset mixed two answer layouts:
1. Standard `<think>...</think>` reasoning tags.
2. A non-standard short-dash wrapper around the reasoning section.
This cleaned release normalizes both styles into a single output format:
```json
{
  "id": "md5-hash-of-domain-input-reasoning-answer",
  "conversations": [
    {"from": "human", "value": "user prompt"},
    {"from": "gpt", "value": "<think>\nreasoning trace\n</think>\n\nfinal answer"}
  ],
  "input": "user prompt",
  "output": "<think>\nreasoning trace\n</think>\n\nfinal answer",
  "domain": "subset/domain name from the original _id prefix",
  "meta": {
    "input_tokens": 123,
    "output_tokens": 456,
    "teacher_model": "GLM-5.1"
  }
}
```
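Because every `output` follows the same `<think>...</think>` layout, the reasoning trace and the final answer can be separated with a small parser. The following is a minimal sketch, not part of the release: the helper name `split_output` and the regular expression are assumptions made for illustration.

```python
import re

# Matches "<think>reasoning</think>answer"; DOTALL lets "." span newlines.
# The non-greedy group stops at the first closing tag.
_THINK_RE = re.compile(r"\A\s*<think>(.*?)</think>(.*)\Z", re.DOTALL)


def split_output(output: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a tagged `output` string (hypothetical helper)."""
    match = _THINK_RE.match(output)
    if match is None:
        raise ValueError("output does not start with a <think>...</think> block")
    reasoning, answer = match.groups()
    return reasoning.strip(), answer.strip()


example = "<think>\nreasoning trace\n</think>\n\nfinal answer"
reasoning, answer = split_output(example)
```

Raising on a missing tag pair, rather than returning an empty reasoning string, keeps malformed records from slipping silently into a training set.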
### Removed data
The cleaning pipeline removed records with:
- incomplete or obviously truncated answers,
- repeated reasoning paragraphs or duplicated answer segments,
- refusal-style answers,
- unparseable reasoning/answer boundaries,
- exact duplicate records after normalization.
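The actual cleaning code is not included in this card. As a rough illustration of two of the checks listed above, the sketch below shows one plausible way to detect repeated paragraphs and exact duplicates. The function names, the `min_chars` threshold, and the way fields are joined before hashing are all assumptions; the real pipeline may differ on each point.

```python
import hashlib
from collections import Counter


def record_id(domain, input_text, reasoning, answer):
    # Assumption: the four fields are joined with newlines before hashing.
    payload = "\n".join([domain, input_text, reasoning, answer])
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def has_repeated_paragraph(text, min_chars=80):
    # Hypothetical heuristic: flag text where a sufficiently long paragraph
    # (blank-line separated) appears more than once.
    paragraphs = [p.strip() for p in text.split("\n\n")]
    counts = Counter(p for p in paragraphs if len(p) >= min_chars)
    return any(n > 1 for n in counts.values())


def drop_exact_duplicates(records):
    # Keep the first record seen for each id, preserving input order.
    seen = set()
    kept = []
    for rec in records:
        if rec["id"] not in seen:
            seen.add(rec["id"])
            kept.append(rec)
    return kept
```

The length threshold is there so that short lines which legitimately recur, such as "Therefore:" or a repeated equation label, do not trigger the filter.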
## Subset Statistics
| Subset | Processed | Kept | Removed | File Size | Median Input Tokens | Median Output Tokens |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| main | 547,292 | 527,737 | 19,555 | 18.28 GB | 50 | 3008 |
| PHD-Science | 103,759 | 103,706 | 53 | 3.72 GB | 45 | 3387 |
| Multilingual-STEM | 93,032 | 92,781 | 251 | 3.53 GB | 66 | 4084 |
| Math | 22,452 | 22,097 | 355 | 4.03 GB | 59 | 24498 |
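The per-subset counts can be cross-checked against the totals in the Summary with a few lines of arithmetic. The sketch below copies the Processed and Kept columns from the table; the function name `retention_report` is illustrative only.

```python
# (processed, kept) per subset, copied from the Subset Statistics table.
SUBSETS = {
    "main": (547_292, 527_737),
    "PHD-Science": (103_759, 103_706),
    "Multilingual-STEM": (93_032, 92_781),
    "Math": (22_452, 22_097),
}


def retention_report(subsets):
    """Return removed count and kept percentage for each subset."""
    report = {}
    for name, (processed, kept) in subsets.items():
        report[name] = {
            "removed": processed - kept,
            "kept_pct": round(100 * kept / processed, 2),
        }
    return report


report = retention_report(SUBSETS)
total_processed = sum(p for p, _ in SUBSETS.values())
total_kept = sum(k for _, k in SUBSETS.values())
total_removed = total_processed - total_kept
```

The totals come out to 766,535 processed, 746,321 kept, and 20,214 removed, matching the Summary. Nearly all removals come from `main`, which keeps about 96% of its records; the other three subsets each keep more than 98%.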
## Filter Statistics
| Subset | Issue | Removed |
| --- | --- | ---: |
| main | incomplete_output | 12,726 |
| main | repeated_paragraph | 6,152 |
| main | refusal_answer | 638 |
| main | unparseable_output | 38 |
| main | duplicate_record | 1 |
| PHD-Science | incomplete_output | 33 |
| PHD-Science | repeated_paragraph | 19 |
| PHD-Science | refusal_answer | 1 |
| Multilingual-STEM | repeated_paragraph | 116 |
| Multilingual-STEM | incomplete_output | 88 |
| Multilingual-STEM | refusal_answer | 28 |
| Multilingual-STEM | unparseable_output | 19 |
| Math | repeated_paragraph | 303 |
| Math | incomplete_output | 26 |
| Math | unparseable_output | 26 |
## Additional Token Statistics
| Subset | Mean Input Tokens | P95 Input Tokens | Mean Output Tokens | P95 Output Tokens |
| --- | ---: | ---: | ---: | ---: |
| main | 118.62 | 515 | 4482.35 | 15041 |
| PHD-Science | 45.05 | 56 | 4387.73 | 10447 |
| Multilingual-STEM | 79.91 | 192 | 5461.45 | 11034 |
| Math | 62.01 | 89 | 28133.4 | 64633 |
## Data Structure
Each example is a single-turn reasoning distillation sample:
- `conversations[0]`: the user prompt.
- `conversations[1]`: the model response with reasoning wrapped in `<think>...</think>` and the final answer after the closing tag.
- `input`: prompt-only view for training pipelines that prefer flat prompt fields.
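- `output`: flat completion view containing the full tagged response (the `<think>...</think>` reasoning followed by the final answer), for training pipelines that prefer flat completion fields.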
- `domain`: original subset/domain name extracted from the source record ID.
- `meta`: lightweight per-example metadata.
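Many chat-template tools expect `role`/`content` messages rather than `from`/`value` turns. The sketch below shows one way to convert a record, assuming the only speaker labels are `human` and `gpt` as in the schema example. `to_messages`, `is_consistent`, and `ROLE_MAP` are hypothetical names, and the consistency check assumes the flat fields mirror the two conversation turns.

```python
# Assumed mapping from the speaker labels in the schema example to chat roles.
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def to_messages(record):
    """Convert `conversations` into a list of role/content messages."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
    ]


def is_consistent(record):
    """Check that `input`/`output` mirror the two conversation turns."""
    turns = record["conversations"]
    return (
        len(turns) == 2
        and turns[0]["value"] == record["input"]
        and turns[1]["value"] == record["output"]
    )


record = {
    "conversations": [
        {"from": "human", "value": "user prompt"},
        {"from": "gpt", "value": "<think>\nreasoning trace\n</think>\n\nfinal answer"},
    ],
    "input": "user prompt",
    "output": "<think>\nreasoning trace\n</think>\n\nfinal answer",
}
messages = to_messages(record)
```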
## Usage
```python
from datasets import load_dataset

# Each call returns a DatasetDict with a single "train" split.
main = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "main")
science = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "PHD-Science")
stem = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "Multilingual-STEM")
# Named math_ds to avoid shadowing the standard-library math module.
math_ds = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "Math")
```
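If you load the dataset from a different namespace (this copy is listed as `ansulev/GLM-5.1-Reasoning-1M-Cleaned`), replace the repository ID, including the `Jackrong/` prefix, with the actual Hugging Face username or org and dataset name.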
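The subset files are large (about 18 GB for `main`), and `Math` responses have a median of roughly 24K output tokens, so it can help to stream records and filter by `meta.output_tokens` before training. This is a minimal sketch: `within_budget` and `stream_subset` are hypothetical helpers, and the 16,384-token budget is an arbitrary example value.

```python
def within_budget(records, max_output_tokens):
    """Yield only records whose response fits the given token budget."""
    for rec in records:
        if rec["meta"]["output_tokens"] <= max_output_tokens:
            yield rec


def stream_subset(repo_id, config):
    """Open one subset in streaming mode so it is not downloaded in full."""
    # Imported lazily so the filter above can be used without the library.
    from datasets import load_dataset

    return load_dataset(repo_id, config, split="train", streaming=True)


# In-memory demonstration of the filter (no network needed).
sample = [
    {"id": "a", "meta": {"output_tokens": 3_008}},
    {"id": "b", "meta": {"output_tokens": 24_498}},
]
short_ids = [rec["id"] for rec in within_budget(sample, 16_384)]

# Usage against the hosted dataset (requires network access):
# stream = stream_subset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "Math")
# for rec in within_budget(stream, 16_384):
#     print(rec["id"], rec["meta"]["output_tokens"])
#     break
```

The token counts in `meta` were produced by the cleaning pipeline, and the card does not state which tokenizer was used, so treat them as approximate when targeting a different model.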
## Provenance
This dataset is derived from:
- Original dataset: [`Kassadin88/GLM-5.1-1000000x`](https://huggingface.co/datasets/Kassadin88/GLM-5.1-1000000x)
- Original author: **Kassadin88**
## Citation
Please cite the original dataset first:
```bibtex
@misc{glm51-1000000x,
title={GLM-5.1-1000000x: One Million Reasoning Traces Distilled from GLM-5.1},
author={Kassadin88},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/datasets/Kassadin88/GLM-5.1-1000000x}
}
```
You can additionally cite this cleaned derivative release as:
```bibtex
@misc{glm51_reasoning_1m_cleaned,
title={GLM-5.1-Reasoning-1M-Cleaned},
author={Jackrong},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/datasets/Jackrong/GLM5.1-Reasoning-1M-Cleaned}
}
```
Provider: ansulev



