GLM-5.1-Reasoning-1M-Cleaned
ModelScope community · Updated 2026-05-21 · Indexed 2026-05-03
Download link:
https://modelscope.cn/datasets/Jackrong/GLM-5.1-Reasoning-1M-Cleaned
Resource description:
# GLM-5.1-Reasoning-1M-Cleaned

**GLM-5.1-Reasoning-1M-Cleaned** is a cleaned and reformatted derivative of [Kassadin88/GLM-5.1-1000000x](https://huggingface.co/datasets/Kassadin88/GLM-5.1-1000000x). It preserves the original four-subset layout (`main`, `PHD-Science`, `Multilingual-STEM`, `Math`) while converting every example into a unified SFT-ready schema with explicit `conversations`, `input`, `output`, `domain`, and `meta` fields.
This release was prepared from the original dataset published by **Kassadin88**.
## Summary

- Teacher model in the data: **GLM-5.1**
- Total processed records: **766,535**
- Total kept records: **746,321**
- Total removed records: **20,214**
- Subset layout preserved exactly as in the source dataset: `main`, `PHD-Science`, `Multilingual-STEM`, `Math`
## Included Content
- `main`: general reasoning and instruction-following data.
- `PHD-Science`: graduate-level physics, chemistry, and biology reasoning traces.
- `Multilingual-STEM`: multilingual STEM reasoning data, including Chinese, English, and other languages present in the source release.
- `Math`: mathematics-heavy reasoning and proof-style responses.
## Cleaning and Reformatting
The raw source dataset mixed two answer layouts:
1. Standard `<think>...</think>` reasoning tags.
2. A non-standard short-dash wrapper around the reasoning section.

This cleaned release normalizes both styles into a single output format:
```json
{
  "id": "md5-hash-of-domain-input-reasoning-answer",
  "conversations": [
    {"from": "human", "value": "user prompt"},
    {"from": "gpt", "value": "<think>\nreasoning trace\n</think>\n\nfinal answer"}
  ],
  "input": "user prompt",
  "output": "<think>\nreasoning trace\n</think>\n\nfinal answer",
  "domain": "subset/domain name from the original _id prefix",
  "meta": {
    "input_tokens": 123,
    "output_tokens": 456,
    "teacher_model": "GLM-5.1"
  }
}
```
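
Pipelines that need the reasoning trace and the final answer separately can split the `output` field on the `<think>` tags. The following is a minimal sketch, assuming every record follows the layout shown in the schema above; `split_output` is a hypothetical helper, not part of the dataset tooling.

```python
import re

# Assumed layout, taken from the schema above:
# "<think>\n...reasoning...\n</think>\n\nfinal answer"
_THINK_RE = re.compile(r"\A<think>\n?(.*?)\n?</think>\s*(.*)\Z", re.DOTALL)


def split_output(output: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a tagged `output` string."""
    match = _THINK_RE.match(output)
    if match is None:
        raise ValueError("output does not start with a <think>...</think> block")
    reasoning, answer = match.groups()
    return reasoning.strip(), answer.strip()


example = "<think>\nreasoning trace\n</think>\n\nfinal answer"
reasoning, answer = split_output(example)
print(reasoning)  # reasoning trace
print(answer)     # final answer
```

The lazy match stops at the first closing tag, so an answer that happens to mention `</think>` later on is still treated as answer text.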
### Removed data
The cleaning pipeline removed records with:
- incomplete or obviously truncated answers,
- repeated reasoning paragraphs or duplicated answer segments,
- refusal-style answers,
- unparseable reasoning/answer boundaries,
- exact duplicate records after normalization.
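
The last rule, together with the `id` field described as an MD5 hash of domain, input, reasoning, and answer, suggests content-hash deduplication. The sketch below illustrates that idea only. The field separator, the intermediate `reasoning` and `answer` keys, and both function names are assumptions made for illustration; the actual pipeline's serialization is not documented here.

```python
import hashlib


def record_id(domain: str, prompt: str, reasoning: str, answer: str) -> str:
    """Hash the four content fields into a hex digest (serialization is assumed)."""
    # Assumption: fields are joined with a unit-separator character so that
    # ("ab", "c") and ("a", "bc") cannot collide; the real pipeline may differ.
    payload = "\x1f".join([domain, prompt, reasoning, answer])
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def drop_exact_duplicates(records: list[dict]) -> list[dict]:
    """Keep the first record for each content hash, preserving input order."""
    seen: set[str] = set()
    kept = []
    for rec in records:
        rid = record_id(rec["domain"], rec["input"], rec["reasoning"], rec["answer"])
        if rid not in seen:
            seen.add(rid)
            kept.append({**rec, "id": rid})
    return kept


rows = [
    {"domain": "Math", "input": "1+1?", "reasoning": "add", "answer": "2"},
    {"domain": "Math", "input": "1+1?", "reasoning": "add", "answer": "2"},
    {"domain": "Math", "input": "1+1?", "reasoning": "sum", "answer": "2"},
]
print(len(drop_exact_duplicates(rows)))  # 2
```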
## Subset Statistics
| Subset | Processed | Kept | Removed | File Size | Median Input Tokens | Median Output Tokens |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| main | 547,292 | 527,737 | 19,555 | 18.28 GB | 50 | 3008 |
| PHD-Science | 103,759 | 103,706 | 53 | 3.72 GB | 45 | 3387 |
| Multilingual-STEM | 93,032 | 92,781 | 251 | 3.53 GB | 66 | 4084 |
| Math | 22,452 | 22,097 | 355 | 4.03 GB | 59 | 24498 |
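
As a consistency check, the per-subset counts can be summed and compared with the totals in the Summary section. The snippet below copies the numbers from the table; the variable names are illustrative.

```python
# (processed, kept, removed) per subset, copied from the table above.
SUBSETS = {
    "main": (547_292, 527_737, 19_555),
    "PHD-Science": (103_759, 103_706, 53),
    "Multilingual-STEM": (93_032, 92_781, 251),
    "Math": (22_452, 22_097, 355),
}

for name, (processed, kept, removed) in SUBSETS.items():
    # Each row should satisfy processed - kept == removed.
    assert processed - kept == removed, name
    print(f"{name}: {removed / processed:.2%} removed")

# Column-wise totals across the four subsets.
totals = tuple(sum(col) for col in zip(*SUBSETS.values()))
print(totals)  # (766535, 746321, 20214), matching the Summary section
```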
## Filter Statistics
| Subset | Issue | Removed |
| --- | --- | ---: |
| main | incomplete_output | 12,726 |
| main | repeated_paragraph | 6,152 |
| main | refusal_answer | 638 |
| main | unparseable_output | 38 |
| main | duplicate_record | 1 |
| PHD-Science | incomplete_output | 33 |
| PHD-Science | repeated_paragraph | 19 |
| PHD-Science | refusal_answer | 1 |
| Multilingual-STEM | repeated_paragraph | 116 |
| Multilingual-STEM | incomplete_output | 88 |
| Multilingual-STEM | refusal_answer | 28 |
| Multilingual-STEM | unparseable_output | 19 |
| Math | repeated_paragraph | 303 |
| Math | incomplete_output | 26 |
| Math | unparseable_output | 26 |
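
The same rows can be aggregated by issue type to see which filter accounts for most removals across the whole release. This is a small sketch over the numbers in the table above, not output from the cleaning pipeline itself.

```python
from collections import Counter

# (subset, issue, removed) rows copied from the table above.
FILTER_ROWS = [
    ("main", "incomplete_output", 12_726),
    ("main", "repeated_paragraph", 6_152),
    ("main", "refusal_answer", 638),
    ("main", "unparseable_output", 38),
    ("main", "duplicate_record", 1),
    ("PHD-Science", "incomplete_output", 33),
    ("PHD-Science", "repeated_paragraph", 19),
    ("PHD-Science", "refusal_answer", 1),
    ("Multilingual-STEM", "repeated_paragraph", 116),
    ("Multilingual-STEM", "incomplete_output", 88),
    ("Multilingual-STEM", "refusal_answer", 28),
    ("Multilingual-STEM", "unparseable_output", 19),
    ("Math", "repeated_paragraph", 303),
    ("Math", "incomplete_output", 26),
    ("Math", "unparseable_output", 26),
]

by_issue = Counter()
by_subset = Counter()
for subset, issue, removed in FILTER_ROWS:
    by_issue[issue] += removed
    by_subset[subset] += removed

for issue, count in by_issue.most_common():
    print(f"{issue}: {count}")
print(sum(by_issue.values()))  # 20214, the total removed count
```

The per-subset sums in `by_subset` reproduce the Removed column of the Subset Statistics table.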
## Additional Token Statistics
| Subset | Mean Input Tokens | P95 Input Tokens | Mean Output Tokens | P95 Output Tokens |
| --- | ---: | ---: | ---: | ---: |
| main | 118.62 | 515 | 4482.35 | 15041 |
| PHD-Science | 45.05 | 56 | 4387.73 | 10447 |
| Multilingual-STEM | 79.91 | 192 | 5461.45 | 11034 |
| Math | 62.01 | 89 | 28133.4 | 64633 |
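
Because output lengths vary widely between subsets (the `Math` P95 is far above the others), it may be useful to filter examples against a sequence-length budget before training. The sketch below uses the per-example counts in `meta`. The 16,384-token budget and the `fits_budget` helper are hypothetical, and the stored counts may not match your own tokenizer or include chat-template overhead.

```python
def fits_budget(example: dict, max_tokens: int) -> bool:
    """True if prompt plus response token counts fit within max_tokens."""
    meta = example["meta"]
    return meta["input_tokens"] + meta["output_tokens"] <= max_tokens


# Two illustrative records shaped like the schema's `meta` field.
samples = [
    {"meta": {"input_tokens": 50, "output_tokens": 3008, "teacher_model": "GLM-5.1"}},
    {"meta": {"input_tokens": 59, "output_tokens": 24498, "teacher_model": "GLM-5.1"}},
]

# Hypothetical budget; pick one that matches your training context length.
kept = [ex for ex in samples if fits_budget(ex, max_tokens=16_384)]
print(len(kept))  # 1
```

With the `datasets` library, the same predicate could be passed to `Dataset.filter`, for example via a lambda that fixes `max_tokens`.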
## Data Structure
Each example is a single-turn reasoning distillation sample:
- `conversations[0]`: the user prompt.
- `conversations[1]`: the model response with reasoning wrapped in `<think>...</think>` and the final answer after the closing tag.
- `input`: prompt-only view for training pipelines that prefer flat prompt fields.
- `output`: the full tagged response (`<think>` reasoning followed by the final answer) for training pipelines that prefer flat completion fields.
- `domain`: original subset/domain name extracted from the source record ID.
- `meta`: lightweight per-example metadata.
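
Some trainers expect chat data as `role`/`content` messages instead of the `from`/`value` pairs used here. The conversion is a simple key mapping, sketched below. The target format and the `to_messages` helper are assumptions about the downstream pipeline, not something the dataset ships.

```python
# Map the speaker labels used in `conversations` to common chat roles.
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def to_messages(example: dict) -> list[dict]:
    """Convert `conversations` turns into role/content message dicts."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in example["conversations"]
    ]


record = {
    "conversations": [
        {"from": "human", "value": "user prompt"},
        {"from": "gpt", "value": "<think>\nreasoning trace\n</think>\n\nfinal answer"},
    ]
}
messages = to_messages(record)
print([m["role"] for m in messages])  # ['user', 'assistant']
```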
## Usage
```python
from datasets import load_dataset

# Each subset is selected by passing its name as the configuration argument.
main = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "main")
science = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "PHD-Science")
stem = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "Multilingual-STEM")
# Named math_subset to avoid shadowing the standard-library `math` module.
math_subset = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "Math")
```
If you publish under a different namespace, replace `Jackrong/` with your actual Hugging Face username or org.
## Provenance
This dataset is derived from:
- Original dataset: [`Kassadin88/GLM-5.1-1000000x`](https://huggingface.co/datasets/Kassadin88/GLM-5.1-1000000x)
- Original author: **Kassadin88**
## Citation
Please cite the original dataset first:
```bibtex
@misc{glm51-1000000x,
title={GLM-5.1-1000000x: One Million Reasoning Traces Distilled from GLM-5.1},
author={Kassadin88},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/datasets/Kassadin88/GLM-5.1-1000000x}
}
```
You can additionally cite this cleaned derivative release as:
```bibtex
@misc{glm51_reasoning_1m_cleaned,
title={GLM-5.1-Reasoning-1M-Cleaned},
author={Jackrong},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/datasets/Jackrong/GLM5.1-Reasoning-1M-Cleaned}
}
```
Provider: maas
Created: 2026-04-18



