Kassadin88/GLM-5.1-1000000x
Hugging Face. Updated 2026-04-17; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Kassadin88/GLM-5.1-1000000x
Resource description:
---
license: apache-2.0
language:
- en
- zh
size_categories:
- 1M<n<10M
task_categories:
- text-generation
- question-answering
tags:
- reasoning
- chain-of-thought
- instruction-tuning
- sft
- distillation
- glm
- glm-5.1
configs:
- config_name: main
  data_files:
  - split: train
    path: "main.jsonl"
- config_name: PHD-Science
  data_files:
  - split: train
    path: "PHD-Science.jsonl"
- config_name: Multilingual-STEM
  data_files:
  - split: train
    path: "Multilingual-STEM.jsonl"
- config_name: Math
  data_files:
  - split: train
    path: "Math.jsonl"
---
<div align="center">
<img src="https://raw.githubusercontent.com/zai-org/GLM-5/refs/heads/main/resources/logo.svg" width="15%" />
</div>
# GLM-5.1-1000000x
**1,003,589** reasoning traces distilled from **GLM-5.1**, using the questions from [KIMI-K2.5-1000000x](https://huggingface.co/datasets/ianncity/KIMI-K2.5-1000000x).
Each entry contains a full chain-of-thought reasoning trace followed by the final answer.
> **Complete!** All 1,003,589 prompts distilled successfully.
>
> ████████████████████████████████ 100%
---
## Data Distribution
| Subset | Count | Proportion | Est. Tokens | Domain |
|--------|------:|:----------:|:-----------:|--------|
| main | 598,366 | 59.6% | ~3.04B | General reasoning & instruction-following |
| Math | 208,426 | 20.8% | ~1.30B | Mathematics |
| PHD-Science | 103,759 | 10.3% | ~0.56B | Graduate-level Physics, Chemistry, Biology |
| Multilingual-STEM | 93,038 | 9.3% | ~0.46B | STEM in Chinese, English & other languages |
| **Total** | **1,003,589** | **100%** | **~5.36B** | |
## Dataset Statistics
| Metric | Value |
|--------|-------|
| Total Records | 1,003,589 |
| Total Estimated Tokens | ~5.36B |
| Avg. Tokens per Record | ~5,338 |
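The figures in the two tables above can be cross-checked with a few lines of arithmetic. The sketch below copies the counts and rounded token estimates from the Data Distribution table and recomputes the totals, proportions, and average; the variable names are illustrative only. Because the token estimates are rounded to two decimals, the recomputed average comes out near 5,341 rather than the reported ~5,338, which was presumably derived from unrounded totals.

```python
# Counts and (rounded) token estimates copied from the Data Distribution table.
subsets = {
    "main": (598_366, 3.04e9),
    "Math": (208_426, 1.30e9),
    "PHD-Science": (103_759, 0.56e9),
    "Multilingual-STEM": (93_038, 0.46e9),
}

# Totals across all four subsets.
total_records = sum(count for count, _ in subsets.values())
total_tokens = sum(tokens for _, tokens in subsets.values())

# Share of records per subset, as a percentage rounded to one decimal.
proportions = {
    name: round(100 * count / total_records, 1)
    for name, (count, _) in subsets.items()
}

# Average tokens per record, based on the rounded token estimates.
avg_tokens = total_tokens / total_records

print(total_records, proportions, round(avg_tokens))
```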
## How to Use
```python
from datasets import load_dataset

# Each subset is a separate config with a single "train" split.
repo = "Kassadin88/GLM-5.1-1000000x"
main = load_dataset(repo, "main")
science = load_dataset(repo, "PHD-Science")
stem = load_dataset(repo, "Multilingual-STEM")
math_ds = load_dataset(repo, "Math")  # named math_ds to avoid shadowing the stdlib `math` module
```
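Given the size of the subsets, it may be more practical to stream records than to download a whole file first. The following is a minimal sketch, assuming the `streaming=True` option of `load_dataset` works with this repository's JSONL files; `stream_subset` and `first_n` are hypothetical helper names, not part of the dataset or the `datasets` library.

```python
from itertools import islice

REPO_ID = "Kassadin88/GLM-5.1-1000000x"


def stream_subset(config_name):
    """Return a lazily read iterable over the train split of one subset."""
    # Imported inside the function so first_n() is usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split="train", streaming=True)


def first_n(records, n):
    """Collect the first n items of any iterable into a list."""
    return list(islice(records, n))
```

For example, `first_n(stream_subset("Math"), 3)` would fetch three records from the Math subset; this call needs network access.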
Each record is a chat-formatted conversation with a chain-of-thought reasoning trace:
```json
{
"messages": [
{"role": "user", "content": "Beaches and deserts collect large deposits of what? ..."},
{"role": "assistant", "content": "<think>\n1. Analyze the question...\n2. Reasoning step...\n</think>\nSand"}
],
"_id": "main_00000007"
}
```
- `messages`: user question + assistant response with CoT trace and final answer
- `_id`: `{category}_{serial}` (e.g. `Math_00038225`, `PHD-Science_00010138`)
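For training pipelines that treat the reasoning trace and the final answer separately, the fields described above can be split as follows. This is a sketch, assuming every assistant message carries a single `<think>...</think>` block before the answer, as in the sample record; `split_response` and `parse_id` are hypothetical helpers, and the record below is a shortened copy of the sample.

```python
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def split_response(content):
    """Split an assistant message into (reasoning_trace, final_answer).

    Assumes at most one <think>...</think> block; if none is found,
    the trace is empty and the whole content is treated as the answer.
    """
    start = content.find(THINK_OPEN)
    end = content.find(THINK_CLOSE)
    if start == -1 or end == -1 or end < start:
        return "", content.strip()
    trace = content[start + len(THINK_OPEN):end].strip()
    answer = content[end + len(THINK_CLOSE):].strip()
    return trace, answer


def parse_id(record_id):
    """Split an `_id` such as 'PHD-Science_00010138' into (category, serial)."""
    # rsplit on the last underscore so hyphenated categories stay intact.
    category, serial = record_id.rsplit("_", 1)
    return category, int(serial)


record = {
    "messages": [
        {"role": "user", "content": "Beaches and deserts collect large deposits of what? ..."},
        {"role": "assistant", "content": "<think>\n1. Analyze the question...\n</think>\nSand"},
    ],
    "_id": "main_00000007",
}

trace, answer = split_response(record["messages"][-1]["content"])
category, serial = parse_id(record["_id"])
print(category, serial, answer)
```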
## License
Apache 2.0
## Citation
```bibtex
@misc{glm51-1000000x,
title={GLM-5.1-1000000x: One Million Reasoning Traces Distilled from GLM-5.1},
author={Kassadin88},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/datasets/Kassadin88/GLM-5.1-1000000x}
}
```
## Acknowledgments
- Prompt source: [KIMI-K2.5-1000000x](https://huggingface.co/datasets/ianncity/KIMI-K2.5-1000000x)
- Teacher model: [GLM-5.1](https://huggingface.co/zai-org/GLM-5.1)
Provider:
Kassadin88



