Kassadin88/GLM-5.1-1000000x

Name: Kassadin88/GLM-5.1-1000000x
Creator: Kassadin88
Published: 2026-04-17 05:25:15
License: No description provided

Hugging Face · Updated 2026-04-17 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/Kassadin88/GLM-5.1-1000000x

Description:

---
license: apache-2.0
language:
- en
- zh
size_categories:
- n>1M
task_categories:
- text-generation
- question-answering
tags:
- reasoning
- chain-of-thought
- instruction-tuning
- sft
- distillation
- glm
- glm-5.1
configs:
- config_name: main
  data_files:
  - split: train
    path: "main.jsonl"
- config_name: PHD-Science
  data_files:
  - split: train
    path: "PHD-Science.jsonl"
- config_name: Multilingual-STEM
  data_files:
  - split: train
    path: "Multilingual-STEM.jsonl"
- config_name: Math
  data_files:
  - split: train
    path: "Math.jsonl"
---

<div align="center">
<img src="https://raw.githubusercontent.com/zai-org/GLM-5/refs/heads/main/resources/logo.svg" width="15%" />
</div>

# GLM-5.1-1000000x

**1,003,589** reasoning traces distilled by **GLM-5.1**, using questions from [KIMI-K2.5-1000000x](https://huggingface.co/datasets/ianncity/KIMI-K2.5-1000000x). Each entry contains a full chain-of-thought reasoning trace followed by the final answer, generated by GLM-5.1.

> **Complete!** All 1,003,589 prompts distilled successfully.
>
> ████████████████████████████████ 100%

---

## Data Distribution

| Subset | Count | Proportion | Est. Tokens | Domain |
|--------|------:|:----------:|:-----------:|--------|
| main | 598,366 | 59.6% | ~3.04B | General reasoning & instruction-following |
| Math | 208,426 | 20.8% | ~1.30B | Mathematics |
| PHD-Science | 103,759 | 10.3% | ~0.56B | Graduate-level Physics, Chemistry, Biology |
| Multilingual-STEM | 93,038 | 9.3% | ~0.46B | STEM in Chinese, English & other languages |
| **Total** | **1,003,589** | **100%** | **~5.36B** | |

## Dataset Statistics

| Metric | Value |
|--------|-------|
| Total Records | 1,003,589 |
| Total Estimated Tokens | ~5.36B |
| Avg. Tokens per Record | ~5,338 |

## How to Use

```python
from datasets import load_dataset

# Load a specific subset
main = load_dataset("Kassadin88/GLM-5.1-1000000x", "main")
science = load_dataset("Kassadin88/GLM-5.1-1000000x", "PHD-Science")
stem = load_dataset("Kassadin88/GLM-5.1-1000000x", "Multilingual-STEM")
math = load_dataset("Kassadin88/GLM-5.1-1000000x", "Math")
```

Each record is a chat-formatted conversation with a chain-of-thought reasoning trace:

```json
{
  "messages": [
    {"role": "user", "content": "Beaches and deserts collect large deposits of what? ..."},
    {"role": "assistant", "content": "<think>\n1. Analyze the question...\n2. Reasoning step...\n</think>\nSand"}
  ],
  "_id": "main_00000007"
}
```

- `messages`: user question + assistant response with CoT trace and final answer
- `_id`: `{category}_{serial}` (e.g. `Math_00038225`, `PHD-Science_00010138`)

## License

Apache 2.0

## Citation

```bibtex
@misc{glm51-1000000x,
  title={GLM-5.1-1000000x: One Million Reasoning Traces Distilled from GLM-5.1},
  author={Kassadin88},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/Kassadin88/GLM-5.1-1000000x}
}
```

## Acknowledgments

- Prompt source: [KIMI-K2.5-1000000x](https://huggingface.co/datasets/ianncity/KIMI-K2.5-1000000x)
- Teacher model: [GLM-5.1](https://huggingface.co/zai-org/GLM-5.1)
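
## Example: Unpacking a Record

The record layout described under How to Use (a `messages` list plus an `_id` of the form `{category}_{serial}`) can be unpacked with a few lines of Python. The sketch below is illustrative only and is not part of the dataset's tooling. It assumes, based solely on the sample record, that each conversation is single-turn and that the reasoning trace sits in one `<think>...</think>` block ahead of the final answer. The names `ParsedRecord` and `parse_record` are hypothetical helpers introduced here.

```python
import re
from typing import NamedTuple


class ParsedRecord(NamedTuple):
    """Hypothetical container for the fields pulled out of one record."""
    category: str
    serial: int
    question: str
    reasoning: str
    answer: str


# Assumption taken from the sample record: one <think>...</think> block,
# followed by the final answer, inside the assistant message.
THINK_RE = re.compile(r"<think>(.*?)</think>(.*)", re.DOTALL)


def parse_record(record: dict) -> ParsedRecord:
    # Assumes a single-turn conversation: one user and one assistant message.
    by_role = {m["role"]: m["content"] for m in record["messages"]}
    content = by_role["assistant"]

    match = THINK_RE.search(content)
    if match:
        reasoning, answer = match.group(1).strip(), match.group(2).strip()
    else:
        # No trace markers found: treat the whole message as the answer.
        reasoning, answer = "", content.strip()

    # Split on the LAST underscore, since category names such as
    # "PHD-Science" contain hyphens but the serial is always the final part.
    category, _, serial = record["_id"].rpartition("_")
    return ParsedRecord(category, int(serial), by_role["user"], reasoning, answer)


# The sample record from the dataset card, copied verbatim.
sample = {
    "messages": [
        {"role": "user", "content": "Beaches and deserts collect large deposits of what? ..."},
        {"role": "assistant", "content": "<think>\n1. Analyze the question...\n2. Reasoning step...\n</think>\nSand"},
    ],
    "_id": "main_00000007",
}

parsed = parse_record(sample)
print(parsed.category, parsed.serial, parsed.answer)  # main 7 Sand
```

Separating the trace from the answer this way is useful when you want to train on, filter by, or measure the two parts independently. If real records turn out to include multi-turn conversations or several `<think>` blocks, the helper would need to be adjusted.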

Provider:

Kassadin88
