GLM-5.1-Reasoning-1M-Cleaned

Name: GLM-5.1-Reasoning-1M-Cleaned
Creator: maas
Published: 2026-05-21 11:30:46
License: not specified

ModelScope community. Updated 2026-05-21. Indexed 2026-05-03.

Download link:

https://modelscope.cn/datasets/Jackrong/GLM-5.1-Reasoning-1M-Cleaned


Resource description:

# GLM-5.1-Reasoning-1M-Cleaned

![GLM-5.1](https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/Bn6WT4WoRayEe8l-D-_TL.jpeg)

**GLM-5.1-Reasoning-1M-Cleaned** is a cleaned and reformatted derivative of [Kassadin88/GLM-5.1-1000000x](https://huggingface.co/datasets/Kassadin88/GLM-5.1-1000000x). It preserves the original four-subset layout (`main`, `PHD-Science`, `Multilingual-STEM`, `Math`) while converting every example into a unified SFT-ready schema with explicit `conversations`, `input`, `output`, `domain`, and `meta` fields.

This release was prepared from the original dataset published by **Kassadin88**.

## Summary

![bench_51](https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/GtLN3BAse6QsnpvZfBdHv.png)

- Teacher model in the data: **GLM-5.1**
- Total processed records: **766,535**
- Total kept records: **746,321**
- Total removed records: **20,214**
- Subset layout preserved exactly as the source dataset: `main`, `PHD-Science`, `Multilingual-STEM`, `Math`

## Included Content

- `main`: general reasoning and instruction-following data.
- `PHD-Science`: graduate-level physics, chemistry, and biology reasoning traces.
- `Multilingual-STEM`: multilingual STEM reasoning data, including Chinese, English, and other languages present in the source release.
- `Math`: mathematics-heavy reasoning and proof-style responses.

## Cleaning and Reformatting

The raw source dataset mixed two answer layouts:

1. Standard `<think>...</think>` reasoning tags.
2. A non-standard short-dash wrapper around the reasoning section.

This cleaned release normalizes both styles into a single output format:

```json
{
  "id": "md5-hash-of-domain-input-reasoning-answer",
  "conversations": [
    {"from": "human", "value": "user prompt"},
    {"from": "gpt", "value": "<think>\nreasoning trace\n</think>\n\nfinal answer"}
  ],
  "input": "user prompt",
  "output": "<think>\nreasoning trace\n</think>\n\nfinal answer",
  "domain": "subset/domain name from the original _id prefix",
  "meta": {
    "input_tokens": 123,
    "output_tokens": 456,
    "teacher_model": "GLM-5.1"
  }
}
```

### Removed data

The cleaning pipeline removed records with:

- incomplete or obviously truncated answers,
- repeated reasoning paragraphs or duplicated answer segments,
- refusal-style answers,
- unparseable reasoning/answer boundaries,
- exact duplicate records after normalization.

## Subset Statistics

| Subset | Processed | Kept | Removed | File Size | Median Input Tokens | Median Output Tokens |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| main | 547,292 | 527,737 | 19,555 | 18.28 GB | 50 | 3008 |
| PHD-Science | 103,759 | 103,706 | 53 | 3.72 GB | 45 | 3387 |
| Multilingual-STEM | 93,032 | 92,781 | 251 | 3.53 GB | 66 | 4084 |
| Math | 22,452 | 22,097 | 355 | 4.03 GB | 59 | 24498 |

## Filter Statistics

| Subset | Issue | Removed |
| --- | --- | ---: |
| main | incomplete_output | 12,726 |
| main | repeated_paragraph | 6,152 |
| main | refusal_answer | 638 |
| main | unparseable_output | 38 |
| main | duplicate_record | 1 |
| PHD-Science | incomplete_output | 33 |
| PHD-Science | repeated_paragraph | 19 |
| PHD-Science | refusal_answer | 1 |
| Multilingual-STEM | repeated_paragraph | 116 |
| Multilingual-STEM | incomplete_output | 88 |
| Multilingual-STEM | refusal_answer | 28 |
| Multilingual-STEM | unparseable_output | 19 |
| Math | repeated_paragraph | 303 |
| Math | incomplete_output | 26 |
| Math | unparseable_output | 26 |

## Additional Token Statistics

| Subset | Mean Input Tokens | P95 Input Tokens | Mean Output Tokens | P95 Output Tokens |
| --- | ---: | ---: | ---: | ---: |
| main | 118.62 | 515 | 4482.35 | 15041 |
| PHD-Science | 45.05 | 56 | 4387.73 | 10447 |
| Multilingual-STEM | 79.91 | 192 | 5461.45 | 11034 |
| Math | 62.01 | 89 | 28133.4 | 64633 |

## Data Structure

Each example is a single-turn reasoning distillation sample:

- `conversations[0]`: the user prompt.
- `conversations[1]`: the model response with reasoning wrapped in `<think>...</think>` and the final answer after the closing tag.
- `input`: prompt-only view for training pipelines that prefer flat prompt fields.
- `output`: tagged answer-only view for training pipelines that prefer flat completion fields.
- `domain`: original subset/domain name extracted from the source record ID.
- `meta`: lightweight per-example metadata.

## Usage

```python
from datasets import load_dataset

main = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "main")
science = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "PHD-Science")
stem = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "Multilingual-STEM")
math = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "Math")
```

If you publish under a different namespace, replace `Jackrong/` with your actual Hugging Face username or org.

## Provenance

This dataset is derived from:

- Original dataset: [`Kassadin88/GLM-5.1-1000000x`](https://huggingface.co/datasets/Kassadin88/GLM-5.1-1000000x)
- Original author: **Kassadin88**

## Citation

Please cite the original dataset first:

```bibtex
@misc{glm51-1000000x,
  title={GLM-5.1-1000000x: One Million Reasoning Traces Distilled from GLM-5.1},
  author={Kassadin88},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/Kassadin88/GLM-5.1-1000000x}
}
```

You can additionally cite this cleaned derivative release as:

```bibtex
@misc{glm51_reasoning_1m_cleaned,
  title={GLM-5.1-Reasoning-1M-Cleaned},
  author={Jackrong},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/Jackrong/GLM5.1-Reasoning-1M-Cleaned}
}
```
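## Example: splitting `output` into reasoning and answer

The Data Structure section states that the `output` field holds the reasoning inside `<think>...</think>` with the final answer after the closing tag. The sketch below shows one way a training pipeline might separate the two parts. It is not part of the dataset's own tooling: the helper name `split_output` and the regular expression are assumptions based only on the schema example in this card, and real records may need more lenient handling.

```python
import re

# Assumed layout, taken from the schema example in this card:
#   "<think>\n<reasoning>\n</think>\n\n<final answer>"
# The pattern tolerates extra whitespace around the tags.
THINK_RE = re.compile(r"^<think>\s*(.*?)\s*</think>\s*(.*)$", re.DOTALL)


def split_output(output: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a tagged `output` string.

    `split_output` is a hypothetical helper, not an API of the dataset.
    """
    match = THINK_RE.match(output)
    if match is None:
        raise ValueError("output does not start with a <think>...</think> block")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    example = "<think>\nreasoning trace\n</think>\n\nfinal answer"
    reasoning, answer = split_output(example)
    print(reasoning)  # reasoning trace
    print(answer)     # final answer
```

Raising on a missing tag pair, rather than returning an empty reasoning string, keeps malformed records visible instead of silently training on them.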


Provider: maas

Created: 2026-04-18
