GLM-5.1-Reasoning-1M-Cleaned

ModelScope Community. Updated 2026-05-21, indexed 2026-05-03.
Download link:
https://modelscope.cn/datasets/Jackrong/GLM-5.1-Reasoning-1M-Cleaned
Resource description:
# GLM-5.1-Reasoning-1M-Cleaned

![GLM-5.1](https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/Bn6WT4WoRayEe8l-D-_TL.jpeg)

**GLM-5.1-Reasoning-1M-Cleaned** is a cleaned and reformatted derivative of [Kassadin88/GLM-5.1-1000000x](https://huggingface.co/datasets/Kassadin88/GLM-5.1-1000000x). It preserves the original four-subset layout (`main`, `PHD-Science`, `Multilingual-STEM`, `Math`) while converting every example into a unified SFT-ready schema with explicit `conversations`, `input`, `output`, `domain`, and `meta` fields.

This release was prepared from the original dataset published by **Kassadin88**.

## Summary

![bench_51](https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/GtLN3BAse6QsnpvZfBdHv.png)

- Teacher model in the data: **GLM-5.1**
- Total processed records: **766,535**
- Total kept records: **746,321**
- Total removed records: **20,214**
- Subset layout preserved exactly as the source dataset: `main`, `PHD-Science`, `Multilingual-STEM`, `Math`

## Included Content

- `main`: general reasoning and instruction-following data.
- `PHD-Science`: graduate-level physics, chemistry, and biology reasoning traces.
- `Multilingual-STEM`: multilingual STEM reasoning data, including Chinese, English, and other languages present in the source release.
- `Math`: mathematics-heavy reasoning and proof-style responses.

## Cleaning and Reformatting

The raw source dataset mixed two answer layouts:

1. Standard `<think>...</think>` reasoning tags.
2. A non-standard short-dash wrapper around the reasoning section.


This cleaned release normalizes both styles into a single output format:

```json
{
  "id": "md5-hash-of-domain-input-reasoning-answer",
  "conversations": [
    {"from": "human", "value": "user prompt"},
    {"from": "gpt", "value": "<think>\nreasoning trace\n</think>\n\nfinal answer"}
  ],
  "input": "user prompt",
  "output": "<think>\nreasoning trace\n</think>\n\nfinal answer",
  "domain": "subset/domain name from the original _id prefix",
  "meta": {
    "input_tokens": 123,
    "output_tokens": 456,
    "teacher_model": "GLM-5.1"
  }
}
```

### Removed data

The cleaning pipeline removed records with:

- incomplete or obviously truncated answers,
- repeated reasoning paragraphs or duplicated answer segments,
- refusal-style answers,
- unparseable reasoning/answer boundaries,
- exact duplicate records after normalization.

## Subset Statistics

| Subset | Processed | Kept | Removed | File Size | Median Input Tokens | Median Output Tokens |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| main | 547,292 | 527,737 | 19,555 | 18.28 GB | 50 | 3008 |
| PHD-Science | 103,759 | 103,706 | 53 | 3.72 GB | 45 | 3387 |
| Multilingual-STEM | 93,032 | 92,781 | 251 | 3.53 GB | 66 | 4084 |
| Math | 22,452 | 22,097 | 355 | 4.03 GB | 59 | 24498 |

## Filter Statistics

| Subset | Issue | Removed |
| --- | --- | ---: |
| main | incomplete_output | 12,726 |
| main | repeated_paragraph | 6,152 |
| main | refusal_answer | 638 |
| main | unparseable_output | 38 |
| main | duplicate_record | 1 |
| PHD-Science | incomplete_output | 33 |
| PHD-Science | repeated_paragraph | 19 |
| PHD-Science | refusal_answer | 1 |
| Multilingual-STEM | repeated_paragraph | 116 |
| Multilingual-STEM | incomplete_output | 88 |
| Multilingual-STEM | refusal_answer | 28 |
| Multilingual-STEM | unparseable_output | 19 |
| Math | repeated_paragraph | 303 |
| Math | incomplete_output | 26 |
| Math | unparseable_output | 26 |

## Additional Token Statistics

| Subset | Mean Input Tokens | P95 Input Tokens | Mean Output Tokens | P95 Output Tokens |
| --- | ---: | ---: | ---: | ---: |
| main | 118.62 | 515 | 4482.35 | 15041 |
| PHD-Science | 45.05 | 56 | 4387.73 | 10447 |
| Multilingual-STEM | 79.91 | 192 | 5461.45 | 11034 |
| Math | 62.01 | 89 | 28133.4 | 64633 |

## Data Structure

Each example is a single-turn reasoning distillation sample:

- `conversations[0]`: the user prompt.
- `conversations[1]`: the model response with reasoning wrapped in `<think>...</think>` and the final answer after the closing tag.
- `input`: prompt-only view for training pipelines that prefer flat prompt fields.
- `output`: tagged answer-only view for training pipelines that prefer flat completion fields.
- `domain`: original subset/domain name extracted from the source record ID.
- `meta`: lightweight per-example metadata.

## Usage

```python
from datasets import load_dataset

main = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "main")
science = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "PHD-Science")
stem = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "Multilingual-STEM")
math = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "Math")
```

If you publish under a different namespace, replace `Jackrong/` with your actual Hugging Face username or org.

## Provenance

This dataset is derived from:

- Original dataset: [`Kassadin88/GLM-5.1-1000000x`](https://huggingface.co/datasets/Kassadin88/GLM-5.1-1000000x)
- Original author: **Kassadin88**

## Citation

Please cite the original dataset first:

```bibtex
@misc{glm51-1000000x,
  title={GLM-5.1-1000000x: One Million Reasoning Traces Distilled from GLM-5.1},
  author={Kassadin88},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/Kassadin88/GLM-5.1-1000000x}
}
```

You can additionally cite this cleaned derivative release as:

```bibtex
@misc{glm51_reasoning_1m_cleaned,
  title={GLM-5.1-Reasoning-1M-Cleaned},
  author={Jackrong},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/Jackrong/GLM5.1-Reasoning-1M-Cleaned}
}
```
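
## Example: Splitting a Record

The Data Structure section describes `output` as a reasoning trace wrapped in `<think>...</think>` followed by the final answer. The sketch below shows one way a consumer might separate the two parts and sanity-check a record against the documented schema. It assumes every kept record follows the layout shown in this card; the helper names `split_output` and `is_consistent` and the sample record are illustrative only and are not part of the dataset or of any library.

```python
import re

# One <think>...</think> block at the start, then the final answer.
# The pattern is an assumption based on the schema shown in this card.
THINK_RE = re.compile(r"^<think>\n?(.*?)\n?</think>\s*(.*)$", re.DOTALL)


def split_output(output: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a tagged `output` string."""
    match = THINK_RE.match(output)
    if match is None:
        raise ValueError("output does not start with a <think>...</think> block")
    reasoning, answer = match.groups()
    return reasoning.strip(), answer.strip()


def is_consistent(record: dict) -> bool:
    """Check that the flat fields mirror the single-turn conversation."""
    turns = record["conversations"]
    return (
        len(turns) == 2
        and turns[0]["from"] == "human"
        and turns[1]["from"] == "gpt"
        and turns[0]["value"] == record["input"]
        and turns[1]["value"] == record["output"]
    )


# Hypothetical record built from the placeholder values in the schema above.
example = {
    "conversations": [
        {"from": "human", "value": "user prompt"},
        {"from": "gpt", "value": "<think>\nreasoning trace\n</think>\n\nfinal answer"},
    ],
    "input": "user prompt",
    "output": "<think>\nreasoning trace\n</think>\n\nfinal answer",
}

reasoning, answer = split_output(example["output"])
```

Raising on a missing tag, rather than returning an empty reasoning string, keeps malformed rows visible if the same helper is later applied to data from outside this release.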

Provider:
maas
Created:
2026-04-18