Kimi-K2.5-Reasoning-1M-Cleaned

Name: Kimi-K2.5-Reasoning-1M-Cleaned
Creator: maas
Published: 2026-05-11 17:42:18
License: No description available

ModelScope community | Updated 2026-05-11 | Indexed 2026-05-03

Download link:

https://modelscope.cn/datasets/Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned

Description:

# 🪐 Kimi-K2.5-Reasoning-1M-Cleaned

**Kimi-K2.5-Reasoning-1M-Cleaned** is a cleaned derivative of [ianncity/KIMI-K2.5-1000000x](https://huggingface.co/datasets/ianncity/KIMI-K2.5-1000000x). It preserves the original four-config layout from the source dataset and rewrites each record into a unified reasoning-SFT schema with `id`, `conversations`, `input`, `output`, `domain`, and `meta`.

![image](https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/b1-BDXcO8Fn58aqEbIsFb.png)

## Summary

- Source dataset: [`ianncity/KIMI-K2.5-1000000x`](https://huggingface.co/datasets/ianncity/KIMI-K2.5-1000000x)
- Source author: **ianncity**
- Teacher model recorded in `meta.teacher_model`: `KIMI-K2.5`
- Token lengths computed with tokenizer: `moonshotai/Kimi-K2.5`
- Total processed records: **1,003,589**
- Total kept records: **844,388**
- Total removed records: **159,201**
- Original source configs preserved: `General-Distillation`, `PHD-Science`, `General-Math`, `MultilingualSTEM`

## What This Release Fixes

The source JSONL files expose each example as a two-turn `messages` conversation only. This cleaned release standardizes that raw structure into a training-ready schema and removes records with quality issues.

### Transformations applied

1. Renamed the source `messages` field to `conversations`.
2. Split each record into `input` plus tagged `output`.
3. Normalized `output` into `<think>...</think>` followed by the final answer.
4. Rebuilt `id` as a deterministic MD5 hash over `domain + input + reasoning + answer`.
5. Wrote subset-level provenance into the `domain` field because the source data does not provide a finer per-example domain label.
6. Added `meta.input_tokens`, `meta.output_tokens`, and `meta.teacher_model`.
7. Preserved the original four-config subset boundaries instead of merging everything into one file.
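The transformations listed above can be sketched roughly as follows. This is a minimal illustration, not the release's actual pipeline: the function name `build_record`, the `count_tokens` parameter, the whitespace-based token counter, and the choice to hash a plain UTF-8 concatenation with no separators are all assumptions made for the example. The release states only that the `id` is an MD5 hash over `domain + input + reasoning + answer` and that token lengths come from the `moonshotai/Kimi-K2.5` tokenizer.

```python
import hashlib


def build_record(domain, prompt, reasoning, answer, count_tokens=None):
    """Assemble one record in the cleaned schema (illustrative only)."""
    # Placeholder token counter (assumption): the release used the
    # moonshotai/Kimi-K2.5 tokenizer; whitespace splitting only keeps
    # this sketch free of external dependencies.
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())

    # Reasoning wrapped in <think> tags, followed by the final answer.
    output = f"<think>\n{reasoning}\n</think>\n\n{answer}"

    # Deterministic MD5 over domain + input + reasoning + answer.
    # Assumption: plain concatenation, UTF-8 encoded, no separators.
    payload = (domain + prompt + reasoning + answer).encode("utf-8")
    digest = hashlib.md5(payload).hexdigest()

    return {
        "id": digest,
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": output},
        ],
        "input": prompt,
        "output": output,
        "domain": domain,  # subset name doubles as the domain label
        "meta": {
            "input_tokens": count_tokens(prompt),
            "output_tokens": count_tokens(output),
            "teacher_model": "KIMI-K2.5",
        },
    }


record = build_record("General-Math", "What is 2 + 2?", "2 + 2 = 4.", "4")
```

Because the hash depends only on the record's content, rebuilding the same example yields the same `id`, which is what makes exact-duplicate detection after normalization straightforward.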
### Removed data

The cleaning pipeline filters records with:

- malformed or unparseable reasoning / answer boundaries,
- incomplete or obviously truncated answers,
- refusal-style answers,
- repeated reasoning or duplicated answer segments,
- exact duplicate records after normalization.

## Dataset Structure

```json
{
  "id": "md5-hash-of-domain-input-reasoning-answer",
  "conversations": [
    {"from": "human", "value": "user prompt"},
    {"from": "gpt", "value": "<think>\nreasoning trace\n</think>\n\nfinal answer"}
  ],
  "input": "user prompt",
  "output": "<think>\nreasoning trace\n</think>\n\nfinal answer",
  "domain": "subset-derived label such as General-Math",
  "meta": {
    "input_tokens": 123,
    "output_tokens": 456,
    "teacher_model": "KIMI-K2.5"
  }
}
```

### Field notes

- `conversations[0]`: the user prompt.
- `conversations[1]`: the cleaned assistant response with `<think>` tags.
- `input`: flat prompt view.
- `output`: flat completion view containing reasoning plus final answer.
- `domain`: subset-derived label. The source repository does not include an explicit per-example domain field, so this release uses the source config name as the domain value.
- `meta`: lightweight token-length metadata and teacher model provenance.
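Since `output` packs the reasoning trace and the final answer into one string, a consumer will often want to separate them again. The sketch below shows one way to do that. It assumes every record follows the exact whitespace layout shown in the schema example (`<think>`, newline, reasoning, newline, `</think>`, blank line, answer); `split_output` and `THINK_PATTERN` are hypothetical names, and real records may need a more lenient pattern.

```python
import re

# Assumed layout, taken from the schema example:
# "<think>\n{reasoning}\n</think>\n\n{answer}"
THINK_PATTERN = re.compile(r"\A<think>\n(.*?)\n</think>\n\n(.*)\Z", re.DOTALL)


def split_output(output):
    """Return (reasoning, answer), or None if the layout does not match."""
    match = THINK_PATTERN.match(output)
    if match is None:
        return None
    reasoning, answer = match.groups()
    return reasoning, answer


parsed = split_output("<think>\nstep one\nstep two\n</think>\n\n42")
malformed = split_output("an answer with no think tags")
```

Returning `None` instead of raising keeps the helper usable as a filter predicate, for example when checking a sample of records for boundary problems.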
## Subset Statistics

| Subset | Processed | Kept | Removed | File Size | Median Input Tokens | Median Output Tokens |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| General-Distillation | 598,366 | 553,313 | 45,053 | 14.20 GB | 49 | 2972 |
| PHD-Science | 103,759 | 103,307 | 452 | 2.80 GB | 45 | 3021 |
| General-Math | 208,426 | 97,771 | 110,655 | 6.16 GB | 56 | 9616 |
| MultilingualSTEM | 93,038 | 89,997 | 3,041 | 2.90 GB | 76 | 3956 |

## Filter Statistics

| Subset | Issue | Removed |
| --- | --- | ---: |
| General-Distillation | repeated_paragraph | 38,925 |
| General-Distillation | incomplete_output | 4,849 |
| General-Distillation | unparseable_output | 748 |
| General-Distillation | refusal_answer | 531 |
| PHD-Science | incomplete_output | 311 |
| PHD-Science | unparseable_output | 101 |
| PHD-Science | repeated_paragraph | 37 |
| PHD-Science | refusal_answer | 3 |
| General-Math | unparseable_output | 99,375 |
| General-Math | repeated_paragraph | 7,448 |
| General-Math | incomplete_output | 3,832 |
| MultilingualSTEM | unparseable_output | 1,841 |
| MultilingualSTEM | incomplete_output | 677 |
| MultilingualSTEM | repeated_paragraph | 522 |
| MultilingualSTEM | refusal_answer | 1 |

## Additional Token Statistics

| Subset | Mean Input Tokens | P95 Input Tokens | Mean Output Tokens | P95 Output Tokens |
| --- | ---: | ---: | ---: | ---: |
| General-Distillation | 115.94 | 506 | 3189.8 | 6761 |
| PHD-Science | 44.98 | 56 | 3213.31 | 5107 |
| General-Math | 57.76 | 81 | 9402.39 | 12485 |
| MultilingualSTEM | 79.22 | 123 | 4555.73 | 9082 |

## Included Content

- `General-Distillation`: the broad mixed-domain reasoning split from the source release.
- `PHD-Science`: science-heavy reasoning traces.
- `General-Math`: math-focused reasoning traces.
- `MultilingualSTEM`: multilingual STEM reasoning traces.
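The per-subset counts in the Subset Statistics table can be cross-checked against the totals in the Summary with a few lines of arithmetic. The sketch below copies the Processed and Kept figures from that table; the names `SUBSETS` and `retention_summary` are made up for this example and are not part of the release.

```python
# (processed, kept) per subset, copied from the Subset Statistics table.
SUBSETS = {
    "General-Distillation": (598_366, 553_313),
    "PHD-Science": (103_759, 103_307),
    "General-Math": (208_426, 97_771),
    "MultilingualSTEM": (93_038, 89_997),
}


def retention_summary(subsets):
    """Derive removed counts and kept percentages from (processed, kept)."""
    rows = {}
    for name, (processed, kept) in subsets.items():
        rows[name] = {
            "removed": processed - kept,
            "kept_pct": round(100 * kept / processed, 2),
        }
    return rows


summary = retention_summary(SUBSETS)
total_processed = sum(processed for processed, _ in SUBSETS.values())
total_kept = sum(kept for _, kept in SUBSETS.values())
total_removed = total_processed - total_kept
```

The derived totals match the Summary figures (1,003,589 processed, 844,388 kept, 159,201 removed). The calculation also makes the uneven filtering visible: `General-Math` retains fewer than half of its processed records, while `PHD-Science` loses fewer than 1%.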
## Usage

```python
from datasets import load_dataset

general = load_dataset("Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned", "General-Distillation")
science = load_dataset("Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned", "PHD-Science")
math_ds = load_dataset("Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned", "General-Math")
multi = load_dataset("Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned", "MultilingualSTEM")
```

## Provenance

- Original dataset: [`ianncity/KIMI-K2.5-1000000x`](https://huggingface.co/datasets/ianncity/KIMI-K2.5-1000000x)
- Original author: **ianncity**
- This release is a cleaned derivative and should not be treated as the original source dataset.

## Citation

Please cite the original dataset:

```bibtex
@misc{kimi_k25_1000000x,
  title={KIMI-K2.5-1000000x},
  author={ianncity},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/ianncity/KIMI-K2.5-1000000x}
}
```

You can additionally cite this cleaned derivative release:

```bibtex
@misc{kimi_k25_reasoning_1m_cleaned,
  title={Kimi-K2.5-Reasoning-1M-Cleaned},
  author={Jackrong},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned}
}
```


Provider:

maas

Created:

2026-04-18
