
Kimi-K2.5-Reasoning-1M-Cleaned

ModelScope community, updated 2026-05-11, indexed 2026-05-03
Download link:
https://modelscope.cn/datasets/Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned
Resource description:
# 🪐 Kimi-K2.5-Reasoning-1M-Cleaned

**Kimi-K2.5-Reasoning-1M-Cleaned** is a cleaned derivative of [ianncity/KIMI-K2.5-1000000x](https://huggingface.co/datasets/ianncity/KIMI-K2.5-1000000x). It preserves the original four-config layout from the source dataset and rewrites each record into a unified reasoning-SFT schema with `id`, `conversations`, `input`, `output`, `domain`, and `meta`.

![image](https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/b1-BDXcO8Fn58aqEbIsFb.png)

## Summary

- Source dataset: [`ianncity/KIMI-K2.5-1000000x`](https://huggingface.co/datasets/ianncity/KIMI-K2.5-1000000x)
- Source author: **ianncity**
- Teacher model recorded in `meta.teacher_model`: `KIMI-K2.5`
- Token lengths computed with tokenizer: `moonshotai/Kimi-K2.5`
- Total processed records: **1,003,589**
- Total kept records: **844,388**
- Total removed records: **159,201**
- Original source configs preserved: `General-Distillation`, `PHD-Science`, `General-Math`, `MultilingualSTEM`

## What This Release Fixes

The source JSONL files expose each example as a two-turn `messages` conversation only. This cleaned release standardizes that raw structure into a training-ready schema and removes records with quality issues.

### Transformations applied

1. Renamed the source `messages` field to `conversations`.
2. Split each record into `input` plus tagged `output`.
3. Normalized `output` into `<think>...</think>` followed by the final answer.
4. Rebuilt `id` as a deterministic MD5 hash over `domain + input + reasoning + answer`.
5. Wrote subset-level provenance into the `domain` field because the source data does not provide a finer per-example domain label.
6. Added `meta.input_tokens`, `meta.output_tokens`, and `meta.teacher_model`.
7. Preserved the original four-config subset boundaries instead of merging everything into one file.
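
The transformations above can be illustrated with a minimal sketch that assembles one record in the unified schema. This is not the release's actual pipeline: the helper name `build_record`, the newline separator used before hashing, and the whitespace-based token count (a stand-in for the `moonshotai/Kimi-K2.5` tokenizer) are all assumptions made for illustration.

```python
import hashlib

TEACHER_MODEL = "KIMI-K2.5"


def whitespace_token_count(text):
    """Placeholder token counter (assumption): the release uses a real tokenizer."""
    return len(text.split())


def build_record(prompt, reasoning, answer, domain, count_tokens=whitespace_token_count):
    """Assemble one record with id, conversations, input, output, domain, meta."""
    reasoning = reasoning.strip()
    answer = answer.strip()
    # Reasoning goes inside <think> tags, followed by the final answer.
    output = f"<think>\n{reasoning}\n</think>\n\n{answer}"
    # Assumption: the four parts are joined with newlines before hashing.
    # The source only states that the hash covers domain, input, reasoning, answer.
    hash_input = "\n".join([domain, prompt, reasoning, answer])
    record_id = hashlib.md5(hash_input.encode("utf-8")).hexdigest()
    return {
        "id": record_id,
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": output},
        ],
        "input": prompt,
        "output": output,
        "domain": domain,
        "meta": {
            "input_tokens": count_tokens(prompt),
            "output_tokens": count_tokens(output),
            "teacher_model": TEACHER_MODEL,
        },
    }


record = build_record("What is 2 + 2?", "Add the two numbers.", "4", "General-Math")
print(record["id"])
print(record["output"])
```

Because the `id` depends only on the record's content, rebuilding the same record yields the same identifier, which is what makes exact-duplicate detection after normalization straightforward.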
### Removed data

The cleaning pipeline filters records with:

- malformed or unparseable reasoning / answer boundaries,
- incomplete or obviously truncated answers,
- refusal-style answers,
- repeated reasoning or duplicated answer segments,
- exact duplicate records after normalization.

## Dataset Structure

```json
{
  "id": "md5-hash-of-domain-input-reasoning-answer",
  "conversations": [
    {"from": "human", "value": "user prompt"},
    {"from": "gpt", "value": "<think>\nreasoning trace\n</think>\n\nfinal answer"}
  ],
  "input": "user prompt",
  "output": "<think>\nreasoning trace\n</think>\n\nfinal answer",
  "domain": "subset-derived label such as General-Math",
  "meta": {
    "input_tokens": 123,
    "output_tokens": 456,
    "teacher_model": "KIMI-K2.5"
  }
}
```

### Field notes

- `conversations[0]`: the user prompt.
- `conversations[1]`: the cleaned assistant response with `<think>` tags.
- `input`: flat prompt view.
- `output`: flat completion view containing reasoning plus final answer.
- `domain`: subset-derived label. The source repository does not include an explicit per-example domain field, so this release uses the source config name as the domain value.
- `meta`: lightweight token-length metadata and teacher model provenance.
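
Since `output` stores the reasoning and the final answer in one string, a consumer often needs to split them again. The sketch below shows one way to do that with a regular expression, returning `None` when the boundaries cannot be parsed. The function name `split_output` and the exact rejection rules are assumptions for illustration; the release's own parser may behave differently.

```python
import re

# One <think>...</think> block at the start, then the final answer.
THINK_PATTERN = re.compile(r"\A<think>\n?(.*?)\n?</think>\s*(.*)\Z", re.DOTALL)


def split_output(output):
    """Return (reasoning, answer), or None if the boundaries are unparseable."""
    match = THINK_PATTERN.match(output)
    if match is None:
        return None
    reasoning = match.group(1).strip()
    answer = match.group(2).strip()
    # Treat empty parts or stray tags in the answer as malformed (assumption).
    if not reasoning or not answer:
        return None
    if "<think>" in answer or "</think>" in answer:
        return None
    return reasoning, answer


example = "<think>\nreasoning trace\n</think>\n\nfinal answer"
print(split_output(example))
print(split_output("answer without any reasoning tags"))
```

A check of this kind corresponds to the "malformed or unparseable reasoning / answer boundaries" category listed under removed data, although the thresholds used there are not documented in this card.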
## Subset Statistics

| Subset | Processed | Kept | Removed | File Size | Median Input Tokens | Median Output Tokens |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| General-Distillation | 598,366 | 553,313 | 45,053 | 14.20 GB | 49 | 2972 |
| PHD-Science | 103,759 | 103,307 | 452 | 2.80 GB | 45 | 3021 |
| General-Math | 208,426 | 97,771 | 110,655 | 6.16 GB | 56 | 9616 |
| MultilingualSTEM | 93,038 | 89,997 | 3,041 | 2.90 GB | 76 | 3956 |

## Filter Statistics

| Subset | Issue | Removed |
| --- | --- | ---: |
| General-Distillation | repeated_paragraph | 38,925 |
| General-Distillation | incomplete_output | 4,849 |
| General-Distillation | unparseable_output | 748 |
| General-Distillation | refusal_answer | 531 |
| PHD-Science | incomplete_output | 311 |
| PHD-Science | unparseable_output | 101 |
| PHD-Science | repeated_paragraph | 37 |
| PHD-Science | refusal_answer | 3 |
| General-Math | unparseable_output | 99,375 |
| General-Math | repeated_paragraph | 7,448 |
| General-Math | incomplete_output | 3,832 |
| MultilingualSTEM | unparseable_output | 1,841 |
| MultilingualSTEM | incomplete_output | 677 |
| MultilingualSTEM | repeated_paragraph | 522 |
| MultilingualSTEM | refusal_answer | 1 |

## Additional Token Statistics

| Subset | Mean Input Tokens | P95 Input Tokens | Mean Output Tokens | P95 Output Tokens |
| --- | ---: | ---: | ---: | ---: |
| General-Distillation | 115.94 | 506 | 3189.8 | 6761 |
| PHD-Science | 44.98 | 56 | 3213.31 | 5107 |
| General-Math | 57.76 | 81 | 9402.39 | 12485 |
| MultilingualSTEM | 79.22 | 123 | 4555.73 | 9082 |

## Included Content

- `General-Distillation`: the broad mixed-domain reasoning split from the source release.
- `PHD-Science`: science-heavy reasoning traces.
- `General-Math`: math-focused reasoning traces.
- `MultilingualSTEM`: multilingual STEM reasoning traces.
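
The figures in the subset and filter tables can be cross-checked against the totals in the summary. The short script below copies the numbers from those tables, verifies that they add up, and prints the share of records kept per subset. The variable names are illustrative only.

```python
# (processed, kept, removed) per subset, copied from the Subset Statistics table.
SUBSETS = {
    "General-Distillation": (598_366, 553_313, 45_053),
    "PHD-Science": (103_759, 103_307, 452),
    "General-Math": (208_426, 97_771, 110_655),
    "MultilingualSTEM": (93_038, 89_997, 3_041),
}

# Per-issue removals per subset, copied from the Filter Statistics table.
FILTERS = {
    "General-Distillation": [38_925, 4_849, 748, 531],
    "PHD-Science": [311, 101, 37, 3],
    "General-Math": [99_375, 7_448, 3_832],
    "MultilingualSTEM": [1_841, 677, 522, 1],
}

total_processed = sum(processed for processed, _, _ in SUBSETS.values())
total_kept = sum(kept for _, kept, _ in SUBSETS.values())
total_removed = sum(removed for _, _, removed in SUBSETS.values())

# Totals match the Summary section.
assert (total_processed, total_kept, total_removed) == (1_003_589, 844_388, 159_201)

kept_share = {}
for name, (processed, kept, removed) in SUBSETS.items():
    # Each subset is internally consistent with its filter breakdown.
    assert processed - kept == removed == sum(FILTERS[name])
    kept_share[name] = kept / processed
    print(f"{name}: kept {kept_share[name]:.1%} of processed records")
```

The check makes the main outlier visible: `General-Math` keeps fewer than half of its processed records, almost entirely because of `unparseable_output`, while the other three subsets keep more than 90%.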
## Usage

```python
from datasets import load_dataset

general = load_dataset("Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned", "General-Distillation")
science = load_dataset("Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned", "PHD-Science")
math_ds = load_dataset("Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned", "General-Math")
multi = load_dataset("Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned", "MultilingualSTEM")
```

## Provenance

- Original dataset: [`ianncity/KIMI-K2.5-1000000x`](https://huggingface.co/datasets/ianncity/KIMI-K2.5-1000000x)
- Original author: **ianncity**
- This release is a cleaned derivative and should not be treated as the original source dataset.

## Citation

Please cite the original dataset:

```bibtex
@misc{kimi_k25_1000000x,
  title={KIMI-K2.5-1000000x},
  author={ianncity},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/ianncity/KIMI-K2.5-1000000x}
}
```

You can additionally cite this cleaned derivative release:

```bibtex
@misc{kimi_k25_reasoning_1m_cleaned,
  title={Kimi-K2.5-Reasoning-1M-Cleaned},
  author={Jackrong},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned}
}
```

Provider: maas
Created: 2026-04-18