lightseekorg/kimi-mtp-dataset

Name: lightseekorg/kimi-mtp-dataset
Creator: lightseekorg
Published: 2026-03-31 17:45:42
License: Apache 2.0

Hugging Face · Updated 2026-03-31 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/lightseekorg/kimi-mtp-dataset


Description:

---
dataset_info:
  features:
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_examples: 476904
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: apache-2.0
language:
- en
- zh
tags:
- speculative-decoding
- eagle3
- kimi-k2.5
- draft-model
- conversations
---

# Kimi-K2.5 Eagle3 Training Data

This dataset contains the instruction-following data used to train an **Eagle3 MTP draft model** for [Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) with [TorchSpec](https://github.com/torchspec-project/TorchSpec).

All responses were **regenerated by running Kimi-K2.5 via Engine** rather than taken from the original datasets. This is critical for speculative decoding training: the draft model must learn the exact token-level distribution of the target model it is accelerating.

The trained Eagle3 draft model is available at [lightseekorg/kimi-k2.5-eagle3](https://huggingface.co/lightseekorg/kimi-k2.5-eagle3).

If you find this draft model useful, please give our project **TorchSpec** a 🌟 on [GitHub](https://github.com/torchspec-project/TorchSpec).

## Data source

Due to inference resource constraints, some source datasets are only partially regenerated.
Here is the list of source datasets used in this mix:

| Dataset | Source | # Samples |
|---------|--------|-----------|
| [mlabonne/open-perfectblend](https://huggingface.co/datasets/mlabonne/open-perfectblend) | `perfectblend` | 296,034 |
| [liuhaotian/LLaVA-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) | `llava_instruct` | 123,102 |
| [HuggingFaceTB/smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | `smoltalk_cn` | 48,333 |
| [daviddjtzafon/continual-tool-kimi-k2.5](https://huggingface.co/datasets/daviddjtzafon/continual-tool-kimi-k2.5) | `continual_tool_kimi` | 4,370 |
| [crownelius/KimiK2.5-2000x-formatted](https://huggingface.co/datasets/crownelius/KimiK2.5-2000x-formatted) | `kimi_2000x` | 2,144 |
| [crownelius/Creative-Writing-KimiK2.5-Cleaned](https://huggingface.co/datasets/crownelius/Creative-Writing-KimiK2.5-Cleaned) | `creative_writing` | 1,393 |
| [DCAgent2/terminal_bench_2](https://huggingface.co/datasets/DCAgent2/terminal_bench_2__together_ai_moonshotai_Kimi-K2.5_20260203) | `dcagent` | 873 |
| [crownelius/Creative-Writing-Reasoning-KimiK2.5-600x](https://huggingface.co/datasets/crownelius/Creative-Writing-Reasoning-KimiK2.5-600x) | `creative_writing_reasoning` | 655 |
| **Total** | | **476,904** |

## Data format

Each sample contains two fields:

- **`conversations`**: a list of turns, each with `from` (`human` / `gpt` / `system`) and `value` (string).
- **`source`**: the name of the source dataset (see table above).

```json
{
  "conversations": [
    {"from": "human", "value": "What is the capital of France?"},
    {"from": "gpt", "value": "The capital of France is Paris."}
  ],
  "source": "perfectblend"
}
```

Multimodal samples (`llava_instruct`) use the OpenAI vision format in the `value` field (a list of `image_url` and `text` objects), with local image paths replaced by public COCO URLs (`http://images.cocodataset.org/train2017/{filename}`).
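
The sample layout described above can be turned into role/content chat messages with a few lines of Python. This is a minimal sketch, not part of the dataset's tooling: the `ROLE_MAP` target names, the `decode_value` helper, and the assumption that multimodal `value` fields arrive as JSON-encoded strings (the schema declares `value` as a string) are all illustrative guesses.

```python
import json

# Assumed mapping from the card's `from` values to chat roles; the target
# role names are an illustration, not something the card specifies.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def decode_value(value):
    """Return a list of content parts if `value` looks like JSON-encoded
    vision content; otherwise return it unchanged (a heuristic)."""
    if not isinstance(value, str):
        return value
    if value.lstrip().startswith("["):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return value
        # Only accept non-empty lists of objects that carry a "type" key.
        if isinstance(parsed, list) and parsed and all(
            isinstance(part, dict) and "type" in part for part in parsed
        ):
            return parsed
    return value


def to_messages(sample):
    """Convert one sample's `conversations` into role/content messages."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": decode_value(turn["value"])}
        for turn in sample["conversations"]
    ]


# The text-only example from the card.
sample = {
    "conversations": [
        {"from": "human", "value": "What is the capital of France?"},
        {"from": "gpt", "value": "The capital of France is Paris."},
    ],
    "source": "perfectblend",
}

for message in to_messages(sample):
    print(message["role"], "->", message["content"])
```

Assuming the Hugging Face `datasets` library is installed, real samples could be fed to `to_messages` by iterating over `datasets.load_dataset("lightseekorg/kimi-mtp-dataset", split="train", streaming=True)`.
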

Function-call samples (`continual_tool_kimi`) use Kimi-K2.5's special token format for tool calls:

```
<|tool_calls_section_begin|><|tool_call_begin|>{id}<|tool_call_argument_begin|>{args_json}<|tool_call_end|><|tool_calls_section_end|>
```

Tool results are serialized as `human` turns with the prefix `## Return of {call_id}\n`.

## Training

See [TorchSpec](https://github.com/torchspec-project/TorchSpec) for the full training recipe, configuration, and evaluation results.

## License

Apache 2.0. All source datasets are Apache 2.0 or MIT licensed.
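
For readers who want to inspect the `continual_tool_kimi` samples, the tool-call serialization and the `## Return of {call_id}\n` prefix described under Data format can be unpacked with regular expressions. The sketch below is an assumption-laden illustration: `parse_tool_calls` and `parse_tool_result` are hypothetical helpers, the call id and arguments in the example are made up, and it assumes `{args_json}` is always valid JSON.

```python
import json
import re

# Patterns built directly from the token format quoted in the card.
SECTION_RE = re.compile(
    r"<\|tool_calls_section_begin\|>(.*?)<\|tool_calls_section_end\|>",
    re.DOTALL,
)
CALL_RE = re.compile(
    r"<\|tool_call_begin\|>(?P<id>.*?)"
    r"<\|tool_call_argument_begin\|>(?P<args>.*?)"
    r"<\|tool_call_end\|>",
    re.DOTALL,
)
# Tool results: a `human` turn starting with "## Return of {call_id}\n".
RESULT_RE = re.compile(r"## Return of (?P<call_id>[^\n]*)\n(?P<body>.*)", re.DOTALL)


def parse_tool_calls(text):
    """Extract every tool call in `text` as {"id": ..., "arguments": ...}."""
    calls = []
    for section in SECTION_RE.findall(text):
        for match in CALL_RE.finditer(section):
            calls.append(
                {
                    "id": match.group("id").strip(),
                    # Assumption: the argument payload is valid JSON.
                    "arguments": json.loads(match.group("args")),
                }
            )
    return calls


def parse_tool_result(value):
    """Split a tool-result turn into (call_id, body); None if no prefix."""
    match = RESULT_RE.match(value)
    if match is None:
        return None
    return match.group("call_id"), match.group("body")


# Hypothetical assistant turn; the id and arguments are invented.
example = (
    "Let me check."
    "<|tool_calls_section_begin|>"
    "<|tool_call_begin|>functions.get_weather:0"
    '<|tool_call_argument_begin|>{"city": "Paris"}'
    "<|tool_call_end|>"
    "<|tool_calls_section_end|>"
)
print(parse_tool_calls(example))
print(parse_tool_result("## Return of functions.get_weather:0\nsunny"))
```

Matching the section wrapper first keeps stray text outside `<|tool_calls_section_begin|>` and `<|tool_calls_section_end|>` from being mistaken for a call.
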

Provider:

lightseekorg
