
camel-ai/seta-sft-kimi-k2.5-thinking

Source: Hugging Face. Updated 2026-04-14; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/camel-ai/seta-sft-kimi-k2.5-thinking
Resource description:
---
license: apache-2.0
language:
- en
task_categories:
- text-generation
tags:
- sft
- agent
- terminal
- tool-use
- qwen3
size_categories:
- 1K<n<10K
dataset_info:
  features:
  - name: task_id
    dtype: string
  - name: trial_uid
    dtype: string
  - name: reward
    dtype: float64
  - name: model
    dtype: string
  - name: conv_json_path
    dtype: string
  - name: provider_token_counts
    struct:
    - name: cached_tokens
      dtype: int64
    - name: completion_tokens
      dtype: int64
    - name: prompt_tokens
      dtype: int64
    - name: total_tokens
      dtype: int64
  - name: local_token_count
    dtype: int64
  - name: n_assistant_tokens
    dtype: int64
  - name: n_messages
    dtype: int64
  - name: raw_conv_json
    dtype: string
  - name: chat_template_str
    dtype: string
  - name: input_ids
    list: int32
  - name: loss_mask
    list: int64
  splits:
  - name: train
    num_bytes: 406433551
    num_examples: 1768
  download_size: 299197651
  dataset_size: 406433551
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Seta SFT: Kimi K2.5 (thinking)

Supervised fine-tuning dataset distilled from 1488 successful agent rollouts of **moonshot/kimi-k2.5** on the [seta-env-v2](https://huggingface.co/datasets/camel-ai/seta-env-v2) terminal-agent benchmark, tokenized with the **Qwen/Qwen3-8B** chat template and ready for AREAL `FSDPLMEngine` SFT training.

## Schema

Each row preserves the full per-trial diagnostic record from the build pipeline so consumers can inspect, filter, or re-tokenize without rerunning the rollouts:

| column | type | meaning |
|---|---|---|
| `task_id` | str | task name (without `_t<i>_<hash>` suffix) |
| `trial_uid` | str | full trial dir name |
| `reward` | float | passed/total tests from `verifier/ctrf.json` |
| `model` | str | rollout model id (e.g. `kimi-k2.5`) |
| `conv_json_path` | str | source conv json path on the build machine (provenance) |
| `provider_token_counts` | dict | LLM API counts: `prompt_tokens`, `completion_tokens`, `total_tokens`, `cached_tokens` |
| `local_token_count` | int | `len(input_ids)` under the Qwen3-8B tokenizer |
| `n_assistant_tokens` | int | `sum(loss_mask)` |
| `n_messages` | int | reconstructed conversation length (incl. final assistant turn) |
| `raw_conv_json` | str | full original conv json, serialized |
| `chat_template_str` | str | exact rendered template that was tokenized |
| `input_ids` | list[int] | tokenized full conversation |
| `loss_mask` | list[int] | same length as `input_ids`; `1` on every assistant span (incl. `<think>` reasoning blocks, content, and `<tool_call>...</tool_call>`), `0` on system / user / tool-response / padding |

For AREAL training, the loader [`scripts/areal_sft/seta_sft_dataset.py::get_seta_sft_dataset`](https://github.com/camel-ai/terminal_agent/blob/areal_dev/scripts/areal_sft/seta_sft_dataset.py) projects each row to `(input_ids, loss_mask)` only; the trainer's `pad_sequences_to_tensors` collator consumes nothing else.

## Statistics

| | value |
|---|---|
| rows | **1,488** |
| total tokens | **15,688,749** |
| trainable tokens | **7,787,881** (49.6%) |
| tokens / row (mean) | 10,544 |
| tokens / row (median) | 8,884 |
| tokens / row (max) | 28,937 |
| reward = 1.0 rows | 1,112 (74.7%) |
| reward (mean) | 0.932 |

## Provenance & processing

1. **Rollouts.** moonshot/kimi-k2.5 was run via the TITO (Token-In-Token-Out) agent on every task in `seta-env-v2` (1 trajectory per task, max 200 iterations, 28,672 max total tokens, 4,096 max completion tokens, temperature 1.0, thinking enabled). Tasks that crashed or hit rate limits were re-run via `eval.py --resume` until convergence (4 resume passes).
2. **Merging.** The resume chain was merged into one canonical dir via `seta_env.utils.collect_results --merge --collect-trials move`: 1605 trial subdirs total, 18 fully-failed tasks, and 1488 trials that successfully produced both `verifier/ctrf.json` and a conversation snapshot in `CAMEL_LOG_DIR/`.
3. **Tokenization & loss-mask construction.** `seta_env.utils.sft_utils.build_sft_dataset` walks each trial dir, picks the largest `CAMEL_LOG_DIR/.../conv_*.json` (the most-complete snapshot), reconstructs `request.messages + response.choices[0].message`, applies the Qwen3-8B chat template, and builds the per-token loss mask via a token-stream scan that marks every `<|im_start|>assistant ... <|im_end|>\n` span as trainable. Boundary tokens (`<|im_start|>`, `<|im_end|>`, `<think>`, `</think>`, `<tool_call>`, trailing `\n`) are all attributed to the correct side of the mask, as verified by per-trial inspection artifacts under the build script's `--debug` mode.
4. **Filtering.** Trials whose `verifier/ctrf.json` was missing entirely (errored or timed out before verification ran) were dropped (117 of the 1605 trial subdirs). Trials with any verified reward (including 0/N rollouts) were kept; downstream consumers can filter further on the `reward` column of this dataset.
5. **Thinking handling.** Each assistant turn includes its full `<think>...</think>` reasoning trace (extracted from the rollout's `reasoning_content` field) followed by the visible content and tool calls. The model is trained to **emit chain-of-thought** before every response.

## Companion datasets

- [camel-ai/seta-sft-kimi-k2.5-nothink](https://huggingface.co/datasets/camel-ai/seta-sft-kimi-k2.5-nothink): same rollouts, no-thinking variant. Useful for ablations.

## Loading

For inspection / custom processing (full row):

```python
from datasets import load_dataset

ds = load_dataset("camel-ai/seta-sft-kimi-k2.5-thinking", split="train")
print(ds[0]["task_id"], ds[0]["reward"], ds[0]["local_token_count"])
print(ds[0]["chat_template_str"][:500])
```

For AREAL-trainer-ready loading (projected to `(input_ids, loss_mask)`):

```python
from seta_sft_dataset import get_seta_sft_dataset

ds = get_seta_sft_dataset("camel-ai/seta-sft-kimi-k2.5-thinking", split="train", max_length=16384)
# Dataset({features: ['input_ids', 'loss_mask'], num_rows: ...})
```

## License

Apache 2.0 (matches the upstream `camel-ai/seta-env-v2` license).
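
## Example: loss-mask scan (sketch)

The token-stream scan described in step 3 of "Provenance & processing" can be illustrated with a small, dependency-free sketch. This is not the code in `seta_env.utils.sft_utils.build_sft_dataset`; it only shows the general idea of marking assistant spans in a token sequence. The integer token ids, the function name `build_loss_mask`, and the `train_on_header` flag are all hypothetical. In particular, the dataset card does not say which side of the mask the `<|im_start|>assistant\n` header falls on, so the sketch exposes that choice as a parameter rather than assuming one.

```python
from typing import List

# Hypothetical toy vocabulary. Real ids would come from the Qwen3-8B tokenizer.
IM_START, IM_END, NEWLINE, ASSISTANT, USER = 1, 2, 3, 4, 5


def build_loss_mask(input_ids: List[int], train_on_header: bool = False) -> List[int]:
    """Return a 0/1 mask with 1 on every assistant span.

    A span starts at `<|im_start|> assistant \\n` and runs through the
    matching `<|im_end|>` plus one trailing newline, if present. Whether
    the three header tokens are trainable is an assumption controlled by
    `train_on_header`.
    """
    n = len(input_ids)
    mask = [0] * n
    i = 0
    while i < n:
        is_header = (
            input_ids[i] == IM_START
            and i + 2 < n
            and input_ids[i + 1] == ASSISTANT
            and input_ids[i + 2] == NEWLINE
        )
        if not is_header:
            i += 1
            continue

        start = i if train_on_header else i + 3

        # Scan forward to the closing <|im_end|>; a truncated turn runs to the end.
        j = i + 3
        while j < n and input_ids[j] != IM_END:
            j += 1
        end = min(j + 1, n)  # include <|im_end|> itself when it exists

        # Attribute the newline that follows <|im_end|> to the assistant span.
        if end < n and input_ids[end] == NEWLINE:
            end += 1

        for k in range(start, end):
            mask[k] = 1
        i = end
    return mask


# One user turn followed by one assistant turn (content ids 10, 11, 20, 21 are arbitrary).
demo_ids = [
    IM_START, USER, NEWLINE, 10, 11, IM_END, NEWLINE,
    IM_START, ASSISTANT, NEWLINE, 20, 21, IM_END, NEWLINE,
]
demo_mask = build_loss_mask(demo_ids)
print(demo_mask)       # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
print(sum(demo_mask))  # 4
```

On real rows, the schema's own invariants give a cheap sanity check for any re-tokenization: `len(loss_mask)` should equal `len(input_ids)` (that is, `local_token_count`), and `sum(loss_mask)` should equal `n_assistant_tokens`.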
Provider: camel-ai