chanwit/gemma4-cub-agent-v6
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link: https://hf-mirror.com/datasets/chanwit/gemma4-cub-agent-v6
---
license: apache-2.0
language:
- en
tags:
- gemma-4
- tool-calling
- reasoning
- thinking
- fine-tuning
- devops
- kubernetes
- confighub
- cub
- agent
size_categories:
- 1K<n<10K
---
# gemma4-cub-agent-v6
Training dataset for fine-tuning Gemma-4-31B-it as a DevOps agent with tool-calling and reasoning capabilities, specifically for managing Kubernetes clusters with [ConfigHub](https://confighub.com) (`cub` CLI).
## Why This Dataset
Fine-tuning Gemma-4 on reasoning-only data causes **catastrophic forgetting** of its native tool-calling. This dataset preserves both capabilities by mixing reasoning and tool-calling examples in Gemma-4's native token format.
## Composition
| Component | Examples | % |
|-----------|----------|---|
| Reasoning (coding, DevOps, kubectl, git) | 1362 | 72.6% |
| Tool-calling (cub, kubectl, argocd, flux) | 515 | 27.4% |
| **Total** | **1877** | **100%** |
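The percentages in the table follow directly from the example counts. A minimal check, using only the numbers published in the table (the dictionary keys are illustrative names, not dataset fields):

```python
# Example counts copied from the composition table.
counts = {"reasoning": 1362, "tool_calling": 515}

total = sum(counts.values())
# Share of each component, as a percentage rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total)   # 1877
print(shares)  # {'reasoning': 72.6, 'tool_calling': 27.4}
```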
## Stats
- **3184** tool calls across 515 examples
- **4456** thinking blocks across all examples
- **52** unique `cub` CLI commands verified against a real ConfigHub server
- **All tool-calling examples** include `<|channel>thought` reasoning before every `<|tool_call>`
- Average **4,292** chars/example
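Statistics of this kind can be recomputed by counting marker tokens in the formatted text. The sketch below is one possible approach, not the author's script: the helper names are hypothetical, and it assumes that a thought block anywhere earlier in the same model turn counts as "before" a tool call.

```python
TOOL_CALL_OPEN = "<|tool_call>"
THOUGHT_OPEN = "<|channel>thought"
MODEL_TURN = "<|turn>model"

def marker_stats(text: str) -> dict:
    """Count tool calls and thinking blocks in one formatted example."""
    return {
        "tool_calls": text.count(TOOL_CALL_OPEN),
        "thinking_blocks": text.count(THOUGHT_OPEN),
    }

def every_call_has_thought(text: str) -> bool:
    """True if each tool call has a thought block earlier in its model turn."""
    pos = text.find(TOOL_CALL_OPEN)
    while pos != -1:
        # Locate the start of the model turn that contains this call.
        turn_start = text.rfind(MODEL_TURN, 0, pos)
        if turn_start == -1 or text.find(THOUGHT_OPEN, turn_start, pos) == -1:
            return False
        pos = text.find(TOOL_CALL_OPEN, pos + 1)
    return True

sample = (
    "<|turn>model\n<|channel>thought\nList pods first.\n<channel|>"
    '<|tool_call>call:Bash{command:<|"|>kubectl get pods<|"|>}<tool_call|><turn|>\n'
    "<|turn>tool\nNAME READY\n<turn|>\n"
    "<|turn>model\nAll pods are running.<turn|>\n"
)
print(marker_stats(sample))            # {'tool_calls': 1, 'thinking_blocks': 1}
print(every_call_has_thought(sample))  # True
```

By the published totals, tool-calling examples average about 6.2 calls each (3184 / 515).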
## Tool-Calling Coverage
Covers 8 areas of ConfigHub management:
- **Unit lifecycle**: create, apply, refresh, diff, restore, approve, destroy
- **Functions**: get-replicas, set-image, yq-i, search-replace, vet-celexpr, etc.
- **Drift detection**: refresh, diff, livestate, reconciliation
- **GitOps**: ArgoCD + Flux via OCI bridge
- **Helm**: install, upgrade, template through ConfigHub
- **Workers**: create, install, status, logs
- **Multi-cluster**: push-upgrade, cross-space apply, changesets
- **Import**: adopt live resources, clean manifests, avoid SSA conflicts
Also covers kubectl troubleshooting (pod failures, OOMKill, CrashLoop, 502s, rollbacks), ArgoCD sync issues, Flux stuck kustomizations, and general coding tasks.
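To see how the tool calls in an example are distributed across CLIs, the command strings can be pulled out of the `<|tool_call>` blocks and tallied by their first word. This is a rough sketch: `cli_counts` is a hypothetical helper, and the regular expression assumes every call is a `Bash` call with a single `command` argument, as in the format shown in this card. The sample commands are ordinary kubectl, Flux, and Argo CD invocations chosen for illustration; `cub` commands would be tallied the same way.

```python
import re
from collections import Counter

# Captures the command string of a Bash tool call (assumed single-argument form).
CALL_RE = re.compile(
    r'<\|tool_call>call:Bash\{command:<\|"\|>(.*?)<\|"\|>\}<tool_call\|>',
    re.DOTALL,
)

def cli_counts(text: str) -> Counter:
    """Tally tool calls by the first word of the command (kubectl, flux, ...)."""
    counts = Counter()
    for command in CALL_RE.findall(text):
        words = command.split()
        if words:
            counts[words[0]] += 1
    return counts

def call(cmd: str) -> str:
    """Wrap a shell command in the tool-call token format (illustration only)."""
    return '<|tool_call>call:Bash{command:<|"|>' + cmd + '<|"|>}<tool_call|>'

sample = "".join([
    call("kubectl get pods"),
    call("kubectl get nodes"),
    call("flux get kustomizations"),
    call("argocd app list"),
])
print(cli_counts(sample))
```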
## Format
Native Gemma-4 tokens — ready for `SFTTrainer` with `train_on_responses_only`:
```
<bos><|turn>system
<|think|>System prompt<|tool>declaration:Bash{...}<tool|><turn|>
<|turn>user
User request<turn|>
<|turn>model
<|channel>thought
Step-by-step reasoning
<channel|><|tool_call>call:Bash{command:<|"|>kubectl get pods<|"|>}<tool_call|><turn|>
<|turn>tool
Command output<turn|>
<|turn>model
Final answer<turn|>
```
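For inspection or debugging, a formatted example can be split back into its turns. The following is a minimal sketch that relies only on the `<|turn>role` and `<turn|>` delimiters shown above; `split_turns` is a hypothetical helper, not part of any library.

```python
import re

# A turn opens with "<|turn>ROLE\n" and closes with "<turn|>".
TURN_RE = re.compile(r"<\|turn>(\w+)\n(.*?)<turn\|>", re.DOTALL)

def split_turns(text: str) -> list[tuple[str, str]]:
    """Return (role, body) pairs in the order they appear."""
    return [(role, body) for role, body in TURN_RE.findall(text)]

example = (
    "<bos><|turn>system\n"
    "<|think|>System prompt<|tool>declaration:Bash{...}<tool|><turn|>\n"
    "<|turn>user\nUser request<turn|>\n"
    "<|turn>model\n<|channel>thought\nStep-by-step reasoning\n<channel|>"
    '<|tool_call>call:Bash{command:<|"|>kubectl get pods<|"|>}<tool_call|><turn|>\n'
    "<|turn>tool\nCommand output<turn|>\n"
    "<|turn>model\nFinal answer<turn|>\n"
)

turns = split_turns(example)
print([role for role, _ in turns])  # ['system', 'user', 'model', 'tool', 'model']
print(turns[-1][1])                 # Final answer
```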
## Usage
```python
# train_on_responses_only is provided by Unsloth (unsloth.chat_templates), not TRL
from unsloth.chat_templates import train_on_responses_only
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

dataset = load_dataset("chanwit/gemma4-cub-agent-v6")

# `model` is assumed to be a Gemma-4 model loaded beforehand
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    args=SFTConfig(
        dataset_text_field="text",
        max_seq_length=8192,  # called `max_length` in recent TRL releases
        # ... remaining training arguments
    ),
)

# Mask user/tool turns, train on model turns only
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|turn>user\n",
    response_part="<|turn>model\n",
)
```
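To make the idea of response-only training concrete, the sketch below marks which character spans of an example belong to model turns. It is a conceptual illustration only: real masking operates on token IDs inside the trainer, `response_spans` is a hypothetical helper, and treating the closing `<turn|>` as part of the trained span is an assumption.

```python
RESPONSE_PART = "<|turn>model\n"
TURN_END = "<turn|>"

def response_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) character offsets of every model turn body."""
    spans = []
    start = text.find(RESPONSE_PART)
    while start != -1:
        body_start = start + len(RESPONSE_PART)
        end = text.find(TURN_END, body_start)
        # An unterminated final turn runs to the end of the text.
        end = len(text) if end == -1 else end + len(TURN_END)
        spans.append((body_start, end))
        start = text.find(RESPONSE_PART, end)
    return spans

text = "<|turn>user\nHi<turn|>\n<|turn>model\nHello<turn|>\n"
kept = [text[a:b] for a, b in response_spans(text)]
print(kept)  # ['Hello<turn|>']
```

Everything outside the returned spans (system, user, and tool turns) would contribute no loss under this scheme.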
## Data Sources
- **Reasoning**: Filtered from Opus/Qwen reasoning datasets (coding, DevOps, kubectl, git focus)
- **Tool-calling**: Generated with Claude Opus, converted via hybrid Gemma-4 formatter
- **DevOps reasoning**: 30 kubectl/git/infrastructure entries generated with thinking blocks
- **All cub commands verified** by executing them against a real ConfigHub server and a kind cluster
## License
Apache 2.0
Provider: chanwit