odytrice/kenichi-sft
Hugging Face. Updated 2026-03-24; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/odytrice/kenichi-sft
Resource description:
---
language:
- en
license: apache-2.0
task_categories:
- text-generation
tags:
- code
- fsharp
- svelte
- typescript
- dotnet
- docker
- kubernetes
- distillation
- instruction-tuning
pretty_name: Kenichi SFT
size_categories:
- 1K<n<10K
---
# Kenichi SFT -- Multi-Teacher Distilled F# / Full-Stack Coding Dataset
> Named after the anime **"Kenichi: The Mightiest Disciple"** -- a student who trains under multiple masters to become the strongest.
A domain-specialized instruction-tuning dataset distilled from three teacher models for training F#-focused coding LLMs. All F# samples are compiler-verified.
## Student Models (Intended Use)
| Model | Base | Role | Context |
|-------|------|------|---------|
| **Kenichi Thinking** | Qwen3.5-27B | Reasoning-first, `<think>` mode | 256K |
| **Kenichi Flash** | Devstral Small 2 (24B) | Fast agentic coding | 128K |
Both models target local inference on 32 GB of VRAM.
## Dataset Stats
| Metric | Value |
|--------|-------|
| Total samples | 7,953 |
| Train split | 7,556 |
| Validation split | 397 |
| Formats | ChatML (Qwen) + Mistral Instruct (Devstral) |
| F# compiler verified | Yes (all F# samples) |
| Generation rounds | 3 + substitute + instruction fixes |
## Domain Distribution
| Domain | Samples | % | Description |
|--------|---------|---|-------------|
| fsharp_libraries | 2,096 | 26.4% | Giraffe, FsToolkit, Akka.NET, linq2db, Thoth.Json, FsCheck, Expecto, Argu, Dapper.FSharp, Farmer, Bolero, FSharpPlus, FParsec, and 20+ libraries |
| fsharp_core | 1,817 | 22.8% | DUs, pattern matching, CEs, SRTP, agents, type providers, signature files, object expressions, CQRS, FParsec, Ports & Adapters |
| general_coding | 950 | 11.9% | Algorithms, data structures, design patterns (450 distilled + 500 OpenCodeInstruct) |
| dotnet_aspnet | 844 | 10.6% | ASP.NET Core with F#, DI, middleware, auth, health checks, gRPC, .NET Aspire, Redis caching, Polly resilience, API versioning |
| svelte_typescript | 676 | 8.5% | Svelte 5 runes, SvelteKit 2, TypeScript patterns |
| cross_domain | 610 | 7.7% | Full-stack F# + Svelte + Docker integration |
| docker_kubernetes | 414 | 5.2% | Dockerfiles, K8s manifests, Helm, CI/CD |
| agentic_swe | 279 | 3.5% | Multi-step debugging, refactoring, migration tasks |
| long_context | 267 | 3.4% | Full project walkthroughs, multi-file implementations |
F# total (core + libraries): 3,913 samples (49.2%) -- intentionally high given F#'s scarcity in pre-training data (<0.1% of The Stack v2).
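The shares in the table follow directly from the sample counts. As a quick sanity check, this sketch recomputes the total and the combined F# share using only the counts listed above:

```python
# Sample counts copied from the domain distribution table.
DOMAIN_COUNTS = {
    "fsharp_libraries": 2096,
    "fsharp_core": 1817,
    "general_coding": 950,
    "dotnet_aspnet": 844,
    "svelte_typescript": 676,
    "cross_domain": 610,
    "docker_kubernetes": 414,
    "agentic_swe": 279,
    "long_context": 267,
}

total = sum(DOMAIN_COUNTS.values())


def share(count: int) -> float:
    """Percentage of the full dataset, rounded to one decimal place."""
    return round(100 * count / total, 1)


fsharp_total = DOMAIN_COUNTS["fsharp_core"] + DOMAIN_COUNTS["fsharp_libraries"]

print(total)                # 7953
print(fsharp_total)         # 3913
print(share(fsharp_total))  # 49.2
```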
## Teacher Models
Teacher assignments were empirically determined through head-to-head F# compiler verification benchmarks on 549 prompts across 4 teacher models.
| Teacher | Params | Domains | F# Pass Rate | Role |
|---------|--------|---------|-------------|------|
| **MiniMax M2.7** | 229B MoE | F# core, F# libraries | 76.6% | F# specialist |
| **GLM-5** | 744B MoE | .NET/ASP.NET, Docker/K8s, agentic, general | 97.1% (dotnet) | .NET/general powerhouse |
| **Kimi K2.5** | -- | Svelte/TypeScript, cross-domain, long-context | 90.0% (cross-domain) | Frontend/long-context specialist |
| *DeepSeek V3.2* | 685B MoE | *(Round 1 only, replaced after benchmarking)* | 43.1% | *Retired* |
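The assignments in the table amount to a lookup from domain to teacher. A minimal sketch of that routing follows. The domain keys are the dataset's own `domain` values and `"minimax"` matches the `teacher` field in the records, while `"glm"`, `"kimi"`, and the `pick_teacher` helper are hypothetical names chosen for illustration.

```python
# Domain -> teacher routing taken from the teacher table.
# "minimax" appears in the dataset records; "glm" and "kimi" are
# hypothetical identifiers used only in this sketch.
TEACHER_BY_DOMAIN = {
    "fsharp_core": "minimax",
    "fsharp_libraries": "minimax",
    "dotnet_aspnet": "glm",
    "docker_kubernetes": "glm",
    "agentic_swe": "glm",
    "general_coding": "glm",
    "svelte_typescript": "kimi",
    "cross_domain": "kimi",
    "long_context": "kimi",
}


def pick_teacher(domain: str) -> str:
    """Return the teacher assigned to a domain, failing loudly on unknown ones."""
    try:
        return TEACHER_BY_DOMAIN[domain]
    except KeyError:
        raise ValueError(f"no teacher assigned for domain {domain!r}") from None


print(pick_teacher("fsharp_core"))  # minimax
```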
### Why MiniMax for F#?
Benchmark results on the 427 F# prompts that DeepSeek V3.2 failed:
| Teacher | Passed | Pass Rate | Skip Rate |
|---------|--------|-----------|-----------|
| MiniMax M2.7 | 327/427 | **76.6%** | 0.7% |
| GLM-5 | 149/427 | 70.6% | 3.5% |
| Kimi K2.5 | 149/427 | 34.9% | 41.2% |
MiniMax's near-zero skip rate (almost always generates code) combined with the highest pass rate made it the clear choice for F# domains.
## Data Generation Pipeline
```
185 seed prompts (9 domains)
|
v
expand_prompts.py (30 variations per seed via teachers)
|
v
~5,169 unique expanded prompts
|
v
generate_data.py (2 rounds: default temp + temp 0.9)
|
v
~9,138 raw responses (4,569 per round)
|
v
verify_fsharp.py (F# compiler verification)
|
v
format_dataset.py (ChatML + Mistral formats)
|
v
7,953 verified training samples
```
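The card does not say how `expand_prompts.py` reduces the expanded variations to unique prompts. One plausible approach, shown here purely as an assumption, is to normalize case and whitespace and keep the first occurrence of each prompt. The `normalize` and `unique_prompts` helpers are hypothetical.

```python
import re


def normalize(prompt: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", prompt).strip().lower()


def unique_prompts(prompts: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized prompt, preserving order."""
    seen: set[str] = set()
    kept: list[str] = []
    for prompt in prompts:
        key = normalize(prompt)
        if key and key not in seen:
            seen.add(key)
            kept.append(prompt)
    return kept


variations = [
    "Write an F# discriminated union for shapes.",
    "write an F#  discriminated union for shapes.",
    "Model shapes with an F# record hierarchy.",
]
print(len(unique_prompts(variations)))  # 2
```

Exact-match deduplication like this only catches near-identical strings; semantically similar prompts would need a fuzzier comparison.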
### Five Generation Rounds
- **Round 1**: Default teacher temperatures (0.4-0.7), original teacher assignments
- **Round 2**: Temperature 0.9, optimized teacher assignments based on benchmark results
- **Round 3**: 20 curriculum gap-fill seeds (SRTP, FsCheck, Expecto, gRPC, .NET Aspire, Argu, Redis, Polly, outbox pattern, RabbitMQ, ETL, Bolero, object expressions, signature files, CQRS, FParsec, Ports & Adapters, API versioning, Dapper.FSharp, Farmer)
- **Substitute round**: Re-runs of failing F# prompts through alternate teachers (1,465 generated, 912 passed)
- **Instruction fix round**: 20 prompts with patched instructions re-generated through MiniMax
Running the same prompts at different temperatures with different teachers produces structurally diverse solutions to the same problems, improving student generalization.
### F# Compiler Verification
All samples containing F# code are verified through a two-stage pipeline:
1. **Compile check**: Code is extracted from teacher responses and compiled via `dotnet fsi` (scripts) or `dotnet build` (namespace/module code)
2. **Execution check**: Samples with test assertions are executed to verify runtime correctness
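For script-style samples, the check can be approximated by writing the snippet to an `.fsx` file and running it with `dotnet fsi`. This is a minimal sketch, not the project's `verify_fsharp.py`. The `compile_check` name, the temporary-file layout, and the timeout are assumptions, and the `run` parameter exists only so the function can be exercised without the .NET SDK installed.

```python
import subprocess
import tempfile
from pathlib import Path


def compile_check(code: str, run=subprocess.run, timeout: int = 120) -> bool:
    """Run an F# snippet as a script with `dotnet fsi`.

    Returns True when the process exits with code 0. Because `dotnet fsi`
    also executes the script, failing assertions in the snippet surface
    as a non-zero exit code as well.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "sample.fsx"
        script.write_text(code, encoding="utf-8")
        result = run(
            ["dotnet", "fsi", str(script)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    return result.returncode == 0
```

Code that declares a namespace cannot be run this way and would go through a project build instead.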
The verification project includes 45+ NuGet packages: Giraffe, FsToolkit.ErrorHandling, Akka.NET, linq2db, Serilog, Thoth.Json.Net, FSharp.SystemTextJson, FSharp.Control.AsyncSeq, FSharpPlus, MathNet.Numerics, FSharp.Text.RegexProvider, and more.
Samples that fail compilation are excluded from the dataset. Key verification improvements developed during the project:
- Truncated response extraction (handles unclosed code fences from max_token cutoffs)
- Namespace/module routing (routes `namespace X` code through project build instead of .fsx)
- Multi-block conflict resolution (uses largest block when multiple blocks have conflicting declarations)
- NuGet indicator matching (broad pattern matching routes library code to project build with NuGet packages)
- Targeted code fixes (15 samples with minor syntax errors fixed and re-verified)
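Three of these heuristics are easy to illustrate: extracting code from truncated responses, falling back to the largest block, and routing `namespace` code to a project build. The following is a sketch under stated assumptions, not the project's implementation. All function names are hypothetical, and the real pipeline presumably applies more rules, such as the NuGet indicator matching.

```python
import re

FENCE = "`" * 3  # built programmatically to avoid a literal fence in this snippet


def extract_blocks(response: str) -> list[str]:
    """Return fenced code bodies, including a trailing block that was never closed."""
    blocks, current, inside = [], [], False
    for line in response.splitlines():
        if line.strip().startswith(FENCE):
            if inside:
                blocks.append("\n".join(current))
                current = []
            inside = not inside
        elif inside:
            current.append(line)
    if inside and current:  # truncated response: keep the unclosed block
        blocks.append("\n".join(current))
    return blocks


def pick_block(blocks: list[str]) -> str:
    """Fall back to the largest block when several blocks conflict."""
    return max(blocks, key=len)


def needs_project_build(code: str) -> bool:
    """Route code containing a `namespace` declaration to a project build."""
    return re.search(r"^\s*namespace\s+\S+", code, flags=re.MULTILINE) is not None
```

`pick_block` raises `ValueError` on an empty list, so a caller would treat a response with no code blocks as a skip rather than a failure.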
### Supplemental Data
500 samples from [NVIDIA OpenCodeInstruct](https://huggingface.co/datasets/nvidia/OpenCodeInstruct) (5M Python coding samples), filtered with strict quality thresholds:
- Unit test pass rate >= 0.9 (passes 9+ of 10 tests)
- LLM judgement scores >= 4/5 on requirement conformance, logical correctness, and edge case consideration
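Expressed as a predicate, the two thresholds look roughly like this. The record layout is an assumption made for illustration: `average_test_score` is taken to be a number in [0, 1] and `llm_judgement` a mapping from criterion name to a 1-5 score. The actual OpenCodeInstruct schema may store these differently, for example as strings or nested JSON.

```python
MIN_TEST_PASS_RATE = 0.9
MIN_JUDGE_SCORE = 4
JUDGE_CRITERIA = (
    "requirement_conformance",
    "logical_correctness",
    "edge_case_consideration",
)


def passes_quality_bar(sample: dict) -> bool:
    """Apply both thresholds; a missing score counts as a failure."""
    if float(sample.get("average_test_score", 0.0)) < MIN_TEST_PASS_RATE:
        return False
    scores = sample.get("llm_judgement", {})
    return all(scores.get(name, 0) >= MIN_JUDGE_SCORE for name in JUDGE_CRITERIA)


# Hypothetical record in the assumed layout.
example = {
    "average_test_score": 0.9,
    "llm_judgement": {name: 4 for name in JUDGE_CRITERIA},
}
print(passes_quality_bar(example))  # True
```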
## Formats
Two formats are provided for different model families:
### ChatML (for Qwen3.5 / Kenichi Thinking)
Split names: `chatml_train`, `chatml_val`
```json
{
"messages": [
{"role": "user", "content": "Write an F# discriminated union..."},
{"role": "assistant", "content": "Here's an F# DU for..."}
],
"id": "fsharp_core_0001_exp_003",
"domain": "fsharp_core",
"teacher": "minimax"
}
```
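To show what a record of this shape becomes as ChatML text, the sketch below renders `messages` with the standard `<|im_start|>` / `<|im_end|>` delimiters. It is a simplified illustration with a hypothetical `to_chatml` helper. In practice the tokenizer's own chat template should do this, and a model-specific template may add a system prompt or reasoning tags that this version omits.

```python
def to_chatml(messages: list[dict]) -> str:
    """Render messages with ChatML's <|im_start|> / <|im_end|> delimiters."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


record = {
    "messages": [
        {"role": "user", "content": "Write an F# discriminated union..."},
        {"role": "assistant", "content": "Here's an F# DU for..."},
    ]
}
print(to_chatml(record["messages"]))
```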
### Mistral Instruct (for Devstral Small 2 / Kenichi Flash)
Split names: `mistral_train`, `mistral_val`
Same `messages` structure -- the Mistral special tokens (`[INST]`, `[/INST]`) are applied at training time by the tokenizer when `chat_template="mistral"` is set.
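For intuition, an approximate rendering of alternating user/assistant turns into the `[INST]` format is sketched here. The `to_mistral_instruct` helper is hypothetical, and the exact whitespace and BOS/EOS handling differ between Mistral tokenizer versions, so treat this as an illustration and rely on the tokenizer's chat template for real training.

```python
def to_mistral_instruct(messages: list[dict]) -> str:
    """Approximate [INST] rendering for alternating user/assistant turns."""
    parts = ["<s>"]
    for m in messages:
        if m["role"] == "user":
            parts.append(f"[INST] {m['content']} [/INST]")
        elif m["role"] == "assistant":
            parts.append(f" {m['content']}</s>")
        else:
            raise ValueError(f"unsupported role: {m['role']!r}")
    return "".join(parts)


print(to_mistral_instruct([
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
]))
# <s>[INST] Hi [/INST] Hello</s>
```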
## Usage
```python
from datasets import load_dataset

# For Qwen3.5 / ChatML models
chatml_ds = load_dataset("odytrice/kenichi-sft", split="chatml_train")

# For Devstral / Mistral models
mistral_ds = load_dataset("odytrice/kenichi-sft", split="mistral_train")

# Filter by domain (matches fsharp_core and fsharp_libraries)
fsharp_ds = chatml_ds.filter(lambda x: x["domain"].startswith("fsharp"))

# Filter by teacher
minimax_ds = chatml_ds.filter(lambda x: x["teacher"] == "minimax")
```
## License
Apache 2.0
## Acknowledgments
- **Teacher models**: MiniMax M2.7, GLM-5, Kimi K2.5, DeepSeek V3.2
- **Supplemental data**: NVIDIA OpenCodeInstruct (CC-BY-4.0)
- **Infrastructure**: Ollama Max subscription for teacher inference
- **F# verification**: .NET SDK with 30+ NuGet packages
Provider:
odytrice



