
odytrice/kenichi-sft

Source: Hugging Face | Updated: 2026-03-24 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/odytrice/kenichi-sft
Description:
---
language:
- en
license: apache-2.0
task_categories:
- text-generation
tags:
- code
- fsharp
- svelte
- typescript
- dotnet
- docker
- kubernetes
- distillation
- instruction-tuning
pretty_name: Kenichi SFT
size_categories:
- 1K<n<10K
---

# Kenichi SFT -- Multi-Teacher Distilled F# / Full-Stack Coding Dataset

> Named after the anime **"Kenichi: The Mightiest Disciple"** -- a student who trains under multiple masters to become the strongest.

A domain-specialized instruction-tuning dataset distilled from three teacher models for training F#-focused coding LLMs. All F# samples are compiler-verified.

## Student Models (Intended Use)

| Model | Base | Role | Context |
|-------|------|------|---------|
| **Kenichi Thinking** | Qwen3.5-27B | Reasoning-first, `<think>` mode | 256K |
| **Kenichi Flash** | Devstral Small 2 (24B) | Fast agentic coding | 128K |

Both target local inference on 32GB VRAM.

## Dataset Stats

| Metric | Value |
|--------|-------|
| Total samples | 7,953 |
| Train split | 7,556 |
| Validation split | 397 |
| Formats | ChatML (Qwen) + Mistral Instruct (Devstral) |
| F# compiler verified | Yes (all F# samples) |
| Generation rounds | 3 + substitute + instruction fixes |

## Domain Distribution

| Domain | Samples | % | Description |
|--------|---------|---|-------------|
| fsharp_libraries | 2,096 | 26.4% | Giraffe, FsToolkit, Akka.NET, linq2db, Thoth.Json, FsCheck, Expecto, Argu, Dapper.FSharp, Farmer, Bolero, FSharpPlus, FParsec, and 20+ libraries |
| fsharp_core | 1,817 | 22.8% | DUs, pattern matching, CEs, SRTP, agents, type providers, signature files, object expressions, CQRS, FParsec, Ports & Adapters |
| general_coding | 950 | 11.9% | Algorithms, data structures, design patterns (450 distilled + 500 OpenCodeInstruct) |
| dotnet_aspnet | 844 | 10.6% | ASP.NET Core with F#, DI, middleware, auth, health checks, gRPC, .NET Aspire, Redis caching, Polly resilience, API versioning |
| svelte_typescript | 676 | 8.5% | Svelte 5 runes, SvelteKit 2, TypeScript patterns |
| cross_domain | 610 | 7.7% | Full-stack F# + Svelte + Docker integration |
| docker_kubernetes | 414 | 5.2% | Dockerfiles, K8s manifests, Helm, CI/CD |
| agentic_swe | 279 | 3.5% | Multi-step debugging, refactoring, migration tasks |
| long_context | 267 | 3.4% | Full project walkthroughs, multi-file implementations |

F# total (core + libraries): 3,913 samples (49.2%) -- intentionally high given F#'s scarcity in pre-training data (<0.1% of The Stack v2).

## Teacher Models

Teacher assignments were empirically determined through head-to-head F# compiler verification benchmarks on 549 prompts across 4 teacher models.

| Teacher | Params | Domains | F# Pass Rate | Role |
|---------|--------|---------|-------------|------|
| **MiniMax M2.7** | 229B MoE | F# core, F# libraries | 76.6% | F# specialist |
| **GLM-5** | 744B MoE | .NET/ASP.NET, Docker/K8s, agentic, general | 97.1% (dotnet) | .NET/general powerhouse |
| **Kimi K2.5** | -- | Svelte/TypeScript, cross-domain, long-context | 90.0% (cross-domain) | Frontend/long-context specialist |
| *DeepSeek V3.2* | 685B MoE | *(Round 1 only, replaced after benchmarking)* | 43.1% | *Retired* |

### Why MiniMax for F#?

Benchmark results on 427 F# prompts that DeepSeek failed on:

| Teacher | Passed | Pass Rate | Skip Rate |
|---------|--------|-----------|-----------|
| MiniMax M2.7 | 327/427 | **76.6%** | 0.7% |
| GLM-5 | 149/427 | 70.6% | 3.5% |
| Kimi K2.5 | 149/427 | 34.9% | 41.2% |

MiniMax's near-zero skip rate (almost always generates code) combined with the highest pass rate made it the clear choice for F# domains.

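The selection argument above can be sketched in a few lines of Python. This is only an illustration: the figures are the pass and skip rates reported in the benchmark table, while `pick_teacher` and its ranking rule (highest pass rate, lower skip rate as tie-breaker) are hypothetical and not taken from the dataset's tooling.

```python
# Reported benchmark figures: teacher -> (pass rate %, skip rate %).
BENCHMARK = {
    "MiniMax M2.7": (76.6, 0.7),
    "GLM-5": (70.6, 3.5),
    "Kimi K2.5": (34.9, 41.2),
}


def pick_teacher(results):
    """Return the teacher with the highest pass rate.

    Hypothetical rule: ties on pass rate are broken by the lower skip rate.
    """
    return max(results, key=lambda name: (results[name][0], -results[name][1]))


if __name__ == "__main__":
    print(pick_teacher(BENCHMARK))
```

Under this rule MiniMax M2.7 ranks first on both criteria, so the tie-breaker never comes into play for the reported numbers.
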
## Data Generation Pipeline

```
185 seed prompts (9 domains)
        |
        v
expand_prompts.py (30 variations per seed via teachers)
        |
        v
~5,169 unique expanded prompts
        |
        v
generate_data.py (2 rounds: default temp + temp 0.9)
        |
        v
~9,138 raw responses (4,569 per round)
        |
        v
verify_fsharp.py (F# compiler verification)
        |
        v
format_dataset.py (ChatML + Mistral formats)
        |
        v
7,953 verified training samples
```

### Five Generation Rounds

- **Round 1**: Default teacher temperatures (0.4-0.7), original teacher assignments
- **Round 2**: Temperature 0.9, optimized teacher assignments based on benchmark results
- **Round 3**: 20 curriculum gap-fill seeds (SRTP, FsCheck, Expecto, gRPC, .NET Aspire, Argu, Redis, Polly, outbox pattern, RabbitMQ, ETL, Bolero, object expressions, signature files, CQRS, FParsec, Ports & Adapters, API versioning, Dapper.FSharp, Farmer)
- **Substitute round**: Re-runs of failing F# prompts through alternate teachers (1,465 generated, 912 passed)
- **Instruction fix round**: 20 prompts with patched instructions re-generated through MiniMax

Running the same prompts at different temperatures with different teachers produces structurally diverse solutions to the same problems, improving student generalization.

### F# Compiler Verification

All samples containing F# code are verified through a two-stage pipeline:

1. **Compile check**: Code extracted from teacher responses and compiled via `dotnet fsi` (scripts) or `dotnet build` (namespace/module code)
2. **Execution check**: Samples with test assertions are executed to verify runtime correctness

The verification project includes 45+ NuGet packages: Giraffe, FsToolkit.ErrorHandling, Akka.NET, linq2db, Serilog, Thoth.Json.Net, FSharp.SystemTextJson, FSharp.Control.AsyncSeq, FSharpPlus, MathNet.Numerics, FSharp.Text.RegexProvider, and more. Samples that fail compilation are excluded from the dataset.

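A compile check of this kind might look roughly like the sketch below. It is not the project's `verify_fsharp.py`: the function names, the fence regex, and the routing heuristic are assumptions made for illustration. Only the script path through `dotnet fsi` is shown, and it needs the .NET SDK on `PATH`.

```python
import re
import subprocess
import tempfile
from pathlib import Path

# Matches a fenced F# block. The closing fence is optional, so a response
# that was cut off mid-block still yields its code (assumed behaviour).
FENCE = re.compile(
    r"```(?:fsharp|f#|fs)[^\n]*\n(.*?)(?:```|\Z)",
    re.DOTALL | re.IGNORECASE,
)


def extract_fsharp(response: str) -> list[str]:
    """Return the non-empty F# code blocks found in a teacher response."""
    return [block.strip() for block in FENCE.findall(response) if block.strip()]


def needs_project_build(code: str) -> bool:
    """Heuristic: a top-level `namespace X` or `module X` line (no `=`)
    is routed to a project build instead of a script run."""
    pattern = r"^\s*(namespace|module)\s+[\w.]+\s*$"
    return re.search(pattern, code, re.MULTILINE) is not None


def compile_check(code: str, timeout: int = 120) -> bool:
    """Write the code to a temporary .fsx file and run it with `dotnet fsi`.

    Returns True when the process exits with code 0.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "sample.fsx"
        script.write_text(code, encoding="utf-8")
        result = subprocess.run(
            ["dotnet", "fsi", str(script)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return result.returncode == 0
```

A caller would run `extract_fsharp` on each response, send blocks for which `needs_project_build` is false to `compile_check`, and hand the rest to a `dotnet build` step that is not shown here.
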

Key verification improvements developed during the project:

- Truncated response extraction (handles unclosed code fences from max_token cutoffs)
- Namespace/module routing (routes `namespace X` code through project build instead of .fsx)
- Multi-block conflict resolution (uses largest block when multiple blocks have conflicting declarations)
- NuGet indicator matching (broad pattern matching routes library code to project build with NuGet packages)
- Targeted code fixes (15 samples with minor syntax errors fixed and re-verified)

### Supplemental Data

500 samples from [NVIDIA OpenCodeInstruct](https://huggingface.co/datasets/nvidia/OpenCodeInstruct) (5M Python coding samples), filtered with strict quality thresholds:

- Unit test pass rate >= 0.9 (passes 9+ of 10 tests)
- LLM judgement scores >= 4/5 on requirement conformance, logical correctness, and edge case consideration

## Formats

Two formats are provided for different model families:

### ChatML (for Qwen3.5 / Kenichi Thinking)

Split names: `chatml_train`, `chatml_val`

```json
{
  "messages": [
    {"role": "user", "content": "Write an F# discriminated union..."},
    {"role": "assistant", "content": "Here's an F# DU for..."}
  ],
  "id": "fsharp_core_0001_exp_003",
  "domain": "fsharp_core",
  "teacher": "minimax"
}
```

### Mistral Instruct (for Devstral Small 2 / Kenichi Flash)

Split names: `mistral_train`, `mistral_val`

Same `messages` structure -- the Mistral special tokens (`[INST]`, `[/INST]`) are applied at training time by the tokenizer when `chat_template="mistral"` is set.

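To make the difference between the two targets concrete, the sketch below renders one `messages` list both ways. It is a simplified illustration, not the tokenizer's actual template: `to_chatml` and `to_mistral` are hypothetical helpers, and exact whitespace and BOS/EOS handling vary between Mistral tokenizer versions. For real training, the tokenizer's own chat template remains the source of truth.

```python
def to_chatml(messages):
    """Render messages in ChatML: each turn is wrapped in
    <|im_start|>role ... <|im_end|> markers."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


def to_mistral(messages):
    """Simplified Mistral-style rendering (assumed layout): user turns go
    inside [INST] ... [/INST], assistant turns are closed with </s>."""
    parts = ["<s>"]
    for m in messages:
        if m["role"] == "user":
            parts.append(f"[INST] {m['content']} [/INST]")
        elif m["role"] == "assistant":
            parts.append(f" {m['content']}</s>")
    return "".join(parts)


if __name__ == "__main__":
    sample = [
        {"role": "user", "content": "Write an F# discriminated union..."},
        {"role": "assistant", "content": "Here's an F# DU for..."},
    ]
    print(to_chatml(sample))
    print(to_mistral(sample))
```

Because both formats share the same `messages` structure, only the rendering step differs between the two student models.
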
## Usage

```python
from datasets import load_dataset

# For Qwen3.5 / ChatML models
chatml_ds = load_dataset("odytrice/kenichi-sft", split="chatml_train")

# For Devstral / Mistral models
mistral_ds = load_dataset("odytrice/kenichi-sft", split="mistral_train")

# Filter by domain (matches both fsharp_core and fsharp_libraries)
fsharp_ds = chatml_ds.filter(lambda x: x["domain"].startswith("fsharp"))

# Filter by teacher
minimax_ds = chatml_ds.filter(lambda x: x["teacher"] == "minimax")
```

## License

Apache 2.0

## Acknowledgments

- **Teacher models**: MiniMax M2.7, GLM-5, Kimi K2.5, DeepSeek V3.2
- **Supplemental data**: NVIDIA OpenCodeInstruct (CC-BY-4.0)
- **Infrastructure**: Ollama Max subscription for teacher inference
- **F# verification**: .NET SDK with 30+ NuGet packages

Provider:
odytrice