Crownelius/High-Coder-SFT-Medium

Name: Crownelius/High-Coder-SFT-Medium
Creator: Crownelius
Published: 2026-03-16 00:36:36
License: 暂无描述

Hugging Face2026-03-16 更新2026-03-29 收录

下载链接：

https://hf-mirror.com/datasets/Crownelius/High-Coder-SFT-Medium

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: mit task_categories: - text-generation language: - en tags: - code - synthetic - high-quality - multi-language - sft - long-form - production-code - hunter-alpha size_categories: - 100K<n<1M --- [<img src="https://huggingface.co/crownelius/Crow-9B-Opus-4.6-Distill-Heretic_Qwen3.5/resolve/main/banner.png" width="350"/>](https://ko-fi.com/abcuo) # High-Coder-SFT-Medium A high-quality synthetic code dataset containing **124,818 long-form code samples** across 8 programming languages. Generated using **Hunter Alpha** (1T+ parameter frontier model). Every single sample contains at least 200 lines of actual code — most contain 500+. This is not a snippet dataset. Every file is a complete, production-quality source file with imports, error handling, design patterns, and modern language idioms. The average sample is **630 lines of code** and **830 total lines**. ## Why This Dataset? Most code datasets fall into one of two traps: they're either huge collections of short snippets (50-100 lines) scraped from GitHub, or they're small hand-curated sets that lack diversity. **High-Coder-SFT-Medium solves both problems.** ### vs. GitHub-scraped datasets (StarCoder, The Stack, etc.) - **No license ambiguity.** Every sample is synthetically generated under MIT — no GPL contamination, no attribution chains, no legal gray areas. - **Consistently long.** 99.5% of samples exceed 300 LOC. GitHub scrapes are dominated by short utility files, configs, and boilerplate. - **No junk.** No auto-generated files, no minified bundles, no lock files, no vendored dependencies. Every sample is intentional, meaningful code. - **Modern idioms only.** No legacy patterns, no deprecated APIs, no Python 2 or jQuery. Every sample uses current best practices (C++20, PHP 8.3, ES2024+, etc.). ### vs. Other synthetic code datasets - **Actually long.** Average 630 LOC vs. typical synthetic datasets that average 50-150 lines. - **Architectural diversity.** 160+ domain templates across 8 languages combined with 80+ design patterns, plus cross-language project scenarios. Not just "write a function that..." - **Production structure.** Complete files with imports, namespaces, error handling, configuration, and proper module boundaries — not isolated functions. - **Balanced distribution.** Near-equal representation across all 8 languages, not 80% Python. ### vs. Instruction-tuning code datasets (CodeAlpaca, etc.) - **Not Q&A format.** Raw source files, not "explain this code" or "write a function that adds two numbers." Models trained on this learn to write real software, not answer homework. - **Scale.** 124,818 samples at 630 avg LOC = **78.7 million lines of code**. That's a serious training signal. ## Dataset Summary | | | |---|---| | **Total Samples** | 124,818 | | **Total Lines of Code** | 78,716,115 | | **Total Lines (incl. blanks/comments)** | 103,720,088 | | **Average LOC per sample** | 630 | | **Average total lines per sample** | 830 | | **Min LOC** | 200 | | **Max LOC** | 2,117 | | **Model** | openrouter/hunter-alpha (1T+ params) | | **Format** | JSONL | | **Size on disk** | 3.3 GB | | **License** | MIT | | **Deduplicated** | Yes (SHA-256) | ## Language Distribution | Language | Samples | Avg LOC | Avg Total Lines | |----------|---------|---------|-----------------| | C# | 16,927 | 654 | 845 | | C++ | 16,699 | 588 | 791 | | JavaScript | 16,379 | 671 | 945 | | TypeScript | 16,277 | 657 | 824 | | Java | 16,204 | 561 | 723 | | PHP | 14,347 | 757 | 996 | | Go | 14,119 | 562 | 746 | | Rust | 13,866 | 592 | 773 | ## LOC Distribution | LOC Range | Samples | Percentage | |-----------|---------|------------| | 200-300 | 600 | 0.5% | | 300-500 | 29,401 | 23.6% | | 500-1000 | 89,738 | 71.9% | | 1000+ | 5,079 | 4.1% | Over **76%** of samples exceed 500 lines of code. This is not a snippets dataset. ## Domain Coverage Each language has 20+ specialized domain templates covering real-world software: - **Web & APIs:** REST controllers, GraphQL servers, gRPC services, middleware pipelines, WebSocket handlers - **Data & Storage:** ORM repositories, database engines, cache managers, migration systems, key-value stores - **Concurrency:** Thread pools, lock-free data structures, actor systems, async runtimes, worker pools - **Systems:** Memory allocators, network protocols, file system watchers, container runtimes, DNS resolvers - **Architecture:** Event sourcing, CQRS, saga orchestration, plugin systems, rule engines - **Performance:** SIMD processing, zero-copy serialization, memory pooling, expression templates - **Security:** Authentication, OAuth2/OIDC, JWT, encryption, rate limiting - **Testing:** Test frameworks, property-based testing, fixtures, mocking, snapshot testing - **DevTools:** Build systems, CLI tools, code analyzers, bundler plugins, linters - **Domain-specific:** E-commerce, chat systems, monitoring, IoT, booking, analytics ### Design Patterns Applied Repository, CQRS, Factory, Builder, Observer, Strategy, Decorator, Mediator, Chain of Responsibility, Visitor, Command, State, Proxy, Adapter, Composite, Specification, Unit of Work, RAII, CRTP, Pimpl, Type Erasure, Policy-Based Design, Typestate, Functional Options, Middleware Chain, Fan-Out/Fan-In, and more. ## Dataset Structure Each sample is a JSON object with full provenance: ```json { "sample_id": "a1b2c3d4e5f67890", "language": "rust", "long_criteria": { "loc": 587, "total_lines": 742, "meets_loc_200": true }, "provenance": { "source_platform": "synthetic", "model": "openrouter/hunter-alpha", "prompt": "Write a complete Rust source file that implements: async HTTP server with tower middleware...", "generated_at": "2026-03-15T00:42:17.123456+00:00" }, "licensing": { "license_spdx": "MIT", "redistribution_allowed": true }, "quality_signals": { "has_imports": true, "synthetic": true }, "dedup": { "sha256": "full_64char_content_hash" }, "tokens": { "input_tokens": 198, "output_tokens": 6421 }, "content": { "text": "use tokio::net::TcpListener;\nuse tower::ServiceBuilder;\n..." } } ``` ## Quality Signals - **69.1%** of samples have detected import statements (remaining 30.9% use language-specific module systems like Go's `package` declarations or C++ headers that start with `#include` after initial comments) - **100%** meet the 200 LOC minimum threshold - **SHA-256 deduplicated** — zero duplicate content - **Markdown-stripped** — no code fences or explanatory text, pure source code - **Retry-enhanced** — samples that initially fell short were regenerated with stronger prompts ## Potential Uses ### Fine-Tuning Code LLMs Train or fine-tune language models to generate long, complete source files rather than short snippets. Models trained on this data learn file-level structure: proper imports, namespace organization, class hierarchies, and module boundaries. ### Multi-Language Code Generation Balanced 8-language distribution means models won't overfit to Python. Ideal for training polyglot code assistants that handle C#, TypeScript, C++, Java, JavaScript, Go, Rust, and PHP equally well. ### Supervised Fine-Tuning (SFT) Each sample includes the generation prompt in `provenance.prompt`, making it directly usable as instruction-response pairs for SFT. The prompts specify language, domain, design pattern, and quality requirements. ### Long-Context Code Training With an average of 830 total lines per sample, this dataset trains models to maintain coherence, consistency, and correctness across long outputs — a critical weakness in most code models. ### Code Architecture Understanding The diversity of design patterns, architectural styles, and domain-specific implementations helps models learn when and how to apply patterns like CQRS, event sourcing, repository pattern, etc. ### Distillation Transfer the coding capabilities of Hunter Alpha (1T+ params) into smaller, deployable models. The structured prompts and high-quality outputs make this ideal distillation data. ### Evaluation & Benchmarking Use as a reference corpus for evaluating code generation quality, completeness, and correctness across languages and complexity levels. ### Curriculum Learning The LOC distribution (200-2,117 lines) provides a natural difficulty gradient for curriculum-based training strategies. ## Generation Pipeline Built with a custom async Python pipeline: 1. **Prompt Generation:** 160+ domain-specific templates composed with design patterns and language-specific extras. 30% of prompts use cross-language project scenarios for additional diversity. 2. **Parallel Generation:** 50 API keys x 10 concurrent requests = 500 parallel generations via OpenRouter. 3. **Quality Filter:** Samples below 200 LOC are retried with a stronger prompt emphasizing length requirements. 4. **Deduplication:** SHA-256 content hashing rejects exact duplicates. 5. **Checkpoint/Resume:** Progress saved every 10 samples for fault tolerance. ## Limitations - All code is synthetically generated and has not been compiled or executed — some samples may contain subtle bugs or type errors. - The model may occasionally produce plausible-looking but incorrect implementations of complex algorithms. - No test coverage data or runtime verification is included. - Biased toward backend/systems code — frontend UI code (HTML/CSS) is underrepresented. - Single-file scope — no multi-file project structures or cross-file dependencies. ## Loading the Dataset ```python from datasets import load_dataset dataset = load_dataset("crownelius/High-Coder-SFT-Medium") # Filter by language rust_samples = [s for s in dataset['train'] if s['language'] == 'rust'] # Filter long samples (1000+ LOC) long_samples = [s for s in dataset['train'] if s['long_criteria']['loc'] >= 1000] # Use as SFT pairs for sample in dataset['train']: prompt = sample['provenance']['prompt'] code = sample['content']['text'] # ... your training loop ``` ## Citation ```bibtex @dataset{high_coder_sft, title={High-Coder-SFT-Medium: 125K Long-Form Code Samples Across 8 Languages}, author={Crownelius}, year={2026}, publisher={Hugging Face}, url={https://huggingface.co/datasets/crownelius/High-Coder-SFT-Medium} } ``` --- ## Stats | Metric | Value | |--------|-------| | Total prompt tokens | 25,018,582 | | Total completion tokens | 808,150,733 | | Total tokens | 833,169,315 | | Total cost | $0.00 (USD) | | Average turns | 1.00 | | Average tool calls | 0.00 | | Average tokens per row | 6,675.07 | *Cost estimated using Hunter Alpha pricing on [OpenRouter](https://openrouter.ai) ($0.0/M input, $0.0/M output) — this model is free*

license: MIT协议任务类别： - 文本生成语言： - 英语标签： - 代码 - 合成数据 - 高质量 - 多语言 - 监督微调（Supervised Fine-Tuning，SFT） - 长格式 - 生产级代码 - Hunter Alpha 样本规模区间： - 10万 < 样本数 < 100万 [![](https://huggingface.co/crownelius/Crow-9B-Opus-4.6-Distill-Heretic_Qwen3.5/resolve/main/banner.png)](https://ko-fi.com/abcuo) # High-Coder-SFT-Medium 这是一份高质量合成代码数据集，包含覆盖8种编程语言的**124818条长格式代码样本**，由**Hunter Alpha**（千亿参数级前沿大模型）生成。所有样本均包含至少200行实际代码，多数样本代码行数超过500行。本数据集绝非代码片段合集。每一个文件均为完整的生产级源文件，包含导入语句、错误处理、设计模式以及符合现代语言规范的惯用写法。单样本平均代码行数为**630行**，总行数（含注释与空行）为**830行**。 ## 数据集研发初衷当前主流代码数据集普遍存在两类缺陷：要么是从GitHub爬取的海量短代码片段（50~100行）合集，要么是多样性不足的小型人工精选数据集。**High-Coder-SFT-Medium 同时解决了上述两类问题。** ### 对比GitHub爬取类数据集（如StarCoder、The Stack等） - **无许可证歧义**：所有样本均基于MIT协议合成生成，不存在GPL协议污染、版权追溯链混乱或法律灰色地带问题。 - **统一长格式**：99.5%的样本代码行数超过300行，而GitHub爬取数据集以短小的工具文件、配置文件与模板代码为主。 - **无冗余内容**：不含自动生成文件、压缩打包文件、锁文件或第三方依赖副本，所有样本均为有明确用途的有效代码。 - **仅使用现代语言规范**：不包含旧式编程模式、废弃API、Python 2或jQuery等过时技术，所有样本均采用当前行业最佳实践（如C++20、PHP 8.3、ES2024+等）。 ### 对比其他合成代码数据集 - **真正的长格式代码**：单样本平均代码行数达630行，远高于多数合成数据集50~150行的平均水平。 - **架构多样性丰富**：覆盖8种语言的160余种领域模板，搭配80余种设计模式，同时包含跨语言项目场景，绝非“编写XX功能函数”这类简单任务。 - **生产级代码结构**：完整的文件结构，包含导入语句、命名空间、错误处理、配置项与规范的模块边界，而非孤立的函数代码。 - **语言分布均衡**：8种编程语言的样本占比近乎均等，不存在Python占比高达80%的倾斜问题。 ### 对比指令微调类代码数据集（如CodeAlpaca等） - **非问答格式**：采用原生源文件格式，而非“解释这段代码”或“编写两数相加函数”这类指令式任务。基于本数据集训练的模型将学会编写完整软件，而非完成作业式问答。 - **训练规模可观**：124818条样本，平均单样本630行代码，总计**7870万行有效代码**，可提供强劲的训练信号。 ## 数据集概览 | 指标 | 数值 | |---|---| | **总样本数** | 124,818 | | **总代码行数** | 78,716,115 | | **总行数（含空行与注释）** | 103,720,088 | | **单样本平均代码行数** | 630 | | **单样本平均总行数** | 830 | | **最小代码行数** | 200 | | **最大代码行数** | 2,117 | | **生成模型** | openrouter/hunter-alpha（千亿参数） | | **数据格式** | JSONL | | **磁盘占用大小** | 3.3 GB | | **许可证** | MIT协议 | | **去重处理** | 是（基于SHA-256哈希） | ## 语言分布情况 | 编程语言 | 样本数量 | 平均代码行数 | 平均总行数 | |----------|---------|---------|-----------------| | C# | 16,927 | 654 | 845 | | C++ | 16,699 | 588 | 791 | | JavaScript | 16,379 | 671 | 945 | | TypeScript | 16,277 | 657 | 824 | | Java | 16,204 | 561 | 723 | | PHP | 14,347 | 757 | 996 | | Go | 14,119 | 562 | 746 | | Rust | 13,866 | 592 | 773 | ## 代码行数分布 | 代码行数区间 | 样本数量 | 占比 | |-----------|---------|------------| | 200~300行 | 600 | 0.5% | | 300~500行 | 29,401 | 23.6% | | 500~1000行 | 89,738 | 71.9% | | 1000行以上 | 5,079 | 4.1% | **超过76%的样本代码行数超过500行**，本数据集绝非代码片段合集。 ## 领域覆盖范围每种编程语言均包含20余种针对真实软件场景的专业化领域模板： - **Web与API开发**：REST控制器、GraphQL服务器、gRPC服务、中间件管道、WebSocket处理器 - **数据与存储**：ORM仓储、数据库引擎、缓存管理器、迁移系统、键值存储 - **并发编程**：线程池、无锁数据结构、Actor系统、异步运行时、工作池 - **系统编程**：内存分配器、网络协议、文件系统监视器、容器运行时、DNS解析器 - **软件架构**：事件溯源、CQRS（命令查询职责分离）、Saga编排、插件系统、规则引擎 - **性能优化**：SIMD处理、零拷贝序列化、内存池、表达式模板 - **安全**：身份认证、OAuth2/OIDC、JWT、加密、限流 - **测试**：测试框架、属性式测试、测试夹具、Mock工具、快照测试 - **开发工具**：构建系统、CLI工具、代码分析器、打包器插件、代码检查工具 - **特定领域应用**：电子商务、聊天系统、监控、物联网、预订系统、数据分析 ### 应用的设计模式仓储模式（Repository）、命令查询职责分离（CQRS）、工厂模式（Factory）、建造者模式（Builder）、观察者模式（Observer）、策略模式（Strategy）、装饰器模式（Decorator）、中介者模式（Mediator）、职责链模式（Chain of Responsibility）、访问者模式（Visitor）、命令模式（Command）、状态模式（State）、代理模式（Proxy）、适配器模式（Adapter）、组合模式（Composite）、规约模式（Specification）、工作单元模式（Unit of Work）、资源获取即初始化（RAII）、奇异递归模板模式（CRTP）、指针实现惯用法（Pimpl）、类型擦除（Type Erasure）、策略式设计（Policy-Based Design）、状态机类型（Typestate）、函数式选项模式（Functional Options）、中间件链（Middleware Chain）、扇入扇出模式（Fan-Out/Fan-In）等。 ## 数据集结构每条样本为包含完整溯源信息的JSON对象： json { "sample_id": "a1b2c3d4e5f67890", "language": "rust", "long_criteria": { "loc": 587, "total_lines": 742, "meets_loc_200": true }, "provenance": { "source_platform": "synthetic", "model": "openrouter/hunter-alpha", "prompt": "Write a complete Rust source file that implements: async HTTP server with tower middleware...", "generated_at": "2026-03-15T00:42:17.123456+00:00" }, "licensing": { "license_spdx": "MIT", "redistribution_allowed": true }, "quality_signals": { "has_imports": true, "synthetic": true }, "dedup": { "sha256": "full_64char_content_hash" }, "tokens": { "input_tokens": 198, "output_tokens": 6421 }, "content": { "text": "use tokio::net::TcpListener; use tower::ServiceBuilder; ..." } } ## 质量校验标准 - **69.1%的样本包含导入语句**（剩余30.9%的样本采用对应语言特有的模块系统，例如Go的`package`声明或C++初始注释后的`#include`头文件） - **100%的样本满足200行代码的最低要求** - **基于SHA-256哈希去重** — 无重复内容 - **已剥离Markdown格式** — 不含代码块标记或说明文本，仅保留纯源代码 - **重试增强机制** — 初始未达到代码行数要求的样本将使用强化了长度要求的提示词重新生成 ## 潜在应用场景 ### 代码大语言模型微调训练或微调大语言模型，使其能够生成长格式的完整源文件而非短代码片段。基于本数据集训练的模型将学会文件级代码结构：规范的导入语句、命名空间组织、类层级结构与模块边界。 ### 多语言代码生成 8种语言的均衡分布可避免模型过度拟合Python语言，非常适合训练可同时处理C#、TypeScript、C++、Java、JavaScript、Go、Rust与PHP的多语言代码助手。 ### 监督微调（Supervised Fine-Tuning，SFT）每条样本的`provenance.prompt`字段包含生成提示词，可直接作为指令-响应对用于监督微调。提示词明确指定了编程语言、领域、设计模式与质量要求。 ### 长上下文代码训练单样本平均830行的总长度，可用于训练模型在长输出中保持连贯性、一致性与正确性，这正是多数现有代码模型的关键短板。 ### 代码架构认知能力培养丰富的设计模式、架构风格与领域特定实现，可帮助模型学习何时以及如何应用CQRS、事件溯源、仓储模式等架构模式。 ### 模型蒸馏将Hunter Alpha（千亿参数模型）的编码能力迁移到更小的可部署模型中，结构化的提示词与高质量输出使其成为理想的蒸馏训练数据。 ### 模型评估与基准测试可作为参考语料库，用于评估不同语言与复杂度等级下的代码生成质量、完整性与正确性。 ### 课程式学习 200~2117行的代码行数分布提供了自然的难度梯度，适合基于课程式学习的训练策略。 ## 生成流水线采用自定义异步Python流水线构建： 1. **提示词生成**：基于160余种领域特定模板，结合设计模式与语言特定要求生成提示词，其中30%的提示词采用跨语言项目场景以提升多样性。 2. **并行生成**：通过OpenRouter平台，使用50个API密钥并行发起10个请求，总计500个并行生成任务。 3. **质量过滤**：代码行数不足200行的样本将使用强化长度要求的提示词重新生成。 4. **去重处理**：基于SHA-256内容哈希排除完全重复的样本。 5. **断点续存**：每生成10个样本保存一次进度，以支持故障容错。 ## 数据集局限性 - 所有代码均为合成生成，未经过编译或运行测试，部分样本可能包含细微的bug或类型错误。 - 模型偶尔可能生成看似合理但实际错误的复杂算法实现。 - 未包含测试覆盖率数据或运行时验证信息。 - 数据集偏向后端与系统编程代码，前端UI代码（HTML/CSS）占比不足。 - 仅包含单文件代码，未涉及多文件项目结构或跨文件依赖。 ## 数据集加载方法 python from datasets import load_dataset dataset = load_dataset("crownelius/High-Coder-SFT-Medium") # Filter by language rust_samples = [s for s in dataset['train'] if s['language'] == 'rust'] # Filter long samples (1000+ LOC) long_samples = [s for s in dataset['train'] if s['long_criteria']['loc'] >= 1000] # Use as SFT pairs for sample in dataset['train']: prompt = sample['provenance']['prompt'] code = sample['content']['text'] # ... your training loop ## 引用格式 bibtex @dataset{high_coder_sft, title={High-Coder-SFT-Medium: 125K Long-Form Code Samples Across 8 Languages}, author={Crownelius}, year={2026}, publisher={Hugging Face}, url={https://huggingface.co/datasets/crownelius/High-Coder-SFT-Medium} } --- ## 统计指标 | 指标 | 数值 | |--------|-------| | **总提示词Token数** | 25,018,582 | | **总完成Token数** | 808,150,733 | | **总Token数** | 833,169,315 | | **总生成成本** | 0.00美元 | | **平均对话轮次** | 1.00 | | **平均工具调用次数** | 0.00 | | **单样本平均Token数** | 6,675.07 | *成本基于OpenRouter平台上Hunter Alpha的定价（输入Token每百万0美元，输出Token每百万0美元）估算——该模型免费使用*

提供机构：

Crownelius

5,000+

优质数据集

54 个

任务类型

进入经典数据集