
WithinUsAI/GOD_Coder_Complete_DataSet

Source: Hugging Face. Updated: 2026-03-29. Indexed: 2026-04-26.
Download link:
https://hf-mirror.com/datasets/WithinUsAI/GOD_Coder_Complete_DataSet
Resource description:
---
language:
- en
license: other
license_name: within-us-ai-custom-dataset-license
pretty_name: GOD_Coder_Complete_DataSet
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- question-answering
- text-classification
tags:
- code
- coding
- software-engineering
- instruction-tuning
- sft
- ai-coding
- complete-project-coding
- repository-patching
- debugging
- dependency-resolution
- full-stack-engineering
- code-review
- dataset
annotations_creators:
- machine-generated
- expert-generated
language_creators:
- machine-generated
multilinguality:
- monolingual
source_datasets:
- original
viewer: false
---

# GOD_Coder_Complete_DataSet

## Subtitle

A large-scale complete-project coding dataset by **gss1147 / WithIn Us AI**, built to train language models into stronger professional software-engineering assistants.

## Dataset Summary

**GOD_Coder_Complete_DataSet** is a large synthetic supervised fine-tuning dataset designed to help turn a general language model into a **professional complete-project AI coder**.

The dataset focuses on teaching models how to:

- diagnose realistic repository issues
- patch broken code with production-ready fixes
- write and repair tests
- handle dependency and migration failures
- reason across full software stacks
- solve advanced coding-logic problems
- behave more like a senior engineer on complete software projects

This dataset was created by **gss1147** under **WithIn Us AI**.

## Creator

- **Creator:** gss1147
- **Organization / Brand:** WithIn Us AI
- **Dataset Concept, Design, Structure, and Packaging:** WithIn Us AI
- **Primary Author:** gss1147

## License

This dataset uses the **WithIn Us AI Custom Dataset License**.

## Dataset Purpose

The purpose of this dataset is to provide a strong supervised fine-tuning resource for training coding-capable LLMs toward:

- complete software-project reasoning
- professional engineering behavior
- multi-file patch generation
- debugging and issue resolution
- test-backed implementation quality
- dependency-aware coding
- rollout-safe software delivery
- increasingly advanced coding logic

This dataset is intended for researchers, model builders, and fine-tuning practitioners who want a model that behaves more like a **real software engineer**, not just a code autocompleter.

## Supported Tasks

This dataset is suitable for:

- supervised fine-tuning
- instruction tuning
- coding assistant specialization
- software-engineering behavior shaping
- repository issue repair
- debugging assistance
- dependency resolution training
- software delivery planning
- code review improvement
- complete-project coding workflows

## Dataset Structure

The dataset is organized into **7 major subject groups**, each containing **25,000 examples**, for a total of **175,000 rows**.

### Subject Groups

1. **AI Coding**
2. **AI Dependency Coding**
3. **AI Coding Stacks**
4. **AI Software Development**
5. **AI Coding Logic Master**
6. **AI Coding Logic Legendary**
7. **AI Coding Logic God**

### Total Size

- **Total examples:** 175,000
- **Train examples:** 171,500
- **Validation examples:** 3,500

## Data Format

Each example is stored in **chat-format JSONL** and includes:

- `id`
- `subject`
- `subject_title`
- `tier`
- `language`
- `framework`
- `stack`
- `domain`
- `topic`
- `task_type`
- `split`
- `freshness_bucket`
- `source_grounding`
- `messages`
- `artifacts`
- `labels`

### Example Schema

```json
{
  "id": "ai_coding-00001-abcdef1234567890",
  "subject": "ai_coding",
  "subject_title": "AI Coding",
  "tier": "hard",
  "language": "Python",
  "framework": "FastAPI",
  "stack": ["FastAPI", "PostgreSQL", "Redis", "Celery", "pytest", "Docker"],
  "domain": "auth service",
  "topic": "JWT refresh token rotation",
  "task_type": "repo_issue_patch",
  "split": "train",
  "freshness_bucket": "synthetic_transformed_post_2025_style",
  "source_grounding": {
    "kind": "synthetic_transformed_repo_task",
    "license_ok": true,
    "provenance_note": "Synthetic training example designed for coding-model SFT and labeled as synthetic."
  },
  "messages": [
    {
      "role": "system",
      "content": "You are a production-grade software engineer. Return a correct, secure, complete, test-backed solution with concise reasoning and no placeholders."
    },
    {
      "role": "user",
      "content": "Repository domain: auth service..."
    },
    {
      "role": "assistant",
      "content": "Diagnosis... implementation... tests... verification..."
    }
  ],
  "artifacts": {
    "verification_commands": ["pytest -q", "ruff check ."],
    "requires_tests": true,
    "format": "chat_sft"
  },
  "labels": {
    "correctness": 1,
    "security": 1,
    "production_ready": 1,
    "test_quality": 1,
    "complete_project_focus": 1
  }
}
```

## Languages Covered

The dataset includes tasks across multiple coding and infrastructure languages, including:

- Python
- TypeScript
- JavaScript
- Go
- Rust
- Java
- C#
- C++
- SQL
- Bash
- YAML

## Content Overview

The dataset emphasizes production-style software engineering.
It includes examples involving:

- bug fixing
- feature implementation
- code review correction
- API design
- dependency resolution
- version migration repair
- lockfile and reproducibility debugging
- full-stack issue handling
- rollout-safe software delivery
- incident remediation
- concurrency and logic debugging
- performance bottleneck repair
- multi-file patching
- security hardening
- observability-aware engineering

## Data Generation Method

This dataset was created as a synthetic structured coding dataset for fine-tuning and instruction-tuning purposes.

The generation process focused on:

- professional software-engineering style prompts
- complete implementation responses
- test-backed solutions
- production-oriented reasoning
- multi-stack coverage
- advanced logic difficulty bands
- complete-project engineering behavior

Examples were designed to reflect realistic repository and engineering scenarios while remaining clearly labeled as synthetic.

## Why This Dataset Exists

Many coding datasets over-focus on:

- short single-function code tasks
- toy algorithm problems
- incomplete snippets
- beginner-level instruction pairs

GOD_Coder_Complete_DataSet was created to push beyond that by training models on:

- complete-project coding behavior
- software-engineering decision quality
- professional debugging patterns
- multi-layer issue resolution
- deployment-safe thinking
- engineering-grade patch quality

## Intended Use

This dataset is intended for:

- full-model fine-tuning
- instruction tuning
- coding model specialization
- research into software-engineering-capable LLMs
- training models that can operate more effectively in repository-style workflows

It is especially relevant for users building:

- coding copilots
- patch-generation systems
- engineering support agents
- code-review assistants
- debugging assistants
- full-stack project agents

## Recommended Training Uses

Recommended uses include:

- supervised fine-tuning on chat-formatted LLMs
- continued instruction tuning for coding behavior
- staged curriculum learning across difficulty tiers
- subject-wise training by shard
- multi-phase training where foundational coding precedes advanced logic tiers

### Suggested Progression

1. AI Coding
2. AI Dependency Coding
3. AI Coding Stacks
4. AI Software Development
5. AI Coding Logic Master
6. AI Coding Logic Legendary
7. AI Coding Logic God

## Source Data

- **Source Type:** Original dataset created by WithIn Us AI
- **Primary Creator:** gss1147
- **Dataset Design:** WithIn Us AI
- **Origin:** Synthetic and structured software-engineering task generation

## Data Splits

- **Train:** 171,500
- **Validation:** 3,500

The split is tracked using the `split` field inside each example.

## Dataset Strengths

- large-scale
- complete-project focus
- professional engineering framing
- multi-language coverage
- test-backed outputs
- multi-subject structure
- strong software-development emphasis
- suited for coding-model specialization
- useful for curriculum-based fine-tuning

## Dataset Limitations

- synthetic rather than extracted from real private repositories
- does not guarantee novelty against all historic model pretraining corpora
- should be combined with careful evaluation
- should ideally be paired with held-out benchmark testing
- should not be treated as a substitute for licensed real-world patch datasets where available

## Bias, Risks, and Safety

## Quality Philosophy

The dataset was designed around these principles:

- no placeholders
- complete answers
- production-ready orientation
- secure-by-default thinking
- tests included as a training signal
- full-project engineering mindset
- patch and verification awareness

## Citation

### BibTeX

    @dataset{gss1147_god_coder_complete_dataset_2026,
      author = {gss1147 and WithIn Us AI},
      title = {GOD_Coder_Complete_DataSet},
      year = {2026},
      publisher = {Hugging Face},
      note = {Synthetic supervised fine-tuning dataset for professional complete-project AI coding}
    }

## Acknowledgment

GOD_Coder_Complete_DataSet was created by gss1147 under WithIn Us AI as part of a broader effort to build stronger open coding-focused AI systems with professional software-engineering behavior.

Here is the only YAML fix that mattered:

```yaml
license: other
license_name: within-us-ai-custom-dataset-license
```
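
The card describes each row as a JSONL object with sixteen top-level fields, a `split` field, and a suggested subject-by-subject training order. The following is a minimal Python sketch of how a user might sanity-check a shard against that description. The helper names (`missing_fields`, `split_counts`, `curriculum_rank`) are hypothetical and not part of the dataset, and the sketch assumes `subject_title` holds exactly the group names listed under "Suggested Progression", which the card shows only for "AI Coding".

```python
import json
from collections import Counter

# Top-level fields listed in the card's "Data Format" section.
REQUIRED_FIELDS = (
    "id", "subject", "subject_title", "tier", "language", "framework",
    "stack", "domain", "topic", "task_type", "split", "freshness_bucket",
    "source_grounding", "messages", "artifacts", "labels",
)

# Subject groups in the card's "Suggested Progression" order.
# Assumption: `subject_title` values match these strings exactly.
SUGGESTED_PROGRESSION = (
    "AI Coding",
    "AI Dependency Coding",
    "AI Coding Stacks",
    "AI Software Development",
    "AI Coding Logic Master",
    "AI Coding Logic Legendary",
    "AI Coding Logic God",
)


def missing_fields(example: dict) -> list[str]:
    """Return the documented top-level fields absent from one example."""
    return [name for name in REQUIRED_FIELDS if name not in example]


def split_counts(lines) -> Counter:
    """Count examples per `split` value in an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        if line.strip():  # skip blank lines between records
            counts[json.loads(line)["split"]] += 1
    return counts


def curriculum_rank(example: dict) -> int:
    """Zero-based position of an example's subject in the suggested order."""
    return SUGGESTED_PROGRESSION.index(example["subject_title"])


if __name__ == "__main__":
    # Tiny in-memory stand-in for a JSONL shard; real rows carry full content.
    row = {name: None for name in REQUIRED_FIELDS}
    row.update(split="train", subject_title="AI Coding Stacks")
    shard = [json.dumps(row), json.dumps({**row, "split": "validation"})]
    print(missing_fields(row))
    print(split_counts(shard))
    print(curriculum_rank(row))
```

Run over the full dataset, `split_counts` would be expected to report 171,500 `train` and 3,500 `validation` rows if the files match the card. Sorting rows by `curriculum_rank` is one possible way to stage the curriculum the card recommends, not a method the authors prescribe.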
Provider:
WithinUsAI