anupambayen/AnupamB-Coder-Dataset

Name: anupambayen/AnupamB-Coder-Dataset
Creator: anupambayen
Published: 2026-03-22 13:36:49
License: MIT

Source: Hugging Face. Updated 2026-03-22; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/anupambayen/AnupamB-Coder-Dataset


Description:

---
license: mit
language:
- en
tags:
- code
- python
- sql
- synthetic
- instruction-tuning
- code-generation
- text-to-sql
- algorithms
- data-structures
- llm
- gpt
- fine-tuning
- code-llm
- programming
- dataset
- binary-search
- dynamic-programming
- machine-learning
- deep-learning
- nlp
pretty_name: AnupamB-Coder-Dataset
size_categories:
- 1M<n<10M
task_categories:
- text-generation
multilinguality:
- monolingual
source_datasets:
- original
annotations_creators:
- machine-generated
language_creators:
- machine-generated
---

# AnupamB-Coder-Dataset

A large-scale synthetic dataset of Python and SQL examples spanning basic to expert difficulty, purpose-built for training [AnupamB-Coder-110M](https://huggingface.co/anupambayen/AnupamB-Coder-110M), a GPT-style code language model built entirely from scratch on a gaming laptop.

---

## The Story Behind This Dataset

Most code datasets on HuggingFace come from scraping GitHub or StackOverflow. This one is different.

Every single example in this dataset was **generated by a pure Python template engine**: no GPT, no API calls, no scraping. Just carefully designed generators that combine vocabulary, logic patterns, and difficulty levels to produce clean, structured, well-commented Python and SQL examples.

The goal was simple: create a dataset where every example teaches something, from printing a multiplication table to implementing Dijkstra's algorithm to writing recursive CTEs with sessionization logic.

This dataset was the foundation for Stage 5 and Stage 6 fine-tuning of AnupamB-Coder-110M.

---

## Dataset Stats

| Language  | Chunks | Examples      | Size        |
|-----------|--------|---------------|-------------|
| Python    | 40     | 4,000,000     | ~2.76 GB    |
| SQL       | 20     | 2,000,000     | ~1.16 GB    |
| **Total** | **60** | **6,000,000** | **~3.9 GB** |

> Each chunk contains exactly 100,000 examples.
> Every chunk was generated with a unique random seed,
> guaranteed fresh examples across all chunks.
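As a quick sanity check on the table above, the chunk counts and totals can be recomputed from the per-language rows. The sketch below also derives an average example size, which the README does not state; it assumes 1 GB means 10^9 bytes, and the names `STATS` and `avg_bytes_per_example` are hypothetical.

```python
# Figures copied from the Dataset Stats table; the sizes are approximate.
EXAMPLES_PER_CHUNK = 100_000
STATS = {
    "python": {"chunks": 40, "examples": 4_000_000, "size_gb": 2.76},
    "sql": {"chunks": 20, "examples": 2_000_000, "size_gb": 1.16},
}


def avg_bytes_per_example(size_gb: float, examples: int) -> float:
    """Average example size, assuming 1 GB = 10**9 bytes (an assumption)."""
    return size_gb * 1e9 / examples


for name, row in STATS.items():
    # Each chunk holds exactly 100,000 examples, so chunks * 100,000 must match.
    assert row["chunks"] * EXAMPLES_PER_CHUNK == row["examples"]
    avg = avg_bytes_per_example(row["size_gb"], row["examples"])
    print(f"{name}: about {round(avg)} bytes per example")

total_examples = sum(row["examples"] for row in STATS.values())
total_gb = sum(row["size_gb"] for row in STATS.values())
print(f"total: {total_examples:,} examples, about {total_gb:.2f} GB")
```

Under that assumption the averages come out to roughly 690 bytes per Python example and 580 bytes per SQL example, which is consistent with short, single-function examples.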

---

## Difficulty Distribution

Each example is explicitly tagged with a difficulty level inside the text itself.

| Level | Python Coverage | SQL Coverage | Weight |
|-------|-----------------|--------------|--------|
| **basic** | arithmetic, strings, lists, dicts, loops | SELECT, INSERT, UPDATE, DELETE, WHERE | 30% |
| **intermediate** | binary search, merge sort, quicksort, two sum, sliding window, DP, heap, LRU cache, linked list, OOP, functional | GROUP BY, HAVING, JOINs, subqueries, window functions, CASE WHEN, running totals | 35% |
| **advanced** | graphs, Dijkstra, design patterns, concurrency, generators, coroutines | CTEs, period-over-period growth, PIVOT, anomaly detection, consecutive streaks, median | 25% |
| **expert** | trees, knapsack, LCS, LIS, metaclasses, descriptors, event systems | recursive CTEs, date spines, funnel analysis, sessionization, hierarchy trees | 10% |

---

## Python Topics Covered

### Basic (30%)
- Arithmetic operations with type validation
- String manipulation: reverse, palindrome, anagram, case conversion
- List operations: sort, search, filter, flatten, chunk, rotate
- Conditional logic: prime check, perfect square, power of two
- Loop patterns: multiplication tables, sieve of Eratosthenes, digit sum
- Dictionary operations: frequency count, merge, invert, group by

### Intermediate (35%)
- **Search**: binary search (leftmost, rightmost, rotated array)
- **Sorting**: merge sort with inversion counting, quicksort with median-of-three pivot
- **Arrays**: two sum, three sum, sliding window (fixed and variable size)
- **Dynamic Programming**: coin change, climbing stairs, Kadane's algorithm, house robber
- **Heap**: k-largest, kth-largest, merge k sorted lists
- **OOP**: MinStack with O(1) get_min, LRU cache, linked list with cycle detection
- **Functional**: compose, pipe, memoize, group_by, partition, flat_map
- **Error handling**: custom exception hierarchy, retry decorator, safe execute

### Advanced (25%)
- **Graphs**: BFS, DFS, Dijkstra, cycle detection, topological sort
- **Design Patterns**: Singleton, Factory, Strategy, Observer, Event Bus
- **Concurrency**: thread-safe counter, worker pool, parallel map, rate limiter
- **Generators**: infinite counter, Fibonacci, prime generator, batched, pipeline

### Expert (10%)
- **Trees**: from_list builder, diameter, LCA, serialize/deserialize, zigzag traversal, right side view
- **DP**: 0/1 Knapsack with reconstruction, LCS with backtracking, LIS with patience sorting
- **Metaprogramming**: auto-registering metaclass, typed descriptor, context manager, plugin registry

---

## SQL Topics Covered

### Basic (30%)
- SELECT with WHERE, LIKE, BETWEEN, IS NULL
- INSERT, UPDATE, DELETE
- ORDER BY, LIMIT, DISTINCT
- COUNT, CREATE TABLE

### Intermediate (35%)
- GROUP BY with HAVING
- INNER JOIN, LEFT JOIN, RIGHT JOIN
- Subqueries (correlated and non-correlated)
- Window functions: ROW_NUMBER, RANK, DENSE_RANK, NTILE
- Running totals and moving averages
- Monthly summaries with DATE_FORMAT
- CASE WHEN categorization

### Advanced (25%)
- Common Table Expressions (CTEs)
- Period-over-period growth with LAG
- PIVOT using conditional aggregation
- Z-score anomaly detection with CROSS JOIN stats
- Consecutive date streak analysis
- Median calculation with ROW_NUMBER

### Expert (10%)
- Recursive CTEs with date spine and gap filling
- Funnel analysis with stage conversion rates
- Hierarchical tree traversal (employee-manager)
- Sessionization with 30-minute gap threshold

---

## 8 SQL Schemas Used

All SQL examples are grounded in one of these realistic schemas:

| Table          | Key Columns                                        |
|----------------|----------------------------------------------------|
| `users`        | id, name, email, age, city, salary, department     |
| `orders`       | id, user_id, amount, status, order_date, region    |
| `products`     | id, name, category, price, stock, rating           |
| `employees`    | id, name, department, salary, manager_id           |
| `sales`        | id, employee_id, revenue, sale_date, channel       |
| `customers`    | id, name, loyalty_points, total_spent, segment     |
| `transactions` | id, account_id, amount, type, balance              |
| `inventory`    | id, product_id, quantity, reorder_level, unit_cost |

---

## Data Format

Every example is a JSON object with a single `text` field.

### Python example format

```json
{
  "text": "### Instruction\n# Task : Implement binary search on a sorted list\n# Level : intermediate\n# Solution:\ndef binary_search(arr: list, target: int) -> int:\n ..."
}
```

### SQL example format

```json
{
  "text": "### SQL Query\n-- Question : Calculate running total of revenue grouped by region\n-- Level : advanced\n-- Context : CREATE TABLE sales (...)\n-- Answer :\nWITH ..."
}
```

---

## How It Was Generated

No API was called. No GPT was used. No scraping happened.

The dataset was generated by `generate_synthetic_dataset.py`, a pure Python script with:

- **16 Python generator functions** covering all difficulty levels
- **4 SQL generator functions** (basic, intermediate, advanced, expert)
- **8 SQL table schemas** with realistic column definitions
- **Weighted random selection** across difficulty levels
- **MD5-based deduplication** within each chunk
- **Unique random seed per chunk**: fresh examples every run

The script runs entirely locally on CPU and generates approximately 600,000 Python examples per hour.
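The three mechanisms listed above (weighted difficulty selection, MD5-based deduplication within a chunk, and a seed unique to each chunk) can be sketched as follows. This is a minimal illustration, not the contents of `generate_synthetic_dataset.py`, which is not shown here: `make_example`, `generate_chunk`, `level_of`, and the `base_seed + chunk_index` seeding scheme are all hypothetical, and the toy template stands in for the real generators. Only the weights and the text layout are taken from this README.

```python
import hashlib
import json
import random
from collections import Counter

# Weights from the Difficulty Distribution table.
LEVEL_WEIGHTS = {"basic": 0.30, "intermediate": 0.35, "advanced": 0.25, "expert": 0.10}


def make_example(rng: random.Random, level: str) -> str:
    """Hypothetical stand-in for a template generator (the real ones are not shown)."""
    a, b = rng.randint(1, 10_000), rng.randint(1, 10_000)
    return (
        "### Instruction\n"
        f"# Task : Add {a} and {b}\n"
        f"# Level : {level}\n"
        "# Solution:\n"
        f"print({a} + {b})\n"
    )


def generate_chunk(chunk_index: int, size: int, base_seed: int = 1234) -> list[str]:
    """Build one chunk of unique JSON lines from a seed unique to the chunk."""
    rng = random.Random(base_seed + chunk_index)  # seeding scheme is an assumption
    levels, weights = zip(*LEVEL_WEIGHTS.items())
    seen: set[str] = set()
    lines: list[str] = []
    while len(lines) < size:
        # Weighted random selection of the difficulty level.
        level = rng.choices(levels, weights=weights, k=1)[0]
        text = make_example(rng, level)
        # MD5-based deduplication within the chunk: skip texts already emitted.
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        lines.append(json.dumps({"text": text}))
    return lines


def level_of(json_line: str) -> str:
    """Read the difficulty tag back out of the Python example format."""
    for row in json.loads(json_line)["text"].splitlines():
        if row.startswith("# Level :"):
            return row.split(":", 1)[1].strip()
    raise ValueError("no level tag found")


chunk = generate_chunk(chunk_index=0, size=2_000)
counts = Counter(level_of(line) for line in chunk)
for level in LEVEL_WEIGHTS:
    print(f"{level:<12} {counts[level] / len(chunk):.1%}")
```

Because the difficulty level lives inside the `text` field rather than in a separate column, a parser like `level_of` is one way to filter or rebalance examples by difficulty after loading. SQL examples use `-- Level :` instead of `# Level :`, so the prefix would need adjusting for them.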

---

## How to Load

```python
from datasets import load_dataset

# Load all Python data
ds = load_dataset(
    "anupambayen/AnupamB-Coder-Dataset",
    data_files="data/python/*.jsonl",
    split="train"
)

# Load all SQL data
ds = load_dataset(
    "anupambayen/AnupamB-Coder-Dataset",
    data_files="data/sql/*.jsonl",
    split="train"
)

# Load everything
ds = load_dataset(
    "anupambayen/AnupamB-Coder-Dataset",
    data_files="data/**/*.jsonl",
    split="train"
)

# Load a single chunk
ds = load_dataset(
    "anupambayen/AnupamB-Coder-Dataset",
    data_files="data/python/chunk_000.jsonl",
    split="train"
)
print(ds[0]["text"])

# Streaming mode for large scale
stream = load_dataset(
    "anupambayen/AnupamB-Coder-Dataset",
    data_files="data/python/*.jsonl",
    split="train",
    streaming=True
)
# A streamed dataset is iterable and cannot be indexed,
# so take the first item from the iterator instead.
print(next(iter(stream))["text"])
```

---

## Training Usage Example

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "anupambayen/AnupamB-Coder-110M"
)

ds = load_dataset(
    "anupambayen/AnupamB-Coder-Dataset",
    data_files="data/python/chunk_000.jsonl",
    split="train"
)

def tokenize(example):
    # With batched=True, example["text"] is a list of strings.
    return tokenizer(
        example["text"],
        truncation=True,
        max_length=1024,  # matches the model's context length
    )

tokenized = ds.map(tokenize, batched=True)
```

---

## The Model Trained on This Dataset

**[AnupamB-Coder-110M](https://huggingface.co/anupambayen/AnupamB-Coder-110M)**

| Property          | Value                        |
|-------------------|------------------------------|
| Parameters        | 110 million                  |
| Architecture      | GPT decoder-only transformer |
| Layers            | 12                           |
| Embedding dim     | 768                          |
| Attention heads   | 12                           |
| Context length    | 1024 tokens                  |
| Vocabulary        | 32,000 custom BPE tokens     |
| Training hardware | RTX 4060 Laptop 8GB          |
| Training time     | 45+ days across 6 stages     |

---

## About the Author

**Anupam Bayen** built AnupamB-Coder-110M entirely from scratch on a gaming laptop as a learning project to deeply understand how large language models work at every level: data pipeline, tokenizer, model architecture, training loop, and deployment.
- GitHub: [github.com/anupambayen2/AnupamB-Coder](https://github.com/anupambayen2/AnupamB-Coder)
- Model: [anupambayen/AnupamB-Coder-110M](https://huggingface.co/anupambayen/AnupamB-Coder-110M)

---

## License

**MIT License**: completely free to use, modify, distribute, and train models on. Commercial use is permitted.

---

## Citation

```bibtex
@dataset{anupambayen_coder_dataset_2026,
  author    = {Anupam Bayen},
  title     = {AnupamB-Coder-Dataset: Synthetic Python and SQL Examples from Basic to Expert},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/datasets/anupambayen/AnupamB-Coder-Dataset},
  note      = {Generated using pure Python template engine. No API or GPT used in generation.}
}
```

---

*Built with patience on a gaming laptop. Every line of this dataset was generated locally: no cloud, no API, no budget. Just code and time.*

