anupambayen/AnupamB-Coder-Dataset
Source: Hugging Face (updated 2026-03-22, indexed 2026-03-29)
Download link: https://hf-mirror.com/datasets/anupambayen/AnupamB-Coder-Dataset
---
license: mit
language:
- en
tags:
- code
- python
- sql
- synthetic
- instruction-tuning
- code-generation
- text-to-sql
- algorithms
- data-structures
- llm
- gpt
- fine-tuning
- code-llm
- programming
- dataset
- binary-search
- dynamic-programming
- machine-learning
- deep-learning
- nlp
pretty_name: AnupamB-Coder-Dataset
size_categories:
- 1M<n<10M
task_categories:
- text-generation
multilinguality:
- monolingual
source_datasets:
- original
annotations_creators:
- machine-generated
language_creators:
- machine-generated
---
# AnupamB-Coder-Dataset
A large-scale synthetic dataset of Python and SQL examples
spanning basic to expert difficulty — purpose-built for
training [AnupamB-Coder-110M](https://huggingface.co/anupambayen/AnupamB-Coder-110M),
a GPT-style code language model built entirely from scratch
on a gaming laptop.
---
## The Story Behind This Dataset
Most code datasets on Hugging Face come from scraping GitHub
or StackOverflow. This one is different.
Every single example in this dataset was **generated by a
pure Python template engine** — no GPT, no API calls, no
scraping. Just carefully designed generators that combine
vocabulary, logic patterns, and difficulty levels to produce
clean, structured, well-commented Python and SQL examples.
The goal was simple: create a dataset where every example
teaches something — from printing a multiplication table
to implementing Dijkstra's algorithm to writing recursive
CTEs with sessionization logic.
This dataset was the foundation for Stage 5 and Stage 6
fine-tuning of AnupamB-Coder-110M.
---
## Dataset Stats
| Language | Chunks | Examples | Size |
|----------|--------|--------------|-----------|
| Python | 40 | 4,000,000 | ~2.76 GB |
| SQL | 20 | 2,000,000 | ~1.16 GB |
| **Total**| **60** | **6,000,000**| **~3.9 GB** |
> Each chunk contains exactly 100,000 examples.
> Every chunk was generated with its own random seed, so each
> chunk is drawn from a different random sequence. Deduplication
> is MD5-based and applied within each chunk.
---
## Difficulty Distribution
Each example is explicitly tagged with a difficulty level
inside the text itself.
| Level | Python Coverage | SQL Coverage | Weight |
|---------------|--------------------------------------------|-------------------------------------------|--------|
| **basic** | arithmetic, strings, lists, dicts, loops | SELECT, INSERT, UPDATE, DELETE, WHERE | 30% |
| **intermediate** | binary search, merge sort, quicksort, two sum, sliding window, DP, heap, LRU cache, linked list, OOP, functional | GROUP BY, HAVING, JOINs, subqueries, window functions, CASE WHEN, running totals | 35% |
| **advanced** | graphs, Dijkstra, design patterns, concurrency, generators, coroutines | CTEs, period-over-period growth, PIVOT, anomaly detection, consecutive streaks, median | 25% |
| **expert** | trees, knapsack, LCS, LIS, metaclasses, descriptors, event systems | recursive CTEs, date spines, funnel analysis, sessionization, hierarchy trees | 10% |
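
The weights in the table can be read as sampling probabilities. The following is a minimal sketch of how such a weighted draw might look, not the author's actual generator code: the function name `sample_levels` and the seed value are assumptions made for illustration, and only the four weights come from the table above.

```python
import random
from collections import Counter

# Weights copied from the difficulty table above.
LEVEL_WEIGHTS = {
    "basic": 0.30,
    "intermediate": 0.35,
    "advanced": 0.25,
    "expert": 0.10,
}


def sample_levels(n: int, seed: int) -> list[str]:
    """Draw n difficulty levels with the table's weights (hypothetical helper)."""
    rng = random.Random(seed)  # a private RNG keeps the draw reproducible
    levels = list(LEVEL_WEIGHTS)
    weights = list(LEVEL_WEIGHTS.values())
    return rng.choices(levels, weights=weights, k=n)


# Simulate one chunk of 100,000 examples and report the observed shares.
counts = Counter(sample_levels(100_000, seed=0))
for level in LEVEL_WEIGHTS:
    print(f"{level:<13}{counts[level] / 100_000:.3f}")
```

With 100,000 draws per chunk the observed shares should land very close to the nominal 30/35/25/10 split, although any single chunk will deviate slightly.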
---
## Python Topics Covered
### Basic (30%)
- Arithmetic operations with type validation
- String manipulation — reverse, palindrome, anagram, case conversion
- List operations — sort, search, filter, flatten, chunk, rotate
- Conditional logic — prime check, perfect square, power of two
- Loop patterns — multiplication tables, sieve of Eratosthenes, digit sum
- Dictionary operations — frequency count, merge, invert, group by
### Intermediate (35%)
- **Search**: binary search (leftmost, rightmost, rotated array)
- **Sorting**: merge sort with inversion counting, quicksort with median-of-three pivot
- **Arrays**: two sum, three sum, sliding window (fixed and variable size)
- **Dynamic Programming**: coin change, climbing stairs, Kadane's algorithm, house robber
- **Heap**: k-largest, kth-largest, merge k sorted lists
- **OOP**: MinStack with O(1) get_min, LRU cache, linked list with cycle detection
- **Functional**: compose, pipe, memoize, group_by, partition, flat_map
- **Error handling**: custom exception hierarchy, retry decorator, safe execute
### Advanced (25%)
- **Graphs**: BFS, DFS, Dijkstra, cycle detection, topological sort
- **Design Patterns**: Singleton, Factory, Strategy, Observer, Event Bus
- **Concurrency**: thread-safe counter, worker pool, parallel map, rate limiter
- **Generators**: infinite counter, Fibonacci, prime generator, batched, pipeline
### Expert (10%)
- **Trees**: from_list builder, diameter, LCA, serialize/deserialize, zigzag traversal, right side view
- **DP**: 0/1 Knapsack with reconstruction, LCS with backtracking, LIS with patience sorting
- **Metaprogramming**: auto-registering metaclass, typed descriptor, context manager, plugin registry
---
## SQL Topics Covered
### Basic (30%)
- SELECT with WHERE, LIKE, BETWEEN, IS NULL
- INSERT, UPDATE, DELETE
- ORDER BY, LIMIT, DISTINCT
- COUNT, CREATE TABLE
### Intermediate (35%)
- GROUP BY with HAVING
- INNER JOIN, LEFT JOIN, RIGHT JOIN
- Subqueries (correlated and non-correlated)
- Window functions: ROW_NUMBER, RANK, DENSE_RANK, NTILE
- Running totals and moving averages
- Monthly summaries with DATE_FORMAT
- CASE WHEN categorization
### Advanced (25%)
- Common Table Expressions (CTEs)
- Period-over-period growth with LAG
- PIVOT using conditional aggregation
- Z-score anomaly detection with CROSS JOIN stats
- Consecutive date streak analysis
- Median calculation with ROW_NUMBER
### Expert (10%)
- Recursive CTEs with date spine and gap filling
- Funnel analysis with stage conversion rates
- Hierarchical tree traversal (employee-manager)
- Sessionization with 30-minute gap threshold
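
To make the last expert topic concrete, here is a small sketch of sessionization with a 30-minute gap threshold, run through Python's built-in `sqlite3` module. It is an illustration written for this card, not an example taken from the dataset: the `events` table, its columns, and the sample rows are all hypothetical, and the query needs SQLite 3.25 or newer for window functions.

```python
import sqlite3

# Hypothetical event log; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (user_id INTEGER, event_time TEXT);
    INSERT INTO events VALUES
      (1, '2026-01-01 09:00:00'),
      (1, '2026-01-01 09:10:00'),
      (1, '2026-01-01 10:00:00'),
      (2, '2026-01-01 09:05:00');
    """
)

SESSIONIZE = """
WITH gaps AS (
  -- Minutes since the same user's previous event (NULL for the first one).
  SELECT user_id,
         event_time,
         (julianday(event_time)
          - julianday(LAG(event_time) OVER (
              PARTITION BY user_id ORDER BY event_time))) * 24 * 60
           AS gap_minutes
  FROM events
),
flagged AS (
  -- A session starts at a user's first event or after a gap over 30 minutes.
  SELECT user_id,
         event_time,
         CASE WHEN gap_minutes IS NULL OR gap_minutes > 30
              THEN 1 ELSE 0 END AS new_session
  FROM gaps
)
-- A running sum of the start flags numbers the sessions per user.
SELECT user_id,
       event_time,
       SUM(new_session) OVER (
         PARTITION BY user_id ORDER BY event_time) AS session_no
FROM flagged
ORDER BY user_id, event_time;
"""

rows = conn.execute(SESSIONIZE).fetchall()
for row in rows:
    print(row)
```

In this toy data the third event of user 1 arrives 50 minutes after the previous one, so it opens session 2, while the first two events share session 1.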
---
## 8 SQL Schemas Used
Every SQL example is grounded in one of these eight table schemas:
| Table | Key Columns |
|----------------|----------------------------------------------------|
| `users` | id, name, email, age, city, salary, department |
| `orders` | id, user_id, amount, status, order_date, region |
| `products` | id, name, category, price, stock, rating |
| `employees` | id, name, department, salary, manager_id |
| `sales` | id, employee_id, revenue, sale_date, channel |
| `customers` | id, name, loyalty_points, total_spent, segment |
| `transactions` | id, account_id, amount, type, balance |
| `inventory` | id, product_id, quantity, reorder_level, unit_cost |
---
## Data Format
Every example is a JSON object with a single `text` field.
### Python example format
```json
{
  "text": "### Instruction\n# Task : Implement binary search on a sorted list\n# Level : intermediate\n# Solution:\ndef binary_search(arr: list, target: int) -> int:\n ..."
}
```
### SQL example format
```json
{
  "text": "### SQL Query\n-- Question : Calculate running total of revenue grouped by region\n-- Level : advanced\n-- Context : CREATE TABLE sales (...)\n-- Answer :\nWITH ..."
}
```
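
Because the difficulty level and the answer live inside the `text` string rather than in separate fields, filtering by level requires a little parsing. The sketch below shows one possible way to do it, based only on the two format samples above. The helper `parse_example` and both regular expressions are assumptions, and the exact spacing around the colons in the real files may differ, which is why the patterns tolerate extra whitespace.

```python
import re

# Matches "# Level : intermediate" (Python) or "-- Level : advanced" (SQL).
LEVEL_RE = re.compile(r"^(?:#|--) Level\s*:\s*(\w+)\s*$", re.MULTILINE)
# Matches the marker line that precedes the code: "# Solution:" or "-- Answer :".
BODY_RE = re.compile(r"^(?:# Solution|-- Answer)\s*:[ \t]*\n", re.MULTILINE)


def parse_example(text: str) -> dict:
    """Split one `text` value into language, level, and code body."""
    first_line = text.split("\n", 1)[0]
    language = "sql" if first_line.startswith("### SQL Query") else "python"
    level_match = LEVEL_RE.search(text)
    body_match = BODY_RE.search(text)
    return {
        "language": language,
        "level": level_match.group(1) if level_match else None,
        "body": text[body_match.end():] if body_match else "",
    }


# The two samples from the format section above, with their elisions kept.
python_text = (
    "### Instruction\n# Task : Implement binary search on a sorted list\n"
    "# Level : intermediate\n# Solution:\n"
    "def binary_search(arr: list, target: int) -> int:\n ..."
)
sql_text = (
    "### SQL Query\n-- Question : Calculate running total of revenue grouped by region\n"
    "-- Level : advanced\n-- Context : CREATE TABLE sales (...)\n"
    "-- Answer :\nWITH ..."
)

python_info = parse_example(python_text)
sql_info = parse_example(sql_text)
print(python_info["language"], python_info["level"])
print(sql_info["language"], sql_info["level"])
```

A function like this can be passed to a filter step to keep, for instance, only `expert` examples.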
---
## How It Was Generated
No API was called. No GPT was used. No scraping happened.
The dataset was generated by `generate_synthetic_dataset.py`
— a pure Python script with:
- **16 Python generator functions** covering all difficulty levels
- **4 SQL generator functions** (basic, intermediate, advanced, expert)
- **8 SQL table schemas** with realistic column definitions
- **Weighted random selection** across difficulty levels
- **MD5-based deduplication** within each chunk
- **Unique random seed per chunk** — fresh examples every run
The script runs entirely locally on CPU and generates
approximately 600,000 Python examples per hour.
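
The per-chunk seeding and MD5-based deduplication listed above could be combined roughly as follows. This is a sketch under stated assumptions, not an excerpt from `generate_synthetic_dataset.py`: `toy_generator`, `build_chunk`, the `base_seed` value, and the attempt limit are all hypothetical stand-ins.

```python
import hashlib
import random


def toy_generator(rng: random.Random) -> str:
    """Hypothetical stand-in for one template-based generator function."""
    a, b = rng.randint(1, 5), rng.randint(1, 5)
    return f"# Task : Add {a} and {b}\nprint({a} + {b})"


def build_chunk(chunk_index: int, size: int, base_seed: int = 1234) -> list[str]:
    """Generate `size` unique examples for one chunk (illustrative only)."""
    rng = random.Random(base_seed + chunk_index)  # a distinct seed per chunk
    seen: set[str] = set()  # MD5 digests already emitted in this chunk
    chunk: list[str] = []
    attempts = 0
    # The attempt limit guards against looping forever if the templates
    # cannot produce enough distinct examples.
    while len(chunk) < size and attempts < size * 100:
        attempts += 1
        text = toy_generator(rng)
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate within this chunk, skip it
        seen.add(digest)
        chunk.append(text)
    return chunk


chunk = build_chunk(chunk_index=0, size=10)
print(len(chunk), len(set(chunk)))
```

Since the `seen` set is rebuilt for every chunk, this scheme removes duplicates inside a chunk but does not by itself detect the same example appearing in two different chunks.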
---
## How to Load
```python
from datasets import load_dataset

# Load all Python data
python_ds = load_dataset(
    "anupambayen/AnupamB-Coder-Dataset",
    data_files="data/python/*.jsonl",
    split="train",
)

# Load all SQL data
sql_ds = load_dataset(
    "anupambayen/AnupamB-Coder-Dataset",
    data_files="data/sql/*.jsonl",
    split="train",
)

# Load everything
all_ds = load_dataset(
    "anupambayen/AnupamB-Coder-Dataset",
    data_files="data/**/*.jsonl",
    split="train",
)

# Load a single chunk
chunk_ds = load_dataset(
    "anupambayen/AnupamB-Coder-Dataset",
    data_files="data/python/chunk_000.jsonl",
    split="train",
)
print(chunk_ds[0]["text"])

# Streaming mode for large-scale use. This returns an IterableDataset,
# which cannot be indexed with [0], so take the first item by iterating.
stream = load_dataset(
    "anupambayen/AnupamB-Coder-Dataset",
    data_files="data/python/*.jsonl",
    split="train",
    streaming=True,
)
print(next(iter(stream))["text"])
```
---
## Training Usage Example
```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "anupambayen/AnupamB-Coder-110M"
)

ds = load_dataset(
    "anupambayen/AnupamB-Coder-Dataset",
    data_files="data/python/chunk_000.jsonl",
    split="train",
)


def tokenize(batch):
    # With batched=True, batch["text"] is a list of strings.
    return tokenizer(
        batch["text"],
        truncation=True,
        max_length=1024,  # matches the model's context length
    )


tokenized = ds.map(tokenize, batched=True)
```
---
## The Model Trained on This Dataset
**[AnupamB-Coder-110M](https://huggingface.co/anupambayen/AnupamB-Coder-110M)**
| Property | Value |
|-------------------|------------------------------------|
| Parameters | 110 million |
| Architecture | GPT decoder-only transformer |
| Layers | 12 |
| Embedding dim | 768 |
| Attention heads | 12 |
| Context length | 1024 tokens |
| Vocabulary | 32,000 custom BPE tokens |
| Training hardware | RTX 4060 Laptop 8GB |
| Training time | 45+ days across 6 stages |
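
As a rough sanity check, the 110 million figure is consistent with the other rows of the table under a GPT-2-style layout. The sketch below assumes learned positional embeddings, biases on all linear layers, a 4x MLP expansion, two LayerNorms per block plus a final one, and output weights tied to the token embedding. None of these details are stated in this card, so treat the breakdown as an estimate rather than the model's actual parameter inventory.

```python
def gpt_param_count(
    n_layer: int,
    d_model: int,
    vocab_size: int,
    n_ctx: int,
    tie_embeddings: bool = True,
) -> int:
    """Estimate parameters of a GPT-2-style decoder (assumed layout)."""
    token_emb = vocab_size * d_model
    pos_emb = n_ctx * d_model  # learned positional embeddings (assumption)
    # Attention: fused QKV projection (3d^2 + 3d) plus output projection (d^2 + d).
    attn = 4 * d_model * d_model + 4 * d_model
    # MLP with 4x expansion: up (4d^2 + 4d) plus down (4d^2 + d).
    mlp = 8 * d_model * d_model + 5 * d_model
    norms = 4 * d_model  # two LayerNorms per block, weight and bias each
    per_layer = attn + mlp + norms
    final_norm = 2 * d_model
    total = token_emb + pos_emb + n_layer * per_layer + final_norm
    if not tie_embeddings:
        total += vocab_size * d_model  # separate output projection
    return total


total = gpt_param_count(n_layer=12, d_model=768, vocab_size=32_000, n_ctx=1024)
print(f"{total:,}")  # 110,418,432
```

The number of attention heads does not appear in the formula because splitting the 768-dimensional projection into 12 heads does not change the size of the weight matrices.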
---
## About the Author
**Anupam Bayen** — built AnupamB-Coder-110M entirely from
scratch on a gaming laptop as a learning project to deeply
understand how large language models work at every level:
data pipeline, tokenizer, model architecture, training loop,
and deployment.
- GitHub: [github.com/anupambayen2/AnupamB-Coder](https://github.com/anupambayen2/AnupamB-Coder)
- Model: [anupambayen/AnupamB-Coder-110M](https://huggingface.co/anupambayen/AnupamB-Coder-110M)
---
## License
**MIT License** — completely free to use, modify, distribute,
and train models on. Commercial use is permitted.
---
## Citation
```bibtex
@dataset{anupambayen_coder_dataset_2026,
  author    = {Anupam Bayen},
  title     = {AnupamB-Coder-Dataset: Synthetic Python and SQL
               Examples from Basic to Expert},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/anupambayen/AnupamB-Coder-Dataset},
  note      = {Generated using pure Python template engine.
               No API or GPT used in generation.}
}
```
---
*Built with patience on a gaming laptop.
Every line of this dataset was generated locally — no cloud,
no API, no budget. Just code and time.*