AxiomicLabs/NPset-python
Source: Hugging Face. Updated 2026-04-09. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/AxiomicLabs/NPset-python
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
- sentence-similarity
language:
- en
tags:
- code
pretty_name: NPset-Python
size_categories:
- 1M<n<10M
---

# NPset
A normalized semi-synthetic Python dataset for training small language models on code logic without the overhead of raw code syntax.

## Why
Small language models trained on natural language corpora develop latent representations of logical constructs (iteration, conditionals, data flow, function composition), yet they struggle to apply this reasoning to source code. Syntactic overhead such as delimiters, indentation conventions, and language-specific idioms occupies a disproportionate share of the token budget, requires a vocabulary of code-specific tokens rarely encountered during pretraining, and introduces a surface-form distribution shift relative to the model's prior knowledge.

NPset addresses this by normalizing Python source through an AST-based converter that strips syntactic noise while preserving the full logical structure of each program. The result is a pseudocode representation composed entirely of natural language tokens, which aligns more directly with the semantic representations already present in small models and lets them reason about what code *does* rather than spend capacity learning what it *looks like*.
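
To make the idea of AST-based normalization concrete, here is a minimal sketch built on Python's standard `ast` module. It is not the NPset converter, whose rules and vocabulary are not documented in this card. The word mappings, the output layout, and the names `expr_to_words`, `stmt_to_lines`, `body_to_lines`, and `normalize` are all assumptions made for illustration, and only a small subset of Python is handled.

```python
import ast

# Hypothetical word mappings; NPset's real vocabulary is not documented here.
BIN_OPS = {ast.Add: "plus", ast.Sub: "minus", ast.Mult: "times", ast.Div: "divided by"}
CMP_OPS = {ast.Eq: "equals", ast.NotEq: "does not equal",
           ast.Lt: "is less than", ast.Gt: "is greater than"}


def expr_to_words(node):
    """Render a small subset of expressions as plain words."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
        return f"{expr_to_words(node.left)} {BIN_OPS[type(node.op)]} {expr_to_words(node.right)}"
    if isinstance(node, ast.Compare) and len(node.ops) == 1 and type(node.ops[0]) in CMP_OPS:
        right = expr_to_words(node.comparators[0])
        return f"{expr_to_words(node.left)} {CMP_OPS[type(node.ops[0])]} {right}"
    if isinstance(node, ast.Call):
        callee = expr_to_words(node.func)
        args = " and ".join(expr_to_words(a) for a in node.args)
        return f"call {callee} with {args}" if args else f"call {callee}"
    # Unsupported expression: fall back to the original Python text.
    return ast.unparse(node)


def body_to_lines(body, depth):
    """Convert a list of statements, keeping their order."""
    lines = []
    for stmt in body:
        lines.extend(stmt_to_lines(stmt, depth))
    return lines


def stmt_to_lines(node, depth=0):
    """Render a small subset of statements; nesting is shown by padding."""
    pad = "  " * depth
    if isinstance(node, ast.FunctionDef):
        params = " and ".join(a.arg for a in node.args.args)
        head = f"function {node.name} takes {params}" if params else f"function {node.name}"
        return [pad + head] + body_to_lines(node.body, depth + 1)
    if isinstance(node, ast.For):
        head = f"for each {expr_to_words(node.target)} in {expr_to_words(node.iter)}"
        return [pad + head] + body_to_lines(node.body, depth + 1)
    if isinstance(node, ast.If):
        lines = [pad + f"if {expr_to_words(node.test)}"] + body_to_lines(node.body, depth + 1)
        if node.orelse:
            lines += [pad + "otherwise"] + body_to_lines(node.orelse, depth + 1)
        return lines
    if isinstance(node, ast.Return):
        value = expr_to_words(node.value) if node.value is not None else "nothing"
        return [pad + f"return {value}"]
    if isinstance(node, ast.Assign) and len(node.targets) == 1:
        return [pad + f"set {expr_to_words(node.targets[0])} to {expr_to_words(node.value)}"]
    if isinstance(node, ast.Expr):
        return [pad + expr_to_words(node.value)]
    # Unsupported statement: keep the original Python text.
    return [pad + line for line in ast.unparse(node).splitlines()]


def normalize(source):
    """Parse Python source and return the toy pseudocode rendering."""
    return "\n".join(body_to_lines(ast.parse(source).body, 0))


SAMPLE = """
def total(xs):
    acc = 0
    for x in xs:
        if x > 0:
            acc = acc + x
    return acc
"""

print(normalize(SAMPLE))
```

Working from the parsed tree rather than the raw text means delimiters such as colons and parentheses never reach the output, while the nesting of loops and branches is carried over from the tree itself. Requires Python 3.9 or later for `ast.unparse`.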

## Format
Parquet, shuffled. Each row:

| Field | Type | Description |
|---|---|---|
| `code` | string | Normalized pseudocode |
| `original_code` | string | Original Python source |
| `original_language` | string | Always `Python` |
| `source` | string | Origin dataset identifier |
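
The schema above can be checked with a few lines of plain Python. This is a sketch only: `validate_row` and the example row are hypothetical, and the row's field values are made up for illustration rather than taken from the dataset.

```python
# Field names as listed in the table above; every field is a string.
EXPECTED_FIELDS = ("code", "original_code", "original_language", "source")


def validate_row(row):
    """Return a list of schema problems for one row (empty if none)."""
    problems = []
    for field in EXPECTED_FIELDS:
        if field not in row:
            problems.append(f"missing field: {field}")
        elif not isinstance(row[field], str):
            problems.append(f"field is not a string: {field}")
    # The card states that original_language is always "Python".
    language = row.get("original_language")
    if isinstance(language, str) and language != "Python":
        problems.append(f"unexpected original_language: {language}")
    return problems


# Hypothetical row; the values are illustrative, not real dataset content.
example_row = {
    "code": "function add takes a and b return a plus b",
    "original_code": "def add(a, b):\n    return a + b\n",
    "original_language": "Python",
    "source": "dbands_pythonMath",
}

print(validate_row(example_row))
```

Rows could be fed to this check from the Hugging Face `datasets` library, for example with `load_dataset("AxiomicLabs/NPset-python", split="train", streaming=True)`, assuming the dataset exposes a `train` split (the card does not say).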

## Sources
| Source | Dataset | Rows |
|---|---|---:|
| `nomic_cornstack_python_v1` | nomic-ai/cornstack-python-v1 | 3,498,845 |
| `zaydzuhri_stack_edu_python` | zaydzuhri/stack-edu-python (`license_type=no_license`) | 3,543,752 |
| `jtatman_500k` | jtatman/python-code-dataset-500k | 32,590 |
| `iamtarun_python_18k_alpaca` | iamtarun/python_code_instructions_18k_alpaca | 17,496 |
| `flytech_python_25k` | flytech/python-codes-25k | 42,968 |
| `dbands_pythonMath` | dbands/pythonMath | 5,726 |
| `greatdarklord_python_dataset` | greatdarklord/python_dataset | 18,452 |
| | **Total** | **7,159,829** |
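
As a quick sanity check on the table, the per-source row counts can be summed and turned into shares of the whole. The counts are copied from the table above; the variable names are chosen only for this sketch.

```python
# Row counts per source identifier, copied from the Sources table.
ROW_COUNTS = {
    "nomic_cornstack_python_v1": 3_498_845,
    "zaydzuhri_stack_edu_python": 3_543_752,
    "jtatman_500k": 32_590,
    "iamtarun_python_18k_alpaca": 17_496,
    "flytech_python_25k": 42_968,
    "dbands_pythonMath": 5_726,
    "greatdarklord_python_dataset": 18_452,
}

total = sum(ROW_COUNTS.values())

# Largest sources first, with each source's share of all rows.
ranked = sorted(ROW_COUNTS.items(), key=lambda item: item[1], reverse=True)
for name, rows in ranked:
    print(f"{name:32s} {rows:>10,d} {rows / total:7.2%}")

print(f"{'total':32s} {total:>10,d}")

# Combined share of the two largest sources.
top_two_share = (ranked[0][1] + ranked[1][1]) / total
```

The sum matches the stated total of 7,159,829, and the two largest sources together account for roughly 98% of the rows, so the smaller instruction-style datasets contribute only a thin slice of the mix.
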

Provider: AxiomicLabs



