
AxiomicLabs/NPset-python

Source: Hugging Face. Updated 2026-04-09. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/AxiomicLabs/NPset-python
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
- sentence-similarity
language:
- en
tags:
- code
pretty_name: NPset-Python
size_categories:
- 1M<n<10M
---

![Axiomic Banner](AxiomicBanner.png)

# NPset

A normalized semi-synthetic Python dataset for training small language models on code logic without the overhead of raw code syntax.

![Tokenizer chart](tokenizer_chart.png)

## Why

Small language models trained on natural language corpora develop latent representations of logical constructs (iteration, conditionals, data flow, function composition), yet struggle to apply this reasoning to source code. There, syntactic overhead (delimiters, indentation conventions, language-specific idioms) occupies a disproportionate share of the token budget, requires a vocabulary of code-specific tokens rarely encountered during pretraining, and introduces a surface-form distribution shift relative to the model's prior knowledge.

NPset addresses this by normalizing Python source through an AST-based converter that strips syntactic noise while preserving the full logical structure of each program. The result is a pseudocode representation composed entirely of natural language tokens, which aligns more directly with the semantic representations already present in small models. This lets them reason about what code *does* rather than expending capacity learning what it *looks like*.

## Format

Parquet, shuffled. Each row:

| Field | Type | Description |
|---|---|---|
| `code` | string | Normalized pseudocode |
| `original_code` | string | Original Python source |
| `original_language` | string | Always `Python` |
| `source` | string | Origin dataset identifier |

## Sources

| Source | Dataset | Rows |
|---|---|---:|
| `nomic_cornstack_python_v1` | nomic-ai/cornstack-python-v1 | 3,498,845 |
| `zaydzuhri_stack_edu_python` | zaydzuhri/stack-edu-python (`license_type=no_license`) | 3,543,752 |
| `jtatman_500k` | jtatman/python-code-dataset-500k | 32,590 |
| `iamtarun_python_18k_alpaca` | iamtarun/python_code_instructions_18k_alpaca | 17,496 |
| `flytech_python_25k` | flytech/python-codes-25k | 42,968 |
| `dbands_pythonMath` | dbands/pythonMath | 5,726 |
| `greatdarklord_python_dataset` | greatdarklord/python_dataset | 18,452 |
| | **Total** | **7,159,829** |
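
To make the idea of an AST-based normalization concrete, here is a minimal sketch built on Python's standard `ast` module. It is not the converter used to build NPset, whose rules and output format are not described here. The class `ToyNormalizer`, the function `normalize`, and the English phrasings are all hypothetical, and expressions are left as Python text rather than verbalized. It assumes Python 3.9+ for `ast.unparse`.

```python
import ast


class ToyNormalizer(ast.NodeVisitor):
    """Hypothetical toy converter: renders a small Python subset as pseudocode."""

    def __init__(self):
        self.lines = []
        self.depth = 0

    def emit(self, text):
        # Indent with two spaces per nesting level.
        self.lines.append("  " * self.depth + text)

    def block(self, statements):
        # Visit a nested statement list one level deeper.
        self.depth += 1
        for statement in statements:
            self.visit(statement)
        self.depth -= 1

    def visit_FunctionDef(self, node):
        params = ", ".join(arg.arg for arg in node.args.args) or "nothing"
        self.emit(f"define function {node.name} taking {params}")
        self.block(node.body)

    def visit_For(self, node):
        target = ast.unparse(node.target)
        self.emit(f"for each {target} in {ast.unparse(node.iter)}")
        self.block(node.body)

    def visit_If(self, node):
        self.emit(f"if {ast.unparse(node.test)}")
        self.block(node.body)
        if node.orelse:
            self.emit("otherwise")
            self.block(node.orelse)

    def visit_Assign(self, node):
        targets = ", ".join(ast.unparse(t) for t in node.targets)
        self.emit(f"set {targets} to {ast.unparse(node.value)}")

    def visit_Return(self, node):
        value = ast.unparse(node.value) if node.value else "nothing"
        self.emit(f"return {value}")

    def generic_visit(self, node):
        # Statements without a dedicated rule are kept as plain Python text,
        # so no logic is silently dropped; other nodes are traversed normally.
        if isinstance(node, ast.stmt):
            self.emit(ast.unparse(node))
        else:
            super().generic_visit(node)


def normalize(source: str) -> str:
    """Parse Python source and return the toy pseudocode rendering."""
    visitor = ToyNormalizer()
    visitor.visit(ast.parse(source))
    return "\n".join(visitor.lines)


sample = '''
def total_even(values):
    total = 0
    for v in values:
        if v % 2 == 0:
            total = total + v
    return total
'''

print(normalize(sample))
```

For the sample function, the sketch prints `define function total_even taking values` followed by indented lines such as `set total to 0` and `for each v in values`. Colons, `def`, and other delimiters disappear while the control flow and data flow stay intact, which is the general effect the section above describes.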
Provided by:
AxiomicLabs