KoinicLabs/AXL-DATASET-1
Source: Hugging Face. Updated 2026-03-31; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/KoinicLabs/AXL-DATASET-1
Resource description:
---
license: apache-2.0
language:
- code
tags:
- code-generation
- training-data
- python
- multi-scale-transformer
- cpu-optimized
- koinic
task_categories:
- text-generation
- text2text-generation
pretty_name: AXL Training Data
size_categories:
- 100M<n<1B
---
# AXL Training Data
Training datasets for the [AXL](https://github.com/Koinic/AXL) multi-scale transformer model family by [Koinic](https://koinic.ai).
## Dataset Structure
```
code/ # Raw Python code for language modeling
├── axl_hf_evol_codealpaca_v1.txt (239 MB) Real Python from HuggingFace
├── axl_hf_Evol_Instruct_Code_80k_v1.txt (113 MB) Evol-Instruct code data
├── axl_python_code_5gb.txt (58 MB) Python code corpus
├── axl_python_code_hf.txt (5.5 MB) Python code
├── axl_python_code.txt (2.1 MB) Python code
└── axl_power_hour.txt (977 KB) Misc code
pairs/ # Input-output training pairs
├── comment/ # Code → commented code
│ ├── axl_comment_pairs.txt (17 MB)
│ └── axl_comment_pairs_expanded.txt (20 MB)
├── docs/ # Code → documented code
│ ├── axl_docs_pairs.txt (20 MB)
│ └── axl_docs_pairs_expanded.txt (16 MB)
├── testgen/ # Function → test cases
│ ├── axl_testgen_pairs.txt (12 MB)
│ └── axl_testgen_pairs_expanded.txt (11 MB)
├── refactor/ # Bad code → refactored code
│ └── axl_refactor_pairs.txt (7.2 MB)
├── secure/ # Vulnerable → secure code
│ └── axl_secure_pairs.txt (5.7 MB)
├── chat/ # User query → assistant response
│ └── axl_chat_pairs.txt (9.6 MB)
└── translation/ # Code translation pairs
    ├── translation_all_pairs.jsonl (2.6 MB)
    └── generate_translation_data.py (14 KB)
demo/ # Demo / testing data
└── shakespeare.txt (1.1 MB)
```
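Once the files are on disk, a quick inventory can help confirm that the local copy matches the tree above. The following is a minimal sketch, not part of the AXL scripts: `summarize_tree` is a hypothetical helper, and the `data` directory is an assumption that mirrors the `local_dir` used in the Usage section.

```python
from collections import defaultdict
from pathlib import Path


def summarize_tree(root):
    """Return {top-level directory: (file count, total bytes)} for files under root."""
    root = Path(root)
    totals = defaultdict(lambda: [0, 0])
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        # Skip hidden entries such as download caches.
        if any(part.startswith(".") for part in rel.parts):
            continue
        # Files sitting directly under root are grouped under ".".
        top = rel.parts[0] if len(rel.parts) > 1 else "."
        totals[top][0] += 1
        totals[top][1] += path.stat().st_size
    return {name: (count, size) for name, (count, size) in totals.items()}


def print_summary(root):
    """Print one line per top-level directory (code, pairs, demo, ...)."""
    for name, (count, size) in sorted(summarize_tree(root).items()):
        print(f"{name:<8} {count:>3} files  {size / 1e6:8.1f} MB")


# Example (assumed path): print_summary("data")
```

The per-directory totals can then be compared by eye against the sizes listed in the tree.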
## Usage
### Download with Python
```python
from huggingface_hub import snapshot_download
snapshot_download(repo_id="Koinic/axl-training-data", repo_type="dataset", local_dir="data/")
```
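To fetch only part of the dataset (for example one `pairs/` subdirectory), `snapshot_download` accepts an `allow_patterns` argument. The sketch below assumes the directory layout shown under Dataset Structure; `patterns_for` and `download_subsets` are hypothetical helper names. Note that the listing URL on this page uses `KoinicLabs/AXL-DATASET-1` while the commands in this README use `Koinic/axl-training-data`; which ID resolves has not been verified here, so the repo ID is left as a parameter.

```python
def patterns_for(subsets):
    """Turn directory names such as "pairs/comment" into glob patterns."""
    return [f"{s.strip('/')}/*" for s in subsets]


def download_subsets(repo_id, subsets, local_dir="data/"):
    """Download only the listed subdirectories of a dataset repo."""
    # Imported lazily so patterns_for works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=patterns_for(subsets),
    )


# Example (repo ID copied from the command above; verify it before use):
# download_subsets("Koinic/axl-training-data", ["pairs/comment", "demo"])
```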
### Download with CLI
```bash
huggingface-cli download Koinic/axl-training-data --repo-type dataset
```
### Use with AXL Training
```bash
# Train using downloaded data
python scripts/train_axl_micro.py --data_path data/code/axl_hf_evol_codealpaca_v1.txt --max_time 600
# Or train with pairs
python scripts/train_axl_micro.py --data_path data/pairs/comment/axl_comment_pairs_expanded.txt --max_time 600
```
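To run the same training command over several files in turn, a small wrapper can build each command line. This is a sketch only: it assumes `scripts/train_axl_micro.py` accepts exactly the `--data_path` and `--max_time` flags shown above, and `build_train_command` and `train_on_each` are hypothetical names, not part of the AXL repository.

```python
import subprocess
import sys


def build_train_command(data_path, max_time=600):
    """Build the argument list for one training run (flags as shown above)."""
    return [
        sys.executable,
        "scripts/train_axl_micro.py",
        "--data_path", str(data_path),
        "--max_time", str(max_time),
    ]


def train_on_each(data_paths, max_time=600, dry_run=True):
    """Print each command when dry_run is True, otherwise run them one by one."""
    commands = [build_train_command(p, max_time) for p in data_paths]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop if a run exits with an error.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `train_on_each([...])` with the default `dry_run=True` only prints the commands, which makes it easy to review them before starting a long run.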
### Generate More Data
```bash
# Generate all synthetic training data
python scripts/generate_all_training_data.py --skip-hf
# Generate translation pairs
python data/pairs/translation/generate_translation_data.py
```
## Data Sources
| Source | Type | License |
|--------|------|---------|
| `axl_hf_evol_codealpaca_v1.txt` | HuggingFace (bigcode/starcoderdata) | OpenRAIL-M |
| `axl_hf_Evol_Instruct_Code_80k_v1.txt` | HuggingFace (sahil2801/CodeAlpaca-20k) | Apache-2.0 |
| `*_pairs*.txt` | Synthetic (generated by AXL scripts) | Apache-2.0 |
| `axl_python_code*.txt` | Curated Python code | Apache-2.0 |
| `shakespeare.txt` | Public domain | Public domain |
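
When filtering files by license, the table above can be encoded as filename patterns. The following sketch copies the rows verbatim and matches on the base filename; `DATA_SOURCES` and `license_for` are hypothetical names introduced for illustration.

```python
from fnmatch import fnmatchcase

# Rows copied from the Data Sources table: (filename pattern, type, license).
DATA_SOURCES = [
    ("axl_hf_evol_codealpaca_v1.txt", "HuggingFace (bigcode/starcoderdata)", "OpenRAIL-M"),
    ("axl_hf_Evol_Instruct_Code_80k_v1.txt", "HuggingFace (sahil2801/CodeAlpaca-20k)", "Apache-2.0"),
    ("*_pairs*.txt", "Synthetic (generated by AXL scripts)", "Apache-2.0"),
    ("axl_python_code*.txt", "Curated Python code", "Apache-2.0"),
    ("shakespeare.txt", "Public domain", "Public domain"),
]


def license_for(filename):
    """Return the license of the first matching table row, or None if no row matches."""
    name = filename.rsplit("/", 1)[-1]  # match on the base filename only
    for pattern, _source_type, license_name in DATA_SOURCES:
        if fnmatchcase(name, pattern):
            return license_name
    return None
```

Files that match no row in the table, such as `axl_power_hour.txt` and `translation_all_pairs.jsonl`, return `None` here; the dataset-level license declared in the card header is `apache-2.0`.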
## Citation
```bibtex
@misc{axl_2026,
title={AXL: Multi-Scale Agentic Transformer for CPU-Optimized Code Generation},
author={Koinic},
year={2026},
url={https://github.com/Koinic/AXL}
}
```
## License
Apache-2.0
Provider:
KoinicLabs



