
KoinicLabs/AXL-DATASET-1

Source: Hugging Face. Updated 2026-03-31. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/KoinicLabs/AXL-DATASET-1
Description:
---
license: apache-2.0
language:
- code
tags:
- code-generation
- training-data
- python
- multi-scale-transformer
- cpu-optimized
- koinic
task_categories:
- text-generation
- text2text-generation
pretty_name: AXL Training Data
size_categories:
- 100M<n<1B
---

# AXL Training Data

Training datasets for the [AXL](https://github.com/Koinic/AXL) multi-scale transformer model family by [Koinic](https://koinic.ai).

## Dataset Structure

```
code/                                      # Raw Python code for language modeling
├── axl_hf_evol_codealpaca_v1.txt          (239 MB)  Real Python from HuggingFace
├── axl_hf_Evol_Instruct_Code_80k_v1.txt   (113 MB)  Evol-Instruct code data
├── axl_python_code_5gb.txt                (58 MB)   Python code corpus
├── axl_python_code_hf.txt                 (5.5 MB)  Python code
├── axl_python_code.txt                    (2.1 MB)  Python code
└── axl_power_hour.txt                     (977 KB)  Misc code

pairs/                                     # Input-output training pairs
├── comment/                               # Code → commented code
│   ├── axl_comment_pairs.txt              (17 MB)
│   └── axl_comment_pairs_expanded.txt     (20 MB)
├── docs/                                  # Code → documented code
│   ├── axl_docs_pairs.txt                 (20 MB)
│   └── axl_docs_pairs_expanded.txt        (16 MB)
├── testgen/                               # Function → test cases
│   ├── axl_testgen_pairs.txt              (12 MB)
│   └── axl_testgen_pairs_expanded.txt     (11 MB)
├── refactor/                              # Bad code → refactored code
│   └── axl_refactor_pairs.txt             (7.2 MB)
├── secure/                                # Vulnerable → secure code
│   └── axl_secure_pairs.txt               (5.7 MB)
├── chat/                                  # User query → assistant response
│   └── axl_chat_pairs.txt                 (9.6 MB)
└── translation/                           # Code translation pairs
    ├── translation_all_pairs.jsonl        (2.6 MB)
    └── generate_translation_data.py       (14 KB)

demo/                                      # Demo / testing data
└── shakespeare.txt                        (1.1 MB)
```

## Usage

### Download with Python

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Koinic/axl-training-data", repo_type="dataset", local_dir="data/")
```

### Download with CLI

```bash
huggingface-cli download Koinic/axl-training-data --repo-type dataset
```

### Use with AXL Training

```bash
# Train using downloaded data
python scripts/train_axl_micro.py --data_path data/code/axl_hf_evol_codealpaca_v1.txt --max_time 600

# Or train with pairs
python scripts/train_axl_micro.py --data_path data/pairs/comment/axl_comment_pairs_expanded.txt --max_time 600
```

### Generate More Data

```bash
# Generate all synthetic training data
python scripts/generate_all_training_data.py --skip-hf

# Generate translation pairs
python data/pairs/translation/generate_translation_data.py
```

## Data Sources

| Source | Type | License |
|--------|------|---------|
| `axl_hf_evol_codealpaca_v1.txt` | HuggingFace (bigcode/starcoderdata) | OpenRAIL-M |
| `axl_hf_Evol_Instruct_Code_80k_v1.txt` | HuggingFace (sahil2801/CodeAlpaca-20k) | Apache-2.0 |
| `*_pairs*.txt` | Synthetic (generated by AXL scripts) | Apache-2.0 |
| `axl_python_code*.txt` | Curated Python code | Apache-2.0 |
| `shakespeare.txt` | Public domain | Public domain |

## Citation

```bibtex
@misc{axl_2026,
  title={AXL: Multi-Scale Agentic Transformer for CPU-Optimized Code Generation},
  author={Koinic},
  year={2026},
  url={https://github.com/Koinic/AXL}
}
```

## License

Apache-2.0
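
After a download, it can be useful to check that the local copy roughly matches the layout listed under Dataset Structure. The following is a minimal sketch, assuming the snapshot was saved to `data/` as in the Python download example. `summarize_dataset` is a hypothetical helper name used here for illustration and is not part of the AXL scripts.

```python
from collections import defaultdict
from pathlib import Path


def summarize_dataset(root):
    """Return {top-level directory: (file count, total bytes)} under root.

    Hypothetical helper for illustration; not part of the AXL repository.
    """
    root = Path(root)
    if not root.is_dir():
        return {}
    totals = defaultdict(lambda: [0, 0])
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        # Skip hidden entries such as download caches.
        if any(part.startswith(".") for part in rel.parts):
            continue
        # Files sitting directly in root are grouped under "(root)".
        group = rel.parts[0] if len(rel.parts) > 1 else "(root)"
        totals[group][0] += 1
        totals[group][1] += path.stat().st_size
    return {name: (count, size) for name, (count, size) in sorted(totals.items())}


# Assumes "data/" is the local_dir used when downloading; prints nothing if it is absent.
for name, (count, size) in summarize_dataset("data/").items():
    print(f"{name:<8} {count:>4} files  {size / 1e6:8.1f} MB")
```

Comparing the per-directory totals for `code/`, `pairs/`, and `demo/` against the sizes in the tree gives a quick signal that the download completed.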
Provider:
KoinicLabs