BrainboxAI/code-training-il
Source: Hugging Face. Updated 2026-04-22. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/BrainboxAI/code-training-il
Summary:
Code-Training-IL is a curated instruction-tuning corpus for training small coding assistants. It comprises 40,330 examples: 20,000 Python examples from NVIDIA's OpenCodeInstruct (filtered to a test-pass rate > 50%), 20,000 TypeScript examples from bleugreen/typescript-instruct, and 330 hand-written bilingual (Hebrew/English) identity examples. The dataset is intended for fine-tuning small base models (2B–8B) on Python and TypeScript, and relies on test-pass filtering to improve model quality. Its limitations: it covers only Python and TypeScript, it has a temporal cutoff, and it is English-dominant. The dataset is licensed under Apache 2.0 and maintained by Netanel Elyasi of BrainboxAI.
Provider:
BrainboxAI
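
The composition described in the summary (a test-pass filter on the Python pool, then equal Python and TypeScript shares plus a small identity set) can be illustrated with a minimal sketch. This is not the maintainer's actual pipeline: the field name `pass_rate`, the helper names `filter_by_pass_rate` and `build_mix`, and the use of random sampling to reach 20,000 examples per language are all assumptions made for illustration.

```python
import random


def filter_by_pass_rate(examples, threshold=0.5, key="pass_rate"):
    """Keep examples whose test-pass rate is strictly above the threshold.

    `key` is a hypothetical field name; the real source column may differ.
    """
    return [ex for ex in examples if ex.get(key, 0.0) > threshold]


def build_mix(python_pool, typescript_pool, identity_pool,
              per_language=20_000, seed=0):
    """Assemble a mix: filtered Python + TypeScript + identity examples.

    Sampling down to `per_language` is an assumption; the summary only
    states the final counts, not how they were reached.
    """
    rng = random.Random(seed)
    python_kept = filter_by_pass_rate(python_pool)
    python_part = rng.sample(python_kept, min(per_language, len(python_kept)))
    typescript_part = rng.sample(
        typescript_pool, min(per_language, len(typescript_pool))
    )
    mix = python_part + typescript_part + list(identity_pool)
    rng.shuffle(mix)
    return mix


if __name__ == "__main__":
    # Toy pools standing in for the real sources.
    toy_python = [{"id": i, "pass_rate": i / 10} for i in range(11)]
    toy_typescript = [{"id": f"ts-{i}"} for i in range(3)]
    toy_identity = [{"id": "identity-0"}]
    mix = build_mix(toy_python, toy_typescript, toy_identity, per_language=2)
    print(len(mix), "examples in the toy mix")
```

With the counts stated in the summary, the same arithmetic gives 20,000 + 20,000 + 330 = 40,330 examples. The filter uses a strict comparison, so an example with a pass rate of exactly 50% is dropped, matching the "> 50%" wording.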



