HuggingFaceTB/stack-edu
Source: Hugging Face. Updated: 2025-03-20. Indexed: 2025-08-30.
Download link:
https://hf-mirror.com/datasets/HuggingFaceTB/stack-edu
Description:
Stack-Edu is a 125B-token dataset of educational code filtered from The Stack v2, specifically from StarCoder2Data, the curated training corpus of the StarCoder2 models. It is intended for language model training. Stack-Edu uses a classifier-based filtering strategy to retain only the highest-quality educational programming content. On the MultiPL-E benchmark, Stack-Edu shows consistent improvement over StarCoder2Data across all programming languages.
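The classifier-based filtering described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `CodeFile` record, the `toy_score` function (a crude comment-density heuristic standing in for a trained classifier), and the threshold of 2.0 are hypothetical and are not taken from the Stack-Edu pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class CodeFile:
    """Hypothetical record for one source file in a code corpus."""
    path: str
    text: str


def filter_by_score(
    files: Iterable[CodeFile],
    score_fn: Callable[[str], float],
    threshold: float,
) -> Iterator[CodeFile]:
    """Yield only the files whose score is at least `threshold`."""
    for f in files:
        if score_fn(f.text) >= threshold:
            yield f


def toy_score(text: str) -> float:
    """Stand-in for a trained classifier (an assumption, not the real model).

    Returns 5 * (fraction of non-blank lines that start with '#'),
    so the result lies in the range 0.0 to 5.0.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    commented = sum(1 for ln in lines if ln.lstrip().startswith("#"))
    return 5.0 * commented / len(lines)


files = [
    CodeFile("a.py", "# add two numbers\nx = 1 + 2\n"),
    CodeFile("b.py", "x=1\ny=2\n"),
]
kept = list(filter_by_score(files, toy_score, threshold=2.0))
```

In a real pipeline, `score_fn` would wrap a trained model's prediction, and the threshold would be chosen by evaluating models trained on the filtered data.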
Provided by:
HuggingFaceTB



