PyX
Source: arXiv. Indexed: 2025-09-30.
Download link:
https://github.com/ARiSE-Lab/SemCoder
Resource description:
PyX is a curated Python corpus of fully executable code samples, each paired with a functional description and test cases, designed to capture program semantics. It is strictly filtered to retain only programs whose execution traces are clean and easy to learn from: the programs do not interact with external resources, use only Python built-in types, consist of a single function, and behave predictably. This curation yields high-quality, parsable, and executable data suitable for instruction tuning. The dataset is intended for training code language models in semantic understanding and execution reasoning.
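The filtering criteria described above can be illustrated with a minimal sketch. This is not the dataset's actual pipeline: the record layout (`description`, `code`, `tests`), the helper names `is_single_function_program` and `passes_tests`, and the use of "no imports" as a crude proxy for "no external resource interaction" are all assumptions made for illustration.

```python
import ast


def is_single_function_program(source):
    """Hypothetical check: parsable, exactly one top-level function, no imports."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # not parsable, so not usable
    # Assumption: rejecting every import approximates "no external resources".
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            return False
    # The module body must consist of one function definition and nothing else.
    return len(tree.body) == 1 and isinstance(tree.body[0], ast.FunctionDef)


def passes_tests(source, tests):
    """Hypothetical check: the code runs and every assert-style test holds."""
    namespace = {}
    try:
        exec(source, namespace)  # defines the function in a fresh namespace
        for test in tests:
            exec(test, namespace)  # each test is a single assert statement
    except Exception:
        return False
    return True


# An illustrative record; the field names are assumed, not taken from the dataset.
sample = {
    "description": "Return the sum of the even numbers in a list.",
    "code": "def sum_even(nums):\n    return sum(n for n in nums if n % 2 == 0)\n",
    "tests": [
        "assert sum_even([1, 2, 3, 4]) == 6",
        "assert sum_even([]) == 0",
    ],
}

keep = is_single_function_program(sample["code"]) and passes_tests(
    sample["code"], sample["tests"]
)
```

A real filter would also need sandboxing and time limits before calling `exec` on untrusted code; the sketch omits both for brevity.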



