Multilingual paired code and comment changes
Source: NIAID Data Ecosystem. Indexed 2026-05-01.
Download link:
https://zenodo.org/record/10138302
Description:
Dataset used for the master's thesis "LLMs for Code Comment Consistency." Covers the languages Go, Java, JavaScript, TypeScript, and Python. All data is mined from permissively licensed public GitHub projects.
This dataset consists of pairs of function/method code blocks and their documentation comments, captured before and after a commit. An example is labeled 0 if the comment was left unchanged by the commit and 1 if the comment was changed. For comment consistency, this means that in a 1-labeled example the old comment is inconsistent with the new code. If you are training a code summarization or comment generation model instead, ignore the classification label.
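The labeling rule above can be sketched in Python as follows. This is only an illustration: the field names (`old_code`, `new_code`, `old_comment`, `new_comment`, `label`) and the whitespace-insensitive comparison are assumptions, not the dataset's documented schema, so check the files in the Zenodo record before relying on them.

```python
from dataclasses import dataclass


@dataclass
class CommentChangeExample:
    # Hypothetical field names; the real column names may differ.
    old_code: str
    new_code: str
    old_comment: str
    new_comment: str
    label: int  # 1 if the comment changed in the commit, 0 otherwise


def derive_label(old_comment: str, new_comment: str) -> int:
    """Return 1 if the comment differs before and after the commit, else 0.

    Ignoring surrounding whitespace is an assumption made for this sketch.
    """
    return int(old_comment.strip() != new_comment.strip())


def to_consistency_input(ex: CommentChangeExample) -> tuple[str, str, int]:
    """Pair the OLD comment with the NEW code; label 1 means they disagree."""
    return ex.old_comment, ex.new_code, ex.label


def to_summarization_input(ex: CommentChangeExample) -> tuple[str, str]:
    """For comment generation, use code and comment only and drop the label."""
    return ex.new_code, ex.new_comment
```

For the consistency task, the classifier sees the stale comment next to the updated code, which is why a changed comment (label 1) is read as evidence that the old comment no longer matched.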
All-22k contains the training, validation, and test sets used for the models trained in the paper. The examples are balanced by language and between the positive and negative classes. Each code repository appears in only one of these sets.
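The two properties described for All-22k (repository-disjoint splits and balance across language and class) can be checked with a short script. The sketch below assumes each example is a dictionary with hypothetical keys `repo`, `language`, and `label`; the actual file layout and key names in the release may differ.

```python
from collections import Counter


def repos_are_disjoint(splits: dict[str, list[dict]]) -> bool:
    """Return True if no repository appears in more than one split."""
    first_seen_in: dict[str, str] = {}
    for split_name, examples in splits.items():
        for ex in examples:
            # "repo" is an assumed key name for the source repository.
            owner = first_seen_in.setdefault(ex["repo"], split_name)
            if owner != split_name:
                return False
    return True


def counts_by_language_and_label(examples: list[dict]) -> Counter:
    """Count examples per (language, label) pair to inspect the balance."""
    return Counter((ex["language"], ex["label"]) for ex in examples)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the dataset.
    toy_splits = {
        "train": [
            {"repo": "a/x", "language": "Go", "label": 0},
            {"repo": "a/x", "language": "Go", "label": 1},
        ],
        "test": [{"repo": "b/y", "language": "Python", "label": 1}],
    }
    print(repos_are_disjoint(toy_splits))
    print(counts_by_language_and_label(toy_splits["train"]))
```

Keeping repositories disjoint prevents near-duplicate functions from the same project from leaking between training and evaluation data.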
Created:
2023-11-16



