
NLBSE Code Comment Classification dataset

Indexed from arXiv on 2025-09-30
Download link:
https://github.com/nlbse2023/code-comment-classification
Description:
This dataset contains code comments extracted from multiple projects written in Java, Python, and Pharo. Each comment is split into sentences, manually annotated, and linked to the source file it was extracted from. The class distribution is imbalanced: negative samples far outnumber positive ones. Each sample records an ID, the sentence text, the source class or file, the partition, the category, and the instance type. For Java and Python there are roughly 2.4K and 2.6K training samples per category, respectively; for Pharo there are about 1.8K samples per category. The intended task is code comment classification.
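Because the description stresses that negative samples far outnumber positive ones, it can help to check the label balance per category before training. The following is a minimal sketch, not the dataset's official tooling. The column names (`comment_sentence_id`, `comment_sentence`, `class`, `partition`, `category`, `instance_type`), the category names, and the encoding of `instance_type` as 1 for positive and 0 for negative are all assumptions based on the attribute list above, and the rows are synthetic placeholders rather than real data.

```python
import csv
import io
from collections import Counter, defaultdict

# Synthetic rows for illustration only; NOT taken from the real dataset.
# Column names, category names, and the 0/1 encoding are assumptions.
SAMPLE_CSV = """comment_sentence_id,comment_sentence,class,partition,category,instance_type
1,returns the parsed value,Parser.java,0,summary,1
1,returns the parsed value,Parser.java,0,usage,0
1,returns the parsed value,Parser.java,0,deprecation,0
2,use parse() before calling this,Parser.java,0,summary,0
2,use parse() before calling this,Parser.java,0,usage,1
2,use parse() before calling this,Parser.java,0,deprecation,0
"""


def label_balance(csv_text):
    """Return {category: (positives, negatives)} for the given CSV text."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["category"]][int(row["instance_type"])] += 1
    return {cat: (c[1], c[0]) for cat, c in counts.items()}


balance = label_balance(SAMPLE_CSV)
for category, (pos, neg) in sorted(balance.items()):
    print(f"{category}: {pos} positive, {neg} negative")
```

To run this against the real files, one would replace `SAMPLE_CSV` with the contents of a CSV from the repository and adjust the column names to match its actual header.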
Provider:
NLBSE