NLBSE Code Comment Classification dataset

Name: NLBSE Code Comment Classification dataset
Creator: NLBSE
License: Not specified

Indexed from arXiv on 2025-09-30

Download link:

https://github.com/nlbse2023/code-comment-classification

Description:

This dataset contains code comments extracted from multiple Java, Python, and Pharo projects. Each comment is split into sentences, and each sentence is manually annotated and linked to the file it was extracted from. The class distribution is imbalanced: negative samples far outnumber positive ones. Each sample has an ID, the sentence text, the class or source file, a partition, a category, and an instance type. Java and Python have approximately 2.4K and 2.6K training samples per category, respectively, and Pharo has approximately 1.8K samples per category. The task is code comment classification.
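To make the imbalance concrete, the sketch below counts positive and negative instances per category within one partition. It runs on a few made-up rows, not on the real data. The field names (id, sentence, category, partition, instance_type) and the 1/0 encoding of positive/negative instances are assumptions for illustration, and the actual files in the repository may name or encode them differently.

```python
from collections import Counter, defaultdict

# Toy rows with hypothetical field names and values; not taken from the dataset.
rows = [
    {"id": 1, "sentence": "Returns the user name.", "category": "summary", "partition": "train", "instance_type": 1},
    {"id": 2, "sentence": "Call close() when done.", "category": "summary", "partition": "train", "instance_type": 0},
    {"id": 3, "sentence": "See the parser module.", "category": "summary", "partition": "train", "instance_type": 0},
    {"id": 4, "sentence": "Returns the user name.", "category": "usage", "partition": "train", "instance_type": 0},
    {"id": 5, "sentence": "Call close() when done.", "category": "usage", "partition": "train", "instance_type": 1},
    {"id": 6, "sentence": "See the parser module.", "category": "usage", "partition": "train", "instance_type": 0},
    {"id": 7, "sentence": "Pass a path to open().", "category": "usage", "partition": "test", "instance_type": 1},
]


def class_balance(rows, partition="train"):
    """Return {category: (positives, negatives)} for one partition.

    Assumes instance_type is 1 for a positive instance and 0 for a negative one.
    """
    counts = defaultdict(Counter)
    for row in rows:
        if row["partition"] == partition:
            counts[row["category"]][row["instance_type"]] += 1
    return {category: (c[1], c[0]) for category, c in counts.items()}


balance = class_balance(rows)
for category, (pos, neg) in balance.items():
    print(f"{category}: {pos} positive, {neg} negative")
```

On the real data, the same per-category counts would show how strongly negatives dominate, which is worth checking before choosing an evaluation metric or a resampling strategy.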

Provider:

NLBSE
