CodeReviewCommentsNER
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/10060889
Description:
This dataset contains code review comments from three sources: GitHub projects, Android Gerrit, and Tizen Gerrit. It comprises 3000 comments with 15420 manually labeled named entities for the token classification task. The entity classes are: Variable, Function, Class, Value, File_Name, File_Type, Keyword, Data_Type, Library_Package, Error_Name, HTML_XML_Tag, Operating_System, Programming_Language, External_Tool, Website. Labels follow the IOB2 format. The archive also contains the data already split into train and test sets. Our best model, based on CodeBERT, achieved a token-level F1 of 0.7347, a Type F1 of 0.7703, and a Strict F1 of 0.7290, as measured with the nervaluate library.
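For readers unfamiliar with IOB2, the sketch below shows how a tag sequence can be decoded into entity spans. It is only an illustration: the tag strings (for example `B-Error_Name`), the sample tokens, and the helper name `iob2_to_spans` are assumptions, and the actual file layout and tag spelling inside the archive have not been checked here.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Decode IOB2 tags into (label, start, end) spans, end exclusive.

    In IOB2 every entity starts with a B- tag, continues with I- tags
    of the same label, and tokens outside any entity are tagged O.
    """
    spans: List[Tuple[str, int, int]] = []
    label, start = None, 0
    for i, tag in enumerate(tags):
        # Close the open entity on O, on a new B- tag,
        # or on an I- tag that carries a different label.
        if label is not None and (not tag.startswith("I-") or tag[2:] != label):
            spans.append((label, start, i))
            label = None
        if tag.startswith("B-"):
            label, start = tag[2:], i
        elif tag.startswith("I-") and label is None:
            # Stray I- tag without a preceding B-: leniently open an entity.
            label, start = tag[2:], i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical comment, tokenized and tagged by hand for illustration.
tokens = ["NullPointerException", "in", "parseConfig", "on", "Windows", "10"]
tags = ["B-Error_Name", "O", "B-Function", "O",
        "B-Operating_System", "I-Operating_System"]

spans = iob2_to_spans(tags)
for label, start, end in spans:
    print(label, tokens[start:end])
```

Span-level decoding of this kind is what entity-level metrics such as Strict and Type F1 operate on, which is why they can differ from the token-level F1 reported above.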
Created:
2023-11-01



