AISE-TUDelft/multilingual-code-comments-fixed-3
Source: Hugging Face. Updated 2026-02-16; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/AISE-TUDelft/multilingual-code-comments-fixed-3
Resource overview:
---
dataset_info:
- config_name: Chinese
features:
- name: file_id
dtype: string
- name: content
dtype: string
- name: repo
dtype: string
- name: path
dtype: string
- name: original_comment
dtype: string
- name: masked_data_Qwen/CodeQwen1.5-7B
dtype: string
- name: predict_Qwen/CodeQwen1.5-7B
dtype: string
- name: predicted_comment_Qwen/CodeQwen1.5-7B
dtype: string
- name: masked_data_bigcode/starcoder2-7b
dtype: string
- name: predict_bigcode/starcoder2-7b
dtype: string
- name: predicted_comment_bigcode/starcoder2-7b
dtype: string
- name: masked_data_ibm-granite/granite-8b-code-base
dtype: string
- name: predict_ibm-granite/granite-8b-code-base
dtype: string
- name: predicted_comment_ibm-granite/granite-8b-code-base
dtype: string
- name: masked_data_meta-llama/CodeLlama-7b-hf
dtype: string
- name: predict_meta-llama/CodeLlama-7b-hf
dtype: string
- name: predicted_comment_meta-llama/CodeLlama-7b-hf
dtype: string
- name: masked_data_google/codegemma-7b
dtype: string
- name: predict_google/codegemma-7b
dtype: string
- name: predicted_comment_google/codegemma-7b
dtype: string
- name: expert_accuracy_Qwen/CodeQwen1.5-7B
dtype: string
- name: error_codes_Qwen/CodeQwen1.5-7B
dtype: string
- name: expert_accuracy_bigcode/starcoder2-7b
dtype: string
- name: error_codes_bigcode/starcoder2-7b
dtype: string
- name: expert_accuracy_google/codegemma-7b
dtype: string
- name: error_codes_google/codegemma-7b
dtype: string
- name: expert_accuracy_ibm-granite/granite-8b-code-base
dtype: string
- name: error_codes_ibm-granite/granite-8b-code-base
dtype: string
- name: expert_accuracy_meta-llama/CodeLlama-7b-hf
dtype: string
- name: error_codes_meta-llama/CodeLlama-7b-hf
dtype: string
splits:
- name: train
num_bytes: 21642567
num_examples: 500
download_size: 8934584
dataset_size: 21642567
- config_name: Dutch
features:
- name: file_id
dtype: string
- name: content
dtype: string
- name: repo
dtype: string
- name: path
dtype: string
- name: original_comment
dtype: string
- name: masked_data_Qwen/CodeQwen1.5-7B
dtype: string
- name: predict_Qwen/CodeQwen1.5-7B
dtype: string
- name: predicted_comment_Qwen/CodeQwen1.5-7B
dtype: string
- name: masked_data_bigcode/starcoder2-7b
dtype: string
- name: expert_accuracy_Qwen/CodeQwen1.5-7B
dtype: string
- name: error_codes_Qwen/CodeQwen1.5-7B
dtype: string
- name: predict_bigcode/starcoder2-7b
dtype: string
- name: predicted_comment_bigcode/starcoder2-7b
dtype: string
- name: masked_data_ibm-granite/granite-8b-code-base
dtype: string
- name: expert_accuracy_bigcode/starcoder2-7b
dtype: string
- name: error_codes_bigcode/starcoder2-7b
dtype: string
- name: predict_ibm-granite/granite-8b-code-base
dtype: string
- name: predicted_comment_ibm-granite/granite-8b-code-base
dtype: string
- name: masked_data_meta-llama/CodeLlama-7b-hf
dtype: string
- name: expert_accuracy_ibm-granite/granite-8b-code-base
dtype: string
- name: error_codes_ibm-granite/granite-8b-code-base
dtype: string
- name: predict_meta-llama/CodeLlama-7b-hf
dtype: string
- name: predicted_comment_meta-llama/CodeLlama-7b-hf
dtype: string
- name: masked_data_google/codegemma-7b
dtype: string
- name: expert_accuracy_meta-llama/CodeLlama-7b-hf
dtype: string
- name: error_codes_meta-llama/CodeLlama-7b-hf
dtype: string
- name: predict_google/codegemma-7b
dtype: string
- name: predicted_comment_google/codegemma-7b
dtype: string
- name: expert_accuracy_google/codegemma-7b
dtype: string
- name: error_codes_google/codegemma-7b
dtype: string
splits:
- name: train
num_bytes: 24071239
num_examples: 500
download_size: 9164593
dataset_size: 24071239
- config_name: English
features:
- name: file_id
dtype: string
- name: content
dtype: string
- name: repo
dtype: string
- name: path
dtype: string
- name: original_comment
dtype: string
- name: masked_data_Qwen/CodeQwen1.5-7B
dtype: string
- name: predict_Qwen/CodeQwen1.5-7B
dtype: string
- name: predicted_comment_Qwen/CodeQwen1.5-7B
dtype: string
- name: masked_data_bigcode/starcoder2-7b
dtype: string
- name: predict_bigcode/starcoder2-7b
dtype: string
- name: predicted_comment_bigcode/starcoder2-7b
dtype: string
- name: masked_data_ibm-granite/granite-8b-code-base
dtype: string
- name: predict_ibm-granite/granite-8b-code-base
dtype: string
- name: predicted_comment_ibm-granite/granite-8b-code-base
dtype: string
- name: masked_data_meta-llama/CodeLlama-7b-hf
dtype: string
- name: predict_meta-llama/CodeLlama-7b-hf
dtype: string
- name: predicted_comment_meta-llama/CodeLlama-7b-hf
dtype: string
- name: masked_data_google/codegemma-7b
dtype: string
- name: predict_google/codegemma-7b
dtype: string
- name: predicted_comment_google/codegemma-7b
dtype: string
- name: error_codes_Qwen/CodeQwen1.5-7B
dtype: string
- name: expert_accuracy_Qwen/CodeQwen1.5-7B
dtype: string
- name: error_codes_bigcode/starcoder2-7b
dtype: string
- name: expert_accuracy_bigcode/starcoder2-7b
dtype: string
- name: error_codes_ibm-granite/granite-8b-code-base
dtype: string
- name: expert_accuracy_ibm-granite/granite-8b-code-base
dtype: string
- name: error_codes_meta-llama/CodeLlama-7b-hf
dtype: string
- name: expert_accuracy_meta-llama/CodeLlama-7b-hf
dtype: string
- name: error_codes_google/codegemma-7b
dtype: string
- name: expert_accuracy_google/codegemma-7b
dtype: string
splits:
- name: train
num_bytes: 20538377
num_examples: 500
download_size: 8127065
dataset_size: 20538377
- config_name: Greek
features:
- name: file_id
dtype: string
- name: content
dtype: string
- name: repo
dtype: string
- name: path
dtype: string
- name: original_comment
dtype: string
- name: masked_data_Qwen/CodeQwen1.5-7B
dtype: string
- name: predict_Qwen/CodeQwen1.5-7B
dtype: string
- name: predicted_comment_Qwen/CodeQwen1.5-7B
dtype: string
- name: masked_data_bigcode/starcoder2-7b
dtype: string
- name: predict_bigcode/starcoder2-7b
dtype: string
- name: predicted_comment_bigcode/starcoder2-7b
dtype: string
- name: masked_data_ibm-granite/granite-8b-code-base
dtype: string
- name: predict_ibm-granite/granite-8b-code-base
dtype: string
- name: predicted_comment_ibm-granite/granite-8b-code-base
dtype: string
- name: masked_data_meta-llama/CodeLlama-7b-hf
dtype: string
- name: predict_meta-llama/CodeLlama-7b-hf
dtype: string
- name: predicted_comment_meta-llama/CodeLlama-7b-hf
dtype: string
- name: masked_data_google/codegemma-7b
dtype: string
- name: predict_google/codegemma-7b
dtype: string
- name: predicted_comment_google/codegemma-7b
dtype: string
- name: error_codes_bigcode/starcoder2-7b
dtype: string
- name: error_codes_ibm-granite/granite-8b-code-base
dtype: string
- name: error_codes_meta-llama/CodeLlama-7b-hf
dtype: string
- name: error_codes_google/codegemma-7b
dtype: string
- name: error_codes_Qwen/CodeQwen1.5-7B
dtype: string
- name: expert_accuracy_bigcode/starcoder2-7b
dtype: string
- name: expert_accuracy_ibm-granite/granite-8b-code-base
dtype: string
- name: expert_accuracy_meta-llama/CodeLlama-7b-hf
dtype: string
- name: expert_accuracy_google/codegemma-7b
dtype: string
- name: expert_accuracy_Qwen/CodeQwen1.5-7B
dtype: string
splits:
- name: train
num_bytes: 25646308
num_examples: 500
download_size: 9167874
dataset_size: 25646308
- config_name: Polish
features:
- name: file_id
dtype: string
- name: repo
dtype: string
- name: path
dtype: string
- name: content
dtype: string
- name: original_comment
dtype: string
- name: masked_data_Qwen/CodeQwen1.5-7B
dtype: string
- name: predict_Qwen/CodeQwen1.5-7B
dtype: string
- name: predicted_comment_Qwen/CodeQwen1.5-7B
dtype: string
- name: masked_data_bigcode/starcoder2-7b
dtype: string
- name: predict_bigcode/starcoder2-7b
dtype: string
- name: predicted_comment_bigcode/starcoder2-7b
dtype: string
- name: masked_data_ibm-granite/granite-8b-code-base
dtype: string
- name: predict_ibm-granite/granite-8b-code-base
dtype: string
- name: predicted_comment_ibm-granite/granite-8b-code-base
dtype: string
- name: masked_data_meta-llama/CodeLlama-7b-hf
dtype: string
- name: predict_meta-llama/CodeLlama-7b-hf
dtype: string
- name: predicted_comment_meta-llama/CodeLlama-7b-hf
dtype: string
- name: masked_data_google/codegemma-7b
dtype: string
- name: predict_google/codegemma-7b
dtype: string
- name: predicted_comment_google/codegemma-7b
dtype: string
- name: error_codes_Qwen/CodeQwen1.5-7B
dtype: string
- name: expert_accuracy_Qwen/CodeQwen1.5-7B
dtype: string
- name: error_codes_bigcode/starcoder2-7b
dtype: string
- name: expert_accuracy_bigcode/starcoder2-7b
dtype: string
- name: error_codes_ibm-granite/granite-8b-code-base
dtype: string
- name: expert_accuracy_ibm-granite/granite-8b-code-base
dtype: string
- name: error_codes_meta-llama/CodeLlama-7b-hf
dtype: string
- name: expert_accuracy_meta-llama/CodeLlama-7b-hf
dtype: string
- name: error_codes_google/codegemma-7b
dtype: string
- name: expert_accuracy_google/codegemma-7b
dtype: string
splits:
- name: train
num_bytes: 17774200
num_examples: 500
download_size: 7229968
dataset_size: 17774200
configs:
- config_name: Chinese
data_files:
- split: train
path: Chinese/train-*
- config_name: Dutch
data_files:
- split: train
path: Dutch/train-*
- config_name: English
data_files:
- split: train
path: English/train-*
- config_name: Greek
data_files:
- split: train
path: Greek/train-*
- config_name: Polish
data_files:
- split: train
path: Polish/train-*
---
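Dataset information:
This dataset contains five language configurations: Chinese, Dutch, English, Greek, and Polish.
### Common features
Every language configuration contains the following 30 features, all of type string. The column order differs between configurations; the list below groups the fields by role.
1. file_id: uniquely identifies a single sample file
2. content: the original text of the target code snippet
3. repo: the address or name of the remote repository the code belongs to
4. path: the relative path of the code file within that repository
5. original_comment: the original natural-language comment for the code
6. masked_data_Qwen/CodeQwen1.5-7B: the masked code data for the Qwen/CodeQwen1.5-7B model
7. predict_Qwen/CodeQwen1.5-7B: the code completion output generated by Qwen/CodeQwen1.5-7B
8. predicted_comment_Qwen/CodeQwen1.5-7B: the code comment generated by Qwen/CodeQwen1.5-7B
9. masked_data_bigcode/starcoder2-7b: the masked code data for the bigcode/starcoder2-7b model
10. predict_bigcode/starcoder2-7b: the code completion output generated by bigcode/starcoder2-7b
11. predicted_comment_bigcode/starcoder2-7b: the code comment generated by bigcode/starcoder2-7b
12. masked_data_ibm-granite/granite-8b-code-base: the masked code data for the ibm-granite/granite-8b-code-base model
13. predict_ibm-granite/granite-8b-code-base: the code completion output generated by ibm-granite/granite-8b-code-base
14. predicted_comment_ibm-granite/granite-8b-code-base: the code comment generated by ibm-granite/granite-8b-code-base
15. masked_data_meta-llama/CodeLlama-7b-hf: the masked code data for the meta-llama/CodeLlama-7b-hf model
16. predict_meta-llama/CodeLlama-7b-hf: the code completion output generated by meta-llama/CodeLlama-7b-hf
17. predicted_comment_meta-llama/CodeLlama-7b-hf: the code comment generated by meta-llama/CodeLlama-7b-hf
18. masked_data_google/codegemma-7b: the masked code data for the google/codegemma-7b model
19. predict_google/codegemma-7b: the code completion output generated by google/codegemma-7b
20. predicted_comment_google/codegemma-7b: the code comment generated by google/codegemma-7b
21. error_codes_Qwen/CodeQwen1.5-7B: the error codes recorded for the output of Qwen/CodeQwen1.5-7B
22. expert_accuracy_Qwen/CodeQwen1.5-7B: the expert-assessed accuracy of the output of Qwen/CodeQwen1.5-7B
23. error_codes_bigcode/starcoder2-7b: the error codes recorded for the output of bigcode/starcoder2-7b
24. expert_accuracy_bigcode/starcoder2-7b: the expert-assessed accuracy of the output of bigcode/starcoder2-7b
25. error_codes_google/codegemma-7b: the error codes recorded for the output of google/codegemma-7b
26. expert_accuracy_google/codegemma-7b: the expert-assessed accuracy of the output of google/codegemma-7b
27. error_codes_ibm-granite/granite-8b-code-base: the error codes recorded for the output of ibm-granite/granite-8b-code-base
28. expert_accuracy_ibm-granite/granite-8b-code-base: the expert-assessed accuracy of the output of ibm-granite/granite-8b-code-base
29. error_codes_meta-llama/CodeLlama-7b-hf: the error codes recorded for the output of meta-llama/CodeLlama-7b-hf
30. expert_accuracy_meta-llama/CodeLlama-7b-hf: the expert-assessed accuracy of the output of meta-llama/CodeLlama-7b-hf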
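The 30 feature names follow a regular pattern: five shared columns plus five per-model columns for each of the five models. The sketch below rebuilds the names from that pattern, which can be handy for selecting all columns of one model. The helper names (`model_columns`, `all_columns`) are illustrative assumptions and not part of the dataset; the model identifiers and prefixes are copied from the YAML metadata.

```python
# Model identifiers exactly as they appear in the dataset's column names.
MODELS = [
    "Qwen/CodeQwen1.5-7B",
    "bigcode/starcoder2-7b",
    "ibm-granite/granite-8b-code-base",
    "meta-llama/CodeLlama-7b-hf",
    "google/codegemma-7b",
]

# Columns shared by every row, independent of any model.
BASE_COLUMNS = ["file_id", "content", "repo", "path", "original_comment"]

# Prefixes that are combined with a model identifier as "<prefix>_<model>".
PER_MODEL_PREFIXES = [
    "masked_data",
    "predict",
    "predicted_comment",
    "expert_accuracy",
    "error_codes",
]


def model_columns(model: str) -> list[str]:
    """Return the five column names that belong to one model (hypothetical helper)."""
    return [f"{prefix}_{model}" for prefix in PER_MODEL_PREFIXES]


def all_columns() -> list[str]:
    """Return all 30 column names; the order here is illustrative only."""
    columns = list(BASE_COLUMNS)
    for model in MODELS:
        columns.extend(model_columns(model))
    return columns


if __name__ == "__main__":
    for name in model_columns("google/codegemma-7b"):
        print(name)
```

Because the column order differs between configurations, comparing column sets rather than lists is the safer check when validating a loaded configuration against this pattern.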
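### Split details
Every language configuration contains only a train split. The figures for each configuration are as follows (all sizes in bytes):
1. Chinese: train split num_bytes 21642567, 500 examples; download size 8934584, dataset size 21642567
2. Dutch: train split num_bytes 24071239, 500 examples; download size 9164593, dataset size 24071239
3. English: train split num_bytes 20538377, 500 examples; download size 8127065, dataset size 20538377
4. Greek: train split num_bytes 25646308, 500 examples; download size 9167874, dataset size 25646308
5. Polish: train split num_bytes 17774200, 500 examples; download size 7229968, dataset size 17774200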
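To put the per-configuration figures in perspective, the following sketch totals them and computes the ratio of download size to dataset size for each configuration. The numbers are copied from the metadata above; the dictionary layout and variable names are assumptions made for illustration.

```python
# (dataset_size, download_size, num_examples) per configuration, in bytes,
# copied from the dataset metadata.
SPLITS = {
    "Chinese": (21642567, 8934584, 500),
    "Dutch": (24071239, 9164593, 500),
    "English": (20538377, 8127065, 500),
    "Greek": (25646308, 9167874, 500),
    "Polish": (17774200, 7229968, 500),
}

total_dataset_size = sum(size for size, _, _ in SPLITS.values())
total_download_size = sum(download for _, download, _ in SPLITS.values())
total_examples = sum(count for _, _, count in SPLITS.values())

# Ratio of download size to dataset size for each configuration.
download_ratio = {
    name: download / size for name, (size, download, _) in SPLITS.items()
}

if __name__ == "__main__":
    print(f"total dataset size:  {total_dataset_size} bytes")
    print(f"total download size: {total_download_size} bytes")
    print(f"total examples:      {total_examples}")
    for name, ratio in download_ratio.items():
        print(f"{name}: download is {ratio:.1%} of dataset size")
```

Across all five configurations this gives 2500 examples, 109672691 bytes of data, and 42624084 bytes to download.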
### Data file configuration
Each of the five language configurations maps its train split to data files matching `{config_name}/train-*`, where the configuration name is one of Chinese, Dutch, English, Greek, or Polish.
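A minimal sketch of loading one configuration with the Hugging Face `datasets` library is shown below. It assumes that `datasets` is installed, that the Hub (or a mirror) is reachable, and that the repository id matches the name in the download link; `data_file_pattern` and `load_config` are hypothetical helper names, not part of the dataset.

```python
# Repository id taken from the download link; configuration names from the metadata.
REPO_ID = "AISE-TUDelft/multilingual-code-comments-fixed-3"
CONFIGS = ["Chinese", "Dutch", "English", "Greek", "Polish"]


def data_file_pattern(config: str) -> str:
    """Return the data file glob that the metadata declares for a configuration."""
    return f"{config}/train-*"


def load_config(config: str):
    """Load the train split of one language configuration (needs network access)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration: {config!r}")
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split="train")


if __name__ == "__main__":
    dataset = load_config("Dutch")
    print(dataset.num_rows)
    print(dataset.column_names)
```

Only a train split exists, so passing `split="train"` returns a single `Dataset` rather than a `DatasetDict`.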
Provider:
AISE-TUDelft



