
AISE-TUDelft/multilingual-code-comments-fixed-3

Source: Hugging Face. Updated 2026-02-16. Indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/AISE-TUDelft/multilingual-code-comments-fixed-3
Resource description:
---
dataset_info:
- config_name: Chinese
  features:
  - name: file_id
    dtype: string
  - name: content
    dtype: string
  - name: repo
    dtype: string
  - name: path
    dtype: string
  - name: original_comment
    dtype: string
  - name: masked_data_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predict_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predicted_comment_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: masked_data_bigcode/starcoder2-7b
    dtype: string
  - name: predict_bigcode/starcoder2-7b
    dtype: string
  - name: predicted_comment_bigcode/starcoder2-7b
    dtype: string
  - name: masked_data_ibm-granite/granite-8b-code-base
    dtype: string
  - name: predict_ibm-granite/granite-8b-code-base
    dtype: string
  - name: predicted_comment_ibm-granite/granite-8b-code-base
    dtype: string
  - name: masked_data_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: predict_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: predicted_comment_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: masked_data_google/codegemma-7b
    dtype: string
  - name: predict_google/codegemma-7b
    dtype: string
  - name: predicted_comment_google/codegemma-7b
    dtype: string
  - name: expert_accuracy_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: error_codes_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: expert_accuracy_bigcode/starcoder2-7b
    dtype: string
  - name: error_codes_bigcode/starcoder2-7b
    dtype: string
  - name: expert_accuracy_google/codegemma-7b
    dtype: string
  - name: error_codes_google/codegemma-7b
    dtype: string
  - name: expert_accuracy_ibm-granite/granite-8b-code-base
    dtype: string
  - name: error_codes_ibm-granite/granite-8b-code-base
    dtype: string
  - name: expert_accuracy_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: error_codes_meta-llama/CodeLlama-7b-hf
    dtype: string
  splits:
  - name: train
    num_bytes: 21642567
    num_examples: 500
  download_size: 8934584
  dataset_size: 21642567
- config_name: Dutch
  features:
  - name: file_id
    dtype: string
  - name: content
    dtype: string
  - name: repo
    dtype: string
  - name: path
    dtype: string
  - name: original_comment
    dtype: string
  - name: masked_data_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predict_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predicted_comment_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: masked_data_bigcode/starcoder2-7b
    dtype: string
  - name: expert_accuracy_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: error_codes_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predict_bigcode/starcoder2-7b
    dtype: string
  - name: predicted_comment_bigcode/starcoder2-7b
    dtype: string
  - name: masked_data_ibm-granite/granite-8b-code-base
    dtype: string
  - name: expert_accuracy_bigcode/starcoder2-7b
    dtype: string
  - name: error_codes_bigcode/starcoder2-7b
    dtype: string
  - name: predict_ibm-granite/granite-8b-code-base
    dtype: string
  - name: predicted_comment_ibm-granite/granite-8b-code-base
    dtype: string
  - name: masked_data_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: expert_accuracy_ibm-granite/granite-8b-code-base
    dtype: string
  - name: error_codes_ibm-granite/granite-8b-code-base
    dtype: string
  - name: predict_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: predicted_comment_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: masked_data_google/codegemma-7b
    dtype: string
  - name: expert_accuracy_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: error_codes_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: predict_google/codegemma-7b
    dtype: string
  - name: predicted_comment_google/codegemma-7b
    dtype: string
  - name: expert_accuracy_google/codegemma-7b
    dtype: string
  - name: error_codes_google/codegemma-7b
    dtype: string
  splits:
  - name: train
    num_bytes: 24071239
    num_examples: 500
  download_size: 9164593
  dataset_size: 24071239
- config_name: English
  features:
  - name: file_id
    dtype: string
  - name: content
    dtype: string
  - name: repo
    dtype: string
  - name: path
    dtype: string
  - name: original_comment
    dtype: string
  - name: masked_data_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predict_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predicted_comment_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: masked_data_bigcode/starcoder2-7b
    dtype: string
  - name: predict_bigcode/starcoder2-7b
    dtype: string
  - name: predicted_comment_bigcode/starcoder2-7b
    dtype: string
  - name: masked_data_ibm-granite/granite-8b-code-base
    dtype: string
  - name: predict_ibm-granite/granite-8b-code-base
    dtype: string
  - name: predicted_comment_ibm-granite/granite-8b-code-base
    dtype: string
  - name: masked_data_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: predict_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: predicted_comment_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: masked_data_google/codegemma-7b
    dtype: string
  - name: predict_google/codegemma-7b
    dtype: string
  - name: predicted_comment_google/codegemma-7b
    dtype: string
  - name: error_codes_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: expert_accuracy_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: error_codes_bigcode/starcoder2-7b
    dtype: string
  - name: expert_accuracy_bigcode/starcoder2-7b
    dtype: string
  - name: error_codes_ibm-granite/granite-8b-code-base
    dtype: string
  - name: expert_accuracy_ibm-granite/granite-8b-code-base
    dtype: string
  - name: error_codes_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: expert_accuracy_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: error_codes_google/codegemma-7b
    dtype: string
  - name: expert_accuracy_google/codegemma-7b
    dtype: string
  splits:
  - name: train
    num_bytes: 20538377
    num_examples: 500
  download_size: 8127065
  dataset_size: 20538377
- config_name: Greek
  features:
  - name: file_id
    dtype: string
  - name: content
    dtype: string
  - name: repo
    dtype: string
  - name: path
    dtype: string
  - name: original_comment
    dtype: string
  - name: masked_data_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predict_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predicted_comment_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: masked_data_bigcode/starcoder2-7b
    dtype: string
  - name: predict_bigcode/starcoder2-7b
    dtype: string
  - name: predicted_comment_bigcode/starcoder2-7b
    dtype: string
  - name: masked_data_ibm-granite/granite-8b-code-base
    dtype: string
  - name: predict_ibm-granite/granite-8b-code-base
    dtype: string
  - name: predicted_comment_ibm-granite/granite-8b-code-base
    dtype: string
  - name: masked_data_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: predict_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: predicted_comment_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: masked_data_google/codegemma-7b
    dtype: string
  - name: predict_google/codegemma-7b
    dtype: string
  - name: predicted_comment_google/codegemma-7b
    dtype: string
  - name: error_codes_bigcode/starcoder2-7b
    dtype: string
  - name: error_codes_ibm-granite/granite-8b-code-base
    dtype: string
  - name: error_codes_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: error_codes_google/codegemma-7b
    dtype: string
  - name: error_codes_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: expert_accuracy_bigcode/starcoder2-7b
    dtype: string
  - name: expert_accuracy_ibm-granite/granite-8b-code-base
    dtype: string
  - name: expert_accuracy_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: expert_accuracy_google/codegemma-7b
    dtype: string
  - name: expert_accuracy_Qwen/CodeQwen1.5-7B
    dtype: string
  splits:
  - name: train
    num_bytes: 25646308
    num_examples: 500
  download_size: 9167874
  dataset_size: 25646308
- config_name: Polish
  features:
  - name: file_id
    dtype: string
  - name: repo
    dtype: string
  - name: path
    dtype: string
  - name: content
    dtype: string
  - name: original_comment
    dtype: string
  - name: masked_data_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predict_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predicted_comment_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: masked_data_bigcode/starcoder2-7b
    dtype: string
  - name: predict_bigcode/starcoder2-7b
    dtype: string
  - name: predicted_comment_bigcode/starcoder2-7b
    dtype: string
  - name: masked_data_ibm-granite/granite-8b-code-base
    dtype: string
  - name: predict_ibm-granite/granite-8b-code-base
    dtype: string
  - name: predicted_comment_ibm-granite/granite-8b-code-base
    dtype: string
  - name: masked_data_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: predict_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: predicted_comment_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: masked_data_google/codegemma-7b
    dtype: string
  - name: predict_google/codegemma-7b
    dtype: string
  - name: predicted_comment_google/codegemma-7b
    dtype: string
  - name: error_codes_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: expert_accuracy_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: error_codes_bigcode/starcoder2-7b
    dtype: string
  - name: expert_accuracy_bigcode/starcoder2-7b
    dtype: string
  - name: error_codes_ibm-granite/granite-8b-code-base
    dtype: string
  - name: expert_accuracy_ibm-granite/granite-8b-code-base
    dtype: string
  - name: error_codes_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: expert_accuracy_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: error_codes_google/codegemma-7b
    dtype: string
  - name: expert_accuracy_google/codegemma-7b
    dtype: string
  splits:
  - name: train
    num_bytes: 17774200
    num_examples: 500
  download_size: 7229968
  dataset_size: 17774200
configs:
- config_name: Chinese
  data_files:
  - split: train
    path: Chinese/train-*
- config_name: Dutch
  data_files:
  - split: train
    path: Dutch/train-*
- config_name: English
  data_files:
  - split: train
    path: English/train-*
- config_name: Greek
  data_files:
  - split: train
    path: Greek/train-*
- config_name: Polish
  data_files:
  - split: train
    path: Polish/train-*
---
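The metadata above follows a regular pattern: five shared columns plus five per-model columns for each of the five models. The sketch below rebuilds the expected column names from that pattern and shows one way a configuration might be loaded. It assumes the Hugging Face `datasets` package and network access for `load_config`; the helper names (`model_columns`, `expected_columns`, `load_config`) are hypothetical and not part of the dataset.

```python
from typing import List

# Identifiers copied from the dataset card metadata.
DATASET_ID = "AISE-TUDelft/multilingual-code-comments-fixed-3"
CONFIGS = ["Chinese", "Dutch", "English", "Greek", "Polish"]
MODELS = [
    "Qwen/CodeQwen1.5-7B",
    "bigcode/starcoder2-7b",
    "ibm-granite/granite-8b-code-base",
    "meta-llama/CodeLlama-7b-hf",
    "google/codegemma-7b",
]
BASE_COLUMNS = ["file_id", "content", "repo", "path", "original_comment"]
PER_MODEL_PREFIXES = [
    "masked_data",
    "predict",
    "predicted_comment",
    "expert_accuracy",
    "error_codes",
]


def model_columns(model_id: str) -> List[str]:
    """Return the five column names that belong to one model."""
    return [f"{prefix}_{model_id}" for prefix in PER_MODEL_PREFIXES]


def expected_columns() -> List[str]:
    """Return all 30 column names (order is illustrative, not canonical)."""
    columns = list(BASE_COLUMNS)
    for model_id in MODELS:
        columns.extend(model_columns(model_id))
    return columns


def load_config(config_name: str):
    """Load the train split of one language configuration.

    Assumption: the `datasets` package is installed and the Hub is reachable.
    """
    if config_name not in CONFIGS:
        raise ValueError(f"unknown config {config_name!r}; expected one of {CONFIGS}")
    from datasets import load_dataset  # imported lazily so the helpers work offline

    return load_dataset(DATASET_ID, config_name, split="train")
```

Because the column order differs between configurations in the metadata (Polish, for example, lists `repo` and `path` before `content`), a schema check is safer as a set comparison, such as `set(expected_columns()) - set(load_config("Greek").column_names)`, than as an ordered one.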

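The dataset contains five language configurations: Chinese, Dutch, English, Greek, and Polish.

### Common features

Every language configuration contains the following fields, all of type string. The set of fields is the same in each configuration, although their order differs between configurations.

1. `file_id`: unique identifier of the sample file
2. `content`: original text of the code
3. `repo`: name or address of the repository the code belongs to
4. `path`: relative path of the code file within that repository
5. `original_comment`: original natural-language comment for the code
6. `masked_data_Qwen/CodeQwen1.5-7B`: masked code data for the Qwen/CodeQwen1.5-7B model
7. `predict_Qwen/CodeQwen1.5-7B`: completion generated by the Qwen/CodeQwen1.5-7B model
8. `predicted_comment_Qwen/CodeQwen1.5-7B`: code comment generated by the Qwen/CodeQwen1.5-7B model
9. `masked_data_bigcode/starcoder2-7b`: masked code data for the bigcode/starcoder2-7b model
10. `predict_bigcode/starcoder2-7b`: completion generated by the bigcode/starcoder2-7b model
11. `predicted_comment_bigcode/starcoder2-7b`: code comment generated by the bigcode/starcoder2-7b model
12. `masked_data_ibm-granite/granite-8b-code-base`: masked code data for the ibm-granite/granite-8b-code-base model
13. `predict_ibm-granite/granite-8b-code-base`: completion generated by the ibm-granite/granite-8b-code-base model
14. `predicted_comment_ibm-granite/granite-8b-code-base`: code comment generated by the ibm-granite/granite-8b-code-base model
15. `masked_data_meta-llama/CodeLlama-7b-hf`: masked code data for the meta-llama/CodeLlama-7b-hf model
16. `predict_meta-llama/CodeLlama-7b-hf`: completion generated by the meta-llama/CodeLlama-7b-hf model
17. `predicted_comment_meta-llama/CodeLlama-7b-hf`: code comment generated by the meta-llama/CodeLlama-7b-hf model
18. `masked_data_google/codegemma-7b`: masked code data for the google/codegemma-7b model
19. `predict_google/codegemma-7b`: completion generated by the google/codegemma-7b model
20. `predicted_comment_google/codegemma-7b`: code comment generated by the google/codegemma-7b model
21. `error_codes_Qwen/CodeQwen1.5-7B`: error codes assigned to the Qwen/CodeQwen1.5-7B output
22. `expert_accuracy_Qwen/CodeQwen1.5-7B`: expert-assessed accuracy of the Qwen/CodeQwen1.5-7B output
23. `error_codes_bigcode/starcoder2-7b`: error codes assigned to the bigcode/starcoder2-7b output
24. `expert_accuracy_bigcode/starcoder2-7b`: expert-assessed accuracy of the bigcode/starcoder2-7b output
25. `error_codes_google/codegemma-7b`: error codes assigned to the google/codegemma-7b output
26. `expert_accuracy_google/codegemma-7b`: expert-assessed accuracy of the google/codegemma-7b output
27. `error_codes_ibm-granite/granite-8b-code-base`: error codes assigned to the ibm-granite/granite-8b-code-base output
28. `expert_accuracy_ibm-granite/granite-8b-code-base`: expert-assessed accuracy of the ibm-granite/granite-8b-code-base output
29. `error_codes_meta-llama/CodeLlama-7b-hf`: error codes assigned to the meta-llama/CodeLlama-7b-hf output
30. `expert_accuracy_meta-llama/CodeLlama-7b-hf`: expert-assessed accuracy of the meta-llama/CodeLlama-7b-hf output

### Split details

Every language configuration contains only a train split. Sizes are in bytes.

1. Chinese: train split of 21642567 bytes with 500 examples; download size 8934584, total dataset size 21642567
2. Dutch: train split of 24071239 bytes with 500 examples; download size 9164593, total dataset size 24071239
3. English: train split of 20538377 bytes with 500 examples; download size 8127065, total dataset size 20538377
4. Greek: train split of 25646308 bytes with 500 examples; download size 9167874, total dataset size 25646308
5. Polish: train split of 17774200 bytes with 500 examples; download size 7229968, total dataset size 17774200

### Data file configuration

Each of the five language configurations maps its train split to data files matching `{language}/train-*`, where the language name is one of Chinese, Dutch, English, Greek, or Polish.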
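The per-configuration figures listed above can be combined into totals with a few lines of arithmetic. The sketch below is only an illustration: the numbers are copied from the split details, the dictionary keys mirror the metadata keys, and the helper names (`summarize`, `bytes_per_example`) are hypothetical.

```python
from typing import Dict

# Figures copied from the split details; sizes are in bytes.
SPLIT_STATS: Dict[str, Dict[str, int]] = {
    "Chinese": {"num_examples": 500, "dataset_size": 21642567, "download_size": 8934584},
    "Dutch": {"num_examples": 500, "dataset_size": 24071239, "download_size": 9164593},
    "English": {"num_examples": 500, "dataset_size": 20538377, "download_size": 8127065},
    "Greek": {"num_examples": 500, "dataset_size": 25646308, "download_size": 9167874},
    "Polish": {"num_examples": 500, "dataset_size": 17774200, "download_size": 7229968},
}


def summarize(stats: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Sum example counts and sizes over all configurations."""
    keys = ("num_examples", "dataset_size", "download_size")
    return {key: sum(config[key] for config in stats.values()) for key in keys}


def bytes_per_example(stats: Dict[str, Dict[str, int]], config_name: str) -> float:
    """Average uncompressed size of one example in a configuration."""
    config = stats[config_name]
    return config["dataset_size"] / config["num_examples"]


if __name__ == "__main__":
    print(summarize(SPLIT_STATS))
    # {'num_examples': 2500, 'dataset_size': 109672691, 'download_size': 42624084}
```

Across all five configurations this gives 2500 examples, about 110 MB of data, and about 43 MB to download.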
Provider:
AISE-TUDelft