AISE-TUDelft/multilingual-code-comments-fixed-3

Name: AISE-TUDelft/multilingual-code-comments-fixed-3
Creator: AISE-TUDelft
Published: 2026-02-16 07:03:19
License: No description available

Source: Hugging Face; updated 2026-02-16; indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/AISE-TUDelft/multilingual-code-comments-fixed-3
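One way to fetch a single language configuration is through the Hugging Face `datasets` library. The following is a minimal sketch, not an official loader: `load_language` and its `use_mirror` flag are hypothetical names, the five configuration names are taken from the dataset metadata, and it is an assumption that the mirror above can be selected through the `HF_ENDPOINT` environment variable.

```python
import os

# Repository id and configuration names as they appear in the dataset metadata.
REPO_ID = "AISE-TUDelft/multilingual-code-comments-fixed-3"
CONFIGS = ("Chinese", "Dutch", "English", "Greek", "Polish")


def load_language(language: str, use_mirror: bool = False):
    """Load the train split of one language configuration (hypothetical helper)."""
    if language not in CONFIGS:
        raise ValueError(f"unknown config {language!r}; expected one of {CONFIGS}")
    if use_mirror:
        # Assumption: the mirror is reachable via HF_ENDPOINT. The variable only
        # takes effect if it is set before huggingface_hub is first imported.
        os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")
    # Imported lazily so the module itself works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, language, split="train")
```

Calling `load_language("Dutch")` needs network access and an installed `datasets` package; an unknown configuration name fails early with a `ValueError` before any download is attempted.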

Resource description:

---
dataset_info:
- config_name: Chinese
  features:
  - name: file_id
    dtype: string
  - name: content
    dtype: string
  - name: repo
    dtype: string
  - name: path
    dtype: string
  - name: original_comment
    dtype: string
  - name: masked_data_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predict_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predicted_comment_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: masked_data_bigcode/starcoder2-7b
    dtype: string
  - name: predict_bigcode/starcoder2-7b
    dtype: string
  - name: predicted_comment_bigcode/starcoder2-7b
    dtype: string
  - name: masked_data_ibm-granite/granite-8b-code-base
    dtype: string
  - name: predict_ibm-granite/granite-8b-code-base
    dtype: string
  - name: predicted_comment_ibm-granite/granite-8b-code-base
    dtype: string
  - name: masked_data_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: predict_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: predicted_comment_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: masked_data_google/codegemma-7b
    dtype: string
  - name: predict_google/codegemma-7b
    dtype: string
  - name: predicted_comment_google/codegemma-7b
    dtype: string
  - name: expert_accuracy_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: error_codes_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: expert_accuracy_bigcode/starcoder2-7b
    dtype: string
  - name: error_codes_bigcode/starcoder2-7b
    dtype: string
  - name: expert_accuracy_google/codegemma-7b
    dtype: string
  - name: error_codes_google/codegemma-7b
    dtype: string
  - name: expert_accuracy_ibm-granite/granite-8b-code-base
    dtype: string
  - name: error_codes_ibm-granite/granite-8b-code-base
    dtype: string
  - name: expert_accuracy_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: error_codes_meta-llama/CodeLlama-7b-hf
    dtype: string
  splits:
  - name: train
    num_bytes: 21642567
    num_examples: 500
  download_size: 8934584
  dataset_size: 21642567
- config_name: Dutch
  features:
  - name: file_id
    dtype: string
  - name: content
    dtype: string
  - name: repo
    dtype: string
  - name: path
    dtype: string
  - name: original_comment
    dtype: string
  - name: masked_data_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predict_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predicted_comment_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: masked_data_bigcode/starcoder2-7b
    dtype: string
  - name: expert_accuracy_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: error_codes_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predict_bigcode/starcoder2-7b
    dtype: string
  - name: predicted_comment_bigcode/starcoder2-7b
    dtype: string
  - name: masked_data_ibm-granite/granite-8b-code-base
    dtype: string
  - name: expert_accuracy_bigcode/starcoder2-7b
    dtype: string
  - name: error_codes_bigcode/starcoder2-7b
    dtype: string
  - name: predict_ibm-granite/granite-8b-code-base
    dtype: string
  - name: predicted_comment_ibm-granite/granite-8b-code-base
    dtype: string
  - name: masked_data_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: expert_accuracy_ibm-granite/granite-8b-code-base
    dtype: string
  - name: error_codes_ibm-granite/granite-8b-code-base
    dtype: string
  - name: predict_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: predicted_comment_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: masked_data_google/codegemma-7b
    dtype: string
  - name: expert_accuracy_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: error_codes_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: predict_google/codegemma-7b
    dtype: string
  - name: predicted_comment_google/codegemma-7b
    dtype: string
  - name: expert_accuracy_google/codegemma-7b
    dtype: string
  - name: error_codes_google/codegemma-7b
    dtype: string
  splits:
  - name: train
    num_bytes: 24071239
    num_examples: 500
  download_size: 9164593
  dataset_size: 24071239
- config_name: English
  features:
  - name: file_id
    dtype: string
  - name: content
    dtype: string
  - name: repo
    dtype: string
  - name: path
    dtype: string
  - name: original_comment
    dtype: string
  - name: masked_data_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predict_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predicted_comment_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: masked_data_bigcode/starcoder2-7b
    dtype: string
  - name: predict_bigcode/starcoder2-7b
    dtype: string
  - name: predicted_comment_bigcode/starcoder2-7b
    dtype: string
  - name: masked_data_ibm-granite/granite-8b-code-base
    dtype: string
  - name: predict_ibm-granite/granite-8b-code-base
    dtype: string
  - name: predicted_comment_ibm-granite/granite-8b-code-base
    dtype: string
  - name: masked_data_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: predict_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: predicted_comment_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: masked_data_google/codegemma-7b
    dtype: string
  - name: predict_google/codegemma-7b
    dtype: string
  - name: predicted_comment_google/codegemma-7b
    dtype: string
  - name: error_codes_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: expert_accuracy_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: error_codes_bigcode/starcoder2-7b
    dtype: string
  - name: expert_accuracy_bigcode/starcoder2-7b
    dtype: string
  - name: error_codes_ibm-granite/granite-8b-code-base
    dtype: string
  - name: expert_accuracy_ibm-granite/granite-8b-code-base
    dtype: string
  - name: error_codes_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: expert_accuracy_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: error_codes_google/codegemma-7b
    dtype: string
  - name: expert_accuracy_google/codegemma-7b
    dtype: string
  splits:
  - name: train
    num_bytes: 20538377
    num_examples: 500
  download_size: 8127065
  dataset_size: 20538377
- config_name: Greek
  features:
  - name: file_id
    dtype: string
  - name: content
    dtype: string
  - name: repo
    dtype: string
  - name: path
    dtype: string
  - name: original_comment
    dtype: string
  - name: masked_data_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predict_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predicted_comment_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: masked_data_bigcode/starcoder2-7b
    dtype: string
  - name: predict_bigcode/starcoder2-7b
    dtype: string
  - name: predicted_comment_bigcode/starcoder2-7b
    dtype: string
  - name: masked_data_ibm-granite/granite-8b-code-base
    dtype: string
  - name: predict_ibm-granite/granite-8b-code-base
    dtype: string
  - name: predicted_comment_ibm-granite/granite-8b-code-base
    dtype: string
  - name: masked_data_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: predict_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: predicted_comment_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: masked_data_google/codegemma-7b
    dtype: string
  - name: predict_google/codegemma-7b
    dtype: string
  - name: predicted_comment_google/codegemma-7b
    dtype: string
  - name: error_codes_bigcode/starcoder2-7b
    dtype: string
  - name: error_codes_ibm-granite/granite-8b-code-base
    dtype: string
  - name: error_codes_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: error_codes_google/codegemma-7b
    dtype: string
  - name: error_codes_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: expert_accuracy_bigcode/starcoder2-7b
    dtype: string
  - name: expert_accuracy_ibm-granite/granite-8b-code-base
    dtype: string
  - name: expert_accuracy_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: expert_accuracy_google/codegemma-7b
    dtype: string
  - name: expert_accuracy_Qwen/CodeQwen1.5-7B
    dtype: string
  splits:
  - name: train
    num_bytes: 25646308
    num_examples: 500
  download_size: 9167874
  dataset_size: 25646308
- config_name: Polish
  features:
  - name: file_id
    dtype: string
  - name: repo
    dtype: string
  - name: path
    dtype: string
  - name: content
    dtype: string
  - name: original_comment
    dtype: string
  - name: masked_data_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predict_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: predicted_comment_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: masked_data_bigcode/starcoder2-7b
    dtype: string
  - name: predict_bigcode/starcoder2-7b
    dtype: string
  - name: predicted_comment_bigcode/starcoder2-7b
    dtype: string
  - name: masked_data_ibm-granite/granite-8b-code-base
    dtype: string
  - name: predict_ibm-granite/granite-8b-code-base
    dtype: string
  - name: predicted_comment_ibm-granite/granite-8b-code-base
    dtype: string
  - name: masked_data_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: predict_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: predicted_comment_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: masked_data_google/codegemma-7b
    dtype: string
  - name: predict_google/codegemma-7b
    dtype: string
  - name: predicted_comment_google/codegemma-7b
    dtype: string
  - name: error_codes_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: expert_accuracy_Qwen/CodeQwen1.5-7B
    dtype: string
  - name: error_codes_bigcode/starcoder2-7b
    dtype: string
  - name: expert_accuracy_bigcode/starcoder2-7b
    dtype: string
  - name: error_codes_ibm-granite/granite-8b-code-base
    dtype: string
  - name: expert_accuracy_ibm-granite/granite-8b-code-base
    dtype: string
  - name: error_codes_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: expert_accuracy_meta-llama/CodeLlama-7b-hf
    dtype: string
  - name: error_codes_google/codegemma-7b
    dtype: string
  - name: expert_accuracy_google/codegemma-7b
    dtype: string
  splits:
  - name: train
    num_bytes: 17774200
    num_examples: 500
  download_size: 7229968
  dataset_size: 17774200
configs:
- config_name: Chinese
  data_files:
  - split: train
    path: Chinese/train-*
- config_name: Dutch
  data_files:
  - split: train
    path: Dutch/train-*
- config_name: English
  data_files:
  - split: train
    path: English/train-*
- config_name: Greek
  data_files:
  - split: train
    path: Greek/train-*
- config_name: Polish
  data_files:
  - split: train
    path: Polish/train-*
---
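The split metadata above lends itself to a quick size check. The sketch below sums the per-configuration figures and computes the ratio of `download_size` to `dataset_size`; the numbers are copied by hand from the metadata, and `SPLITS` and `summarize` are illustrative names rather than anything the dataset provides.

```python
# Split metadata copied from the dataset_info block (all sizes in bytes).
SPLITS = {
    "Chinese": {"num_bytes": 21642567, "download_size": 8934584, "num_examples": 500},
    "Dutch": {"num_bytes": 24071239, "download_size": 9164593, "num_examples": 500},
    "English": {"num_bytes": 20538377, "download_size": 8127065, "num_examples": 500},
    "Greek": {"num_bytes": 25646308, "download_size": 9167874, "num_examples": 500},
    "Polish": {"num_bytes": 17774200, "download_size": 7229968, "num_examples": 500},
}


def summarize(splits):
    """Return overall totals and, per config, download_size / num_bytes."""
    totals = {
        key: sum(info[key] for info in splits.values())
        for key in ("num_bytes", "download_size", "num_examples")
    }
    ratios = {
        name: info["download_size"] / info["num_bytes"]
        for name, info in splits.items()
    }
    return totals, ratios


if __name__ == "__main__":
    totals, ratios = summarize(SPLITS)
    print(totals)
    for name, ratio in ratios.items():
        print(f"{name}: download is {ratio:.1%} of dataset_size")
```

Across the five configurations this gives 2,500 examples, 109,672,691 bytes of dataset size, and 42,624,084 bytes to download.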

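Dataset information: the dataset contains five language configurations: Chinese, Dutch, English, Greek, and Polish.

### Common features

Every language configuration contains the following 30 fields, all of type string. The column order differs between configurations.

1. `file_id`: uniquely identifies a single sample file
2. `content`: the raw text of the target code
3. `repo`: the name or address of the remote repository the code belongs to
4. `path`: the relative path of the code file within that repository
5. `original_comment`: the original natural-language comment for the code
6. `masked_data_Qwen/CodeQwen1.5-7B`: masked code data for the Qwen/CodeQwen1.5-7B model
7. `predict_Qwen/CodeQwen1.5-7B`: code completion output generated by Qwen/CodeQwen1.5-7B
8. `predicted_comment_Qwen/CodeQwen1.5-7B`: code comment generated by Qwen/CodeQwen1.5-7B
9. `masked_data_bigcode/starcoder2-7b`: masked code data for the bigcode/starcoder2-7b model
10. `predict_bigcode/starcoder2-7b`: code completion output generated by bigcode/starcoder2-7b
11. `predicted_comment_bigcode/starcoder2-7b`: code comment generated by bigcode/starcoder2-7b
12. `masked_data_ibm-granite/granite-8b-code-base`: masked code data for the ibm-granite/granite-8b-code-base model
13. `predict_ibm-granite/granite-8b-code-base`: code completion output generated by ibm-granite/granite-8b-code-base
14. `predicted_comment_ibm-granite/granite-8b-code-base`: code comment generated by ibm-granite/granite-8b-code-base
15. `masked_data_meta-llama/CodeLlama-7b-hf`: masked code data for the meta-llama/CodeLlama-7b-hf model
16. `predict_meta-llama/CodeLlama-7b-hf`: code completion output generated by meta-llama/CodeLlama-7b-hf
17. `predicted_comment_meta-llama/CodeLlama-7b-hf`: code comment generated by meta-llama/CodeLlama-7b-hf
18. `masked_data_google/codegemma-7b`: masked code data for the google/codegemma-7b model
19. `predict_google/codegemma-7b`: code completion output generated by google/codegemma-7b
20. `predicted_comment_google/codegemma-7b`: code comment generated by google/codegemma-7b
21. `error_codes_Qwen/CodeQwen1.5-7B`: error codes recorded for the output of Qwen/CodeQwen1.5-7B
22. `expert_accuracy_Qwen/CodeQwen1.5-7B`: expert-assessed accuracy of the output of Qwen/CodeQwen1.5-7B
23. `error_codes_bigcode/starcoder2-7b`: error codes recorded for the output of bigcode/starcoder2-7b
24. `expert_accuracy_bigcode/starcoder2-7b`: expert-assessed accuracy of the output of bigcode/starcoder2-7b
25. `error_codes_google/codegemma-7b`: error codes recorded for the output of google/codegemma-7b
26. `expert_accuracy_google/codegemma-7b`: expert-assessed accuracy of the output of google/codegemma-7b
27. `error_codes_ibm-granite/granite-8b-code-base`: error codes recorded for the output of ibm-granite/granite-8b-code-base
28. `expert_accuracy_ibm-granite/granite-8b-code-base`: expert-assessed accuracy of the output of ibm-granite/granite-8b-code-base
29. `error_codes_meta-llama/CodeLlama-7b-hf`: error codes recorded for the output of meta-llama/CodeLlama-7b-hf
30. `expert_accuracy_meta-llama/CodeLlama-7b-hf`: expert-assessed accuracy of the output of meta-llama/CodeLlama-7b-hf

### Split details

Every language configuration contains only a training (`train`) split. The figures per configuration are as follows, with all sizes in bytes:

1. Chinese: train split of 21642567 bytes and 500 examples; download size 8934584; total dataset size 21642567
2. Dutch: train split of 24071239 bytes and 500 examples; download size 9164593; total dataset size 24071239
3. English: train split of 20538377 bytes and 500 examples; download size 8127065; total dataset size 20538377
4. Greek: train split of 25646308 bytes and 500 examples; download size 9167874; total dataset size 25646308
5. Polish: train split of 17774200 bytes and 500 examples; download size 7229968; total dataset size 17774200

### Data file configuration

Each of the five language configurations maps its train split to data files matching the path pattern `{language name}/train-*`, where the language name is one of Chinese, Dutch, English, Greek, or Polish.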
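Because the per-model fields above all follow a `{prefix}_{model id}` naming pattern, their column names can be generated instead of typed out. The sketch below is illustrative only: `model_columns`, `all_columns`, and `comment_pair` are hypothetical helpers, and `row` is assumed to be a plain mapping from column name to string, such as a single record of a loaded split.

```python
# Model ids spelled exactly as in the dataset metadata.
MODELS = (
    "Qwen/CodeQwen1.5-7B",
    "bigcode/starcoder2-7b",
    "ibm-granite/granite-8b-code-base",
    "meta-llama/CodeLlama-7b-hf",
    "google/codegemma-7b",
)
BASE_COLUMNS = ("file_id", "content", "repo", "path", "original_comment")
PER_MODEL_PREFIXES = (
    "masked_data",
    "predict",
    "predicted_comment",
    "error_codes",
    "expert_accuracy",
)


def model_columns(model: str) -> dict:
    """Map each per-model prefix to its full column name for one model."""
    return {prefix: f"{prefix}_{model}" for prefix in PER_MODEL_PREFIXES}


def all_columns() -> list:
    """List every column name; the order is illustrative, not the stored order."""
    columns = list(BASE_COLUMNS)
    for model in MODELS:
        columns.extend(model_columns(model).values())
    return columns


def comment_pair(row, model: str) -> tuple:
    """Return (original comment, model-predicted comment) for one record."""
    return row["original_comment"], row[model_columns(model)["predicted_comment"]]
```

Five base columns plus five prefixes for each of the five models account for the 30 fields in every configuration.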

Provided by:

AISE-TUDelft
