Replication Package of the paper "Large Language Models for Multilingual Code Generation: A Benchmark and a Study on Code Quality"

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/15024202

Resource description:

Large Language Models for Multilingual Code Generation: A Benchmark and a Study on Code Quality

Abstract

Having been trained in the wild, Large Language Models (LLMs) may suffer from different types of bias. As shown in previous studies outside software engineering, this includes a language bias, i.e., these models perform differently depending on the language used for the query/prompt. However, so far the impact of language bias on source code generation has not been thoroughly investigated. Therefore, in this paper, we study the influence of the language adopted in the prompt on the quality of the source code generated by three LLMs, specifically GPT, Claude, and DeepSeek. We consider 230 coding tasks for Python and 230 for Java, and translate their related prompts into four languages: Chinese, Hindi, Spanish, and Italian. After generating the code, we measure code quality in terms of passed tests, code metrics, warnings generated by static analysis tools, and language used for the identifiers. Results indicate that (i) source code generated from the English queries is not necessarily better in terms of passed tests and quality metrics, (ii) the quality for different languages varies depending on the programming language and LLM being used, and (iii) the generated code tends to contain a mix of comments and literals written in English and in the language used to formulate the prompt.

Replication Package

This replication package is organized into two main directories: data and scripts. The data directory contains all the data used in the analysis, including prompts and final results. The scripts directory contains all the Python scripts used for code generation and analysis.

Data

The data directory contains five subdirectories, each corresponding to a stage in the analysis pipeline. These are enumerated to reflect the order of the process:

prompt_translation: Contains files with manually translated prompts for each language.
Each file is associated with both Python and Java. The structure of each file is as follows:

  id: The ID of the query in the CoderEval benchmark.
  prompt: The original English prompt.
  summary: The original summary.
  code: The original code.
  translation: The translation generated by GPT.
  correction: The manual correction of the GPT-generated translation.
  correction_tag: A list of tags indicating the corrections made to the translation.
  generated_code: This column is initially empty and will contain the code generated from the translated prompt.

generation: Contains the code generated by the three LLMs for each programming language and natural language. Each subdirectory (e.g., java_chinese_claude) contains the following:

  files: The files with the generated code (named by the query ID).
  report: Reports generated by static analysis tools.
  A CSV file (e.g., java_chinese_claude.csv) containing the generated code in the corresponding column.

tests: Contains input files for the testing process and the results of the tests. Files in the input_files directory are formatted according to the CoderEval benchmark requirements. The results directory holds the output of the testing process.

qualitative_analysis: Contains files used for the qualitative analysis:

  CohenKappaagreement.csv: A file containing the subset used to compute Cohen's kappa metrics for manual analysis.
  files: Contains all files for the qualitative analysis. Each file has the following columns:

    id: The ID of the query in the CoderEval benchmark.
    generated_code: The code generated by the model.
    comments: The language used for comments.
    identifiers: The language used for identifiers.
    literals: The language used for literals.
    notes: Additional notes.

ablation_study: Contains files for the ablation study. Each file has the following columns:

  id: The ID of the query in the CoderEval benchmark.
  prompt: The prompt used for code generation.
  generated_code, comments, identifiers, and literals: Same as in the qualitative analysis.

Files prefixed with italian contain prompts with signatures and docstrings translated into Italian. The system prompt used is the same as the initial one (see the paper). Files with the english prefix have prompts with the original signature (in English) and the docstring in Italian. The system prompt differs as follows:

You are an AI that only responds with Python code. You will be given a function signature and its docstring by the user. Write your full implementation (restate the function signature). Use a Python code block to write your response. Comments and identifiers must be in Italian. For example:

```python
print("Hello World!")
```

Scripts

The scripts directory contains all the scripts used to perform the generations and analyses. All files are commented. Here is a brief description of each file:

code_generation.py: This script automates code generation using AI models (GPT, DeepSeek, and Claude) for different programming and natural languages. It reads prompts from CSV files, generates code based on the prompts, and saves the results in structured directories. It logs the process, handles errors, and stores the generated code in separate files for each iteration.

computeallanalysis.py: This script performs static code analysis on generated code files for each model, natural language, and programming language. Which analyses run (Flake8, Pylint, Lizard) depends on the programming language: for Python, it runs all three, while for Java, only Lizard is executed. The results are stored in dedicated report directories for each iteration. The script creates the necessary directories and handles any errors that occur during the analysis process.

createtestjava.py: This script processes Java code generated by different models and languages, extracting methods using a JavaParser server.
It iterates through multiple iterations of generated code, extracts the relevant method code (or uses the full code if no method is found), and stores the results in a JSONL file for each language and model combination.

deepseek_model.py: This function sends a request to the DeepSeek API, passing a system and user prompt, and extracts the generated code snippet based on the specified programming language. It prints the extracted code in blue to the console, and if any errors occur during the request or extraction, it prints an error message in red. If successful, it returns the extracted code snippet; otherwise, it returns None.

extractpmdreport.py: This script processes PMD analysis reports in SARIF format and converts them into CSV files. It extracts the contents of ZIP files containing the PMD reports, parses the SARIF file to gather analysis results, and saves the findings in a CSV file. The output includes details such as file names, rules, messages, and the count of issues found. The script iterates through multiple languages, models, and iterations, ensuring that PMD reports are properly processed and saved for each combination.

flake_analysis.py: The flake_analysis function runs Flake8 to analyze Python files for errors and generates a CSV report summarizing the results. It processes the output, extracting error details such as filenames, error codes, and messages. The errors are grouped by file and saved in a CSV file for easy review.

generatepredictionclaude_java.py: The generatecodefrom_prompt function processes a JSON file containing prompts, generates Java code using the Claude API, and saves the generated code to a new JSON file. It validates each prompt, ensures it's JSON-serializable, and sends it to the Claude API for code generation. If the generation is successful, the code is stored in a structured format, and the output is saved to a JSON file for further use.
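
The SARIF-to-CSV conversion described for extractpmdreport.py can be illustrated with a minimal sketch. It assumes the standard SARIF 2.1.0 layout (runs, results, ruleId, message.text, and the artifactLocation.uri of the first location). The function names sarif_to_rows and rows_to_csv, the column names, and the sample report are hypothetical and are not taken from the replication package.

```python
import csv
import io
from collections import Counter


def sarif_to_rows(sarif):
    """Flatten a parsed SARIF document into one row per (file, rule, message).

    Each row carries the number of times that finding occurs.
    """
    counts = Counter()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # Use the first reported location; fall back to an empty one.
            locations = result.get("locations") or [{}]
            uri = (
                locations[0]
                .get("physicalLocation", {})
                .get("artifactLocation", {})
                .get("uri", "")
            )
            rule = result.get("ruleId", "")
            message = result.get("message", {}).get("text", "").strip()
            counts[(uri, rule, message)] += 1
    return [
        {"file": uri, "rule": rule, "message": message, "count": n}
        for (uri, rule, message), n in sorted(counts.items())
    ]


def rows_to_csv(rows):
    """Serialize the rows to CSV text (hypothetical column names)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["file", "rule", "message", "count"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def _finding(uri, rule, text):
    """Build one SARIF result object for the made-up sample below."""
    return {
        "ruleId": rule,
        "message": {"text": text},
        "locations": [{"physicalLocation": {"artifactLocation": {"uri": uri}}}],
    }


# Made-up sample report, for illustration only.
SAMPLE = {
    "version": "2.1.0",
    "runs": [
        {
            "results": [
                _finding("A.java", "RuleX", "first issue"),
                _finding("A.java", "RuleX", "first issue"),
                _finding("B.java", "RuleY", "second issue"),
            ]
        }
    ],
}

if __name__ == "__main__":
    print(rows_to_csv(sarif_to_rows(SAMPLE)))
```

Counting identical (file, rule, message) triples is one way to obtain the "count of issues found" column; the actual script may aggregate differently.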

generatepredictionclaude_python.py: This code defines a function generatecodefrom_prompt that processes a JSON file containing prompts, generates Python code using the Claude API, and saves the generated code to a new JSON file. It handles invalid values and ensures all prompts are JSON-serializable before sending them for code generation. For each valid prompt, it generates Python code and stores the result with associated task IDs, prompts, and languages in a new JSON file.

gitactionjava.py: This script automates the process of performing static analysis using SonarCloud and PMD. It handles cleaning and copying files, commits changes to GitHub, waits for SonarCloud analysis, extracts SonarCloud reports, and downloads the latest PMD SARIF report. The process is repeated for multiple languages and models across several iterations.

gitactionpython.py: This script automates the workflow of running SonarCloud analysis for different programming languages and models. It performs the following steps: it cleans up the analysis directory, copies generated files to that directory, commits and pushes the changes to GitHub, waits for the analysis to complete, and then extracts the SonarCloud report for each iteration. The process repeats for multiple languages and models, allowing for systematic analysis and reporting.

gpt_mode.py: This script defines a function ask_Gpt() that sends a system and user prompt to the GPT-4 model to generate a code snippet. It then extracts the relevant code from the response using a helper function and prints it with color formatting for better visibility. The function returns the extracted code for further use.

javaparser.py: This JavaParserClient class provides an interface to interact with a Java parser server. It includes methods to convert GitHub URLs to raw content URLs, make HTTP requests, generate SHA-256 hashes, and extract Java code details such as class names, imports, and method signatures.
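
Both deepseek_model.py and gpt_mode.py are described as extracting a code snippet from the model response. A minimal sketch of such a helper follows, assuming the model answers with a Markdown code fence, as the system prompt requests. The name extract_code and the fallback to any fenced block are hypothetical; the helper actually shipped in the package may behave differently.

```python
import re

# Three backticks, built programmatically so this example can itself be fenced.
FENCE = "`" * 3


def extract_code(response, language):
    """Return the body of the first fenced block tagged with `language`.

    Falls back to the first fenced block of any kind, and returns None
    when the response contains no fenced block at all.
    """
    tagged = re.compile(
        FENCE + r"[ \t]*" + re.escape(language) + r"[ \t]*\n(.*?)" + FENCE,
        re.DOTALL | re.IGNORECASE,
    )
    match = tagged.search(response)
    if match:
        return match.group(1).strip()

    # Fallback: accept an untagged or differently tagged block.
    generic = re.search(FENCE + r"[^\n]*\n(.*?)" + FENCE, response, re.DOTALL)
    return generic.group(1).strip() if generic else None


if __name__ == "__main__":
    reply = "Here it is:\n" + FENCE + "python\ndef f():\n    return 1\n" + FENCE
    print(extract_code(reply, "python"))
```

Returning None when nothing is found mirrors the behaviour described for deepseek_model.py, which returns None on failure.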

lizard_analysis.py: This function performs a detailed analysis of Python and Java source code files using the Lizard tool. It walks through a directory, analyzes each .py and .java file, and collects key metrics such as lines of code (nloc), cyclomatic complexity (ccn), token count, parameter count, function length, and top nesting level. These metrics are aggregated for each file and stored in a CSV report, which is saved in the specified output directory.

pylint_analysis.py: This script runs a Pylint analysis on Python files within a specified directory or file, collects the results in JSON format, and processes the data to generate a CSV report. It groups errors by module, merges related symbols and messages, counts the occurrences of errors per module, and stores the results in a structured CSV file. The function also handles errors gracefully, ensuring that the user is notified if issues arise during the analysis or when reading the output.

sonarcloud_analysis.py: This code fetches project metrics from SonarCloud, processes the data into a structured format using pandas, and saves it to a CSV file. It gathers information like code complexity, code smells, vulnerabilities, and other key indicators. The sonarCloudExtraction function automates the entire process, from fetching the metrics to saving the results in a CSV.

utils.py: This code provides utility functions for handling code snippets, counting tokens, and extracting method signatures from text.

multilingual-script.R: R script that performs the statistical analyses addressing RQ1 and RQ2 and computes the inter-rater agreement for RQ3.
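
The package computes inter-rater agreement in R. For readers working in Python, Cohen's kappa for two raters can be sketched as below. This assumes the unweighted variant, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name and the example labels are hypothetical, and nothing is assumed about the layout of CohenKappaagreement.csv.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equally long sequences of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Made-up labels for the language of comments in six generated files.
    rater_1 = ["english", "english", "italian", "mixed", "italian", "english"]
    rater_2 = ["english", "italian", "italian", "mixed", "italian", "english"]
    print(round(cohen_kappa(rater_1, rater_2), 3))
```

In this made-up example the raters agree on five of six items and the chance agreement is 13/36, so kappa equals 17/23.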

Created:

2025-03-14
