Replication package of the paper "Do LLMs Provide Links to Code Similar to what they Generate? A Study with Gemini and Bing CoPilot"

Indexed from NIAID Data Ecosystem on 2026-05-02

Download link:

https://zenodo.org/record/13151631


Resource description:

Replication Package

This replication package contains the tools, data, and scripts needed to reproduce the results of our paper: "Do LLMs Provide Links to Code Similar to what they Generate? A Study with Gemini and Bing CoPilot". Below is a detailed description of the directory structure and the contents of this package.

Contents

The replication package is organized into two main directories:

- assets: Contains all .csv files used as input for the scripts and the output .csv files used to perform the manual and automated analyses for RQ1 and RQ2.
- script: Contains all scripts for RQ1 and RQ2.

In the following, we describe the content of each directory.

assets

This directory contains the tools and resources required for our study.

dataset: Contains the main datasets used in the study.
- annotationStore.csv: Input dataset for our analyses, originating from the CODESEARCHNET dataset.
- queries.csv: The queries used for the experiments, filtered from the CODESEARCHNET dataset. This file contains the following columns:
  - Language: Programming language of the query
  - Query: Query used for the experiment
  - GitHubUrl: GitHub URL related to a snippet that addresses the query
  - Relevance: Relevance of the linked GitHub snippet to the query

data: Contains the datasets and results of all analyses.
- queries.csv: General input queries. This file contains the following columns:
  - Language: Programming language of the query
  - Query: Query used for the snippet generation
  - Prompt: LLM prompt generated for the query as: You are a Senior developer. Then give me a code snippet about:
- queries_filled.csv: Similar to the previous file, but also containing the output produced by the LLM-based assistants. This file contains the following columns:
  - Language: Programming language of the query
  - Query: Query used for the snippet generation
  - Prompt: LLM prompt generated for the query as: You are a Senior developer.
    Then give me a code snippet about:
  - Notes: General notes that provide additional context or information about the query or prompt.
  - Gemini_Answer(n): The code snippets generated by Gemini.
  - Gemini(n): The external links provided by Gemini.
  - Prompt (repeated)
  - Note: Notes that provide additional context or information about the query or prompt.
  - Copilot_Answer(n): The code snippets generated by Bing-Copilot.
  - Copilot_Bing(n): The external links provided by Bing-Copilot.

copilot || gemini: Contains the data related to the specific LLM. These two subdirectories have the same internal structure.
- queries.csv: The queries_filled.csv file, filtered for the specific LLM.
- queries_noTrivial.csv: Contains only the queries with at least one nontrivial generated snippet.
- external_links.csv: External links extracted from the LLM's output.
- external_links_filled.csv: Snippets extracted from the external links.
  - index: Query ID
  - source: Snippet ID
  - url: Link URL
  - note: Notes that provide additional context or information about the query or prompt
  - code(n): The n-th code snippet extracted from the source

manual_analysis: Manual analysis results.
- manual_analysis.csv:
  - index: Query ID
  - query: Query used for the snippet generation
  - generatedsnippet(n): The n-th code snippet generated by the LLM-based assistant
  - trivial_1: Manual analysis of whether or not the snippet was trivial (validator 1)
  - trivial_2: Manual analysis of whether or not the snippet was trivial (validator 2)
  - trivial_final: Manual analysis of whether or not the snippet was trivial (final classification if there is a disagreement)
  - source: URL to analyze
  - sourcetype1: Type of the source (validator 1)
  - sourcetype2: Type of the source (validator 2)
  - sourcetypefinal: Type of the source (final classification if there is a disagreement)
  - relatedtoquery_1: Relevance of the link to the query (validator 1)
  - relatedtoquery_2: Relevance of the link to the query (validator 2)
  - relatedtoquery_final: Relevance of the link to the query (final classification if there is a disagreement)
  - relatedtosnippets_1: Relevance of the generated snippet to those in the link (validator 1)
  - relatedtosnippets_2: Relevance of the generated snippet to those in the link (validator 2)
  - relatedtosnippets_final: Relevance of the generated snippet to those in the link (final classification if there is a disagreement)
- manual_analysis_noTrivial.csv: Same as the previous file, but restricted to the queries with at least one nontrivial generated code snippet.

clone_detector: Output and intermediate files for clone detection with Copilot data.
- copilot_tokens || gemini_tokens: Contains the output of the tokenization of the generated code snippets and of the code snippets extracted from the external links.
- merged_llm_ext_link.csv: All possible pairs (Cartesian product) of the form (code snippet extracted from the external links, generated code snippet). This file is the input of the clone detection tool.
  - ID_query: Query ID
  - query: Query used for the snippet generation
  - language: Programming language of the query
  - generated_snippet: The code snippet generated by the LLM-based assistant
  - IDgensnippet: The index of the generated code snippet
  - LOCgensnippet: The number of lines of code of the generated code snippet
  - ID_source: Source ID
  - source: Source URL
  - source_snippet: Code snippet extracted from the source
  - IDsourcesnippet: ID of the code snippet extracted from the source
  - LOCsourcesnippet: The number of lines of code of the code snippet extracted from the source
  - note: Notes that provide additional context or information about the query or prompt
- clone_detection_output.csv: Contains the clone detection results.
  - ID_query: The index of the query
  - query: Query used for the snippet generation
  - language: The programming language of the query
  - generated_snippet: The code snippet generated by the LLM-based assistant
  - IDgensnippet: The index of the generated code snippet
  - LOCgensnippet: The number of lines of code of the generated code snippet
  - ID_source: Source ID
  - source: Source URL
  - source_snippet: Code snippet extracted from the source
  - IDsourcesnippet: ID of the code snippet extracted from the source
  - LOCsourcesnippet: The number of lines of code of the code snippet extracted from the source
  - note: Notes that provide additional context or information about the query or prompt
  - clone_detected: Boolean value that indicates whether a clone has been detected (1 = detected, 0 = not detected)
  - cloning_ratio: The ratio of the lines of code of the generated code snippet that have been detected as a clone in the code snippet extracted from the source
  - cloned_lines: The number of lines of code of the generated code snippet that have been detected as a clone in the code snippet extracted from the source

cosine_sim: Cosine similarity results.
- cosine_sim_output.csv: Contains the cosine similarity results.
  - query_id: Query ID
  - snippet_id: ID of the generated code snippet
  - source_id: ID of the source
  - sourcesnippetid: ID of the code snippet extracted from the source
  - cosine_similarity: The cosine similarity between the generated code snippet and the code snippet extracted from the source

quant_analysis: Quantitative analysis results.
- topN_links_se.csv: Contains the top-N links extracted from the search engine.
  - id: Query ID
  - query: The query
  - url: Link URL
- merged_clone_cosine.csv: Contains the merged results of the clone detection and cosine similarity.
  - ID_query: Query ID
  - query: The query
  - language: The programming language of the query
  - generated_snippet: The code snippet generated by the LLM-based assistant
  - IDgensnippet: The ID of the generated code snippet
  - LOCgensnippet: The number of lines of code of the generated code snippet
  - ID_source: The index of the source
  - source: The source URL
  - source_snippet: The code snippet extracted from the source
  - IDsourcesnippet: The index of the code snippet extracted from the source
  - LOCsourcesnippet: The number of lines of code of the code snippet extracted from the source
  - note: Notes that provide additional context or information about the query or prompt
  - clone_detected: Boolean value that indicates whether a clone has been detected (1 = detected, 0 = not detected)
  - cloning_ratio: The ratio of the lines of code of the generated code snippet that have been detected as a clone in the code snippet extracted from the source
  - cloned_lines: The number of lines of code of the generated code snippet that have been detected as a clone in the code snippet extracted from the source
  - cosine_similarity: The cosine similarity between the generated code snippet and the code snippet extracted from the source

other_analysis: Contains additional analyses.
- sample_queries.csv: Contains a sample of five queries per language, used to perform the chain-of-thought experiment.
  - Language: Programming language of the query
  - Query: Query used for the snippet generation
  - Prompt: LLM prompt generated for the query as: You are a Senior developer. Then give me a code snippet about:
- chain_of_thought.csv:
  - Language: Programming language of the query
  - Query: Query used for the snippet generation
  - Prompt: LLM prompt generated for the query as: You are a Senior developer. Then give me a code snippet about:
  - Clone_fonud: Boolean value that indicates whether a clone has been detected (Yes = detected, No = not detected)
  - Note: Notes that provide additional context or information about the performed analysis
- data_check.csv:
  - Link: URL of the source provided by the LLM
  - Post_date: Indicates whether the date of the post is before or after the training date of the LLM (before 2023, after 2023, not provided)
  - Note: Notes that provide additional context or information about the performed analysis

results: Final analysis results.
- jaccard_analysis.csv: Contains the results of the Jaccard analysis comparing the external links provided by the LLMs with the top-N links extracted from the corresponding search engine.
  - id: Query ID
  - language: The programming language of the query
  - llm_link: The external links provided by the LLM
  - llmlinksize: The number of external links provided by the LLM
  - overlap_links: The overlapping links between the LLM and the search engine
  - overlap_size: The number of overlapping links between the LLM and the search engine
  - nonoverlaplinks: The non-overlapping links between the LLM and the search engine
  - union_size: The size of the union of the links from the LLM and the search engine
  - jaccard: The Jaccard similarity between the LLM and the search engine
- merged_analysis.csv: Contains the merged results of the manual and quantitative analyses.
  - id: The index of the query
  - query: The query used for the experiment
  - trivial_final(n): The final assignment for the triviality of the n-th generated code snippet
  - source: The URL of the source
  - sourcetypefinal: The final assignment for the type of the source
  - relatedtoquery_final: The final assignment for the relevance of the generated code snippet to the query
  - relatedtosnippets_final: The final assignment for the relevance of the generated code snippet to the source
  - cloning_ratio: The maximum cloning ratio between the generated code snippet and all the code snippets extracted from the source
  - cosine_similarity: The cosine similarity related to the snippets with maximum cloning ratio

cccfindersw-configuration-files: Contains additional configuration files for the CCFinderSW clone detection tool. The files are javascript_comment.txt and javascript_reserved.txt. They must be placed in the tool's comment/ and reserved/ directories.

appendix.tex: The appendix of the paper, containing:
- Table 1: Number of links of different types provided by Gemini and Bing CoPilot

appendix.pdf: The appendix of the paper in PDF format.

script

This directory contains our scripts (mostly Python, plus an R script and an AppleScript) to preprocess data and run the clone detection analyses.

- 1_dateset_filtering.py: Script to filter the dataset. The input of this script is the annotationStore.csv file, and the output is the queries.csv file.
- 2_prompt_generation.py: Script to generate prompts for the LLM-based assistants. The input of this script is the queries.csv file, and the output is the queries_filled.csv file.
- 3_gen_sheet_sources_extraction.py: Script to split the external links provided by the LLM into one link per row. The input of this script is the queries_filled.csv file, and the output is the external_links.csv file.
- 4_ext_link_snippet_extraction.py: Script to extract the snippets from Web URLs. It only works for the most popular domains.
  The input of this script is the external_links.csv file, and the output is the external_links_filled.csv file.
- 5_top_n_link_SearchEngine.py: Script to perform top-N link search using the corresponding search engines (Google Search and Bing). The input of this script is the queries_filled.csv file. It executes browser_bot.scpt. The output is the topN_links_se.csv file.
- browser_bot.scpt: Script for browser automation (AppleScript).
- 6_se_vs_llm.py: Script to compare, using the Jaccard metric, the links returned by the corresponding search engines with those provided by the LLM-based assistants. The inputs of this script are the external_links.csv and topN_links_se.csv files. The output is the jaccard_analysis.csv file.
- 7_results_manual_analysis.py: Script to extract the results and statistical analyses performed on the manual analysis and reported in the tables of the paper.
- 8_merge_gen_source_snippets.py: Script that merges {llm}/queries.csv and {llm}/external_links_filled.csv into an expanded file in which each row holds one snippet extracted from the source. This file is the input of the clone detection script. The output is the merged_llm_ext_link.csv file.
- 9_clone_detection.py: Script to perform clone detection. The input of this script is the merged_llm_ext_link.csv file, and the output is the clone_detection_output.csv file.
- 10_cosine_sim_check.py: Script to compute the cosine similarity of the code snippets. The script takes as input the tokenized files from the {llm}_tokens directory. The output is the cosine_sim_output.csv file.
- 11_merger_clone_cosine.py: Script to merge the clone detection and cosine similarity results. The inputs of this script are the clone_detection_output.csv and cosine_sim_output.csv files, and the output is the merged_clone_cosine.csv file.
- 12_merge_manual_quantitative_analysis.py: Script to merge the manual and quantitative analysis results.
  The inputs of this script are the manual_analysis.csv and merged_clone_cosine.csv files, and the output is the merged_analysis.csv file.
- 13_sample_for_COT_analysis.py: Script to collect the sample of queries on which we perform the chain-of-thought analysis. The output is the sample_queries.csv file.
- 14_llm_vs_csn.py: Script to check the overlap between the links provided by the LLM and the one associated with the related query in the CodeSearchNet dataset. The inputs of this script are the queries.csv and queries_noTrivial.csv files.
- cloningGraph.R: R script to generate the cloning graph. The input of this script is the merged_analysis.csv file.
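
The Jaccard comparison stored in jaccard_analysis.csv can be illustrated with a minimal sketch. It assumes that links are compared as exact strings after trimming whitespace and a trailing slash; the normalization actually applied by 6_se_vs_llm.py is not described here. The function name and the example URLs are hypothetical.

```python
def jaccard_links(llm_links, se_links):
    """Compare two collections of URLs and return overlap statistics."""
    # Hypothetical normalization: strip whitespace and a trailing slash.
    llm = {url.strip().rstrip("/") for url in llm_links}
    se = {url.strip().rstrip("/") for url in se_links}
    overlap = llm & se
    union = llm | se
    return {
        "llmlinksize": len(llm),
        "overlap_size": len(overlap),
        "union_size": len(union),
        # Two empty sets are given similarity 0.0 to avoid dividing by zero.
        "jaccard": len(overlap) / len(union) if union else 0.0,
    }


stats = jaccard_links(
    ["https://example.com/a", "https://example.com/b/"],
    ["https://example.com/b", "https://example.com/c", "https://example.com/d"],
)
print(stats)
```

In this example one link out of four distinct links is shared, so the Jaccard value is 0.25.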
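
merged_llm_ext_link.csv is described as the Cartesian product of generated snippets and snippets extracted from the external links. A rough sketch of such a pairing for a single query follows. The in-memory layout, the 1-based IDs, and the definition of lines of code as non-blank lines are all assumptions, not details taken from 8_merge_gen_source_snippets.py.

```python
from itertools import product


def count_loc(code):
    """Count non-blank lines (an assumed definition of lines of code)."""
    return sum(1 for line in code.splitlines() if line.strip())


def pair_snippets(generated, extracted):
    """Build one row per (generated snippet, source snippet) pair."""
    rows = []
    for (gen_id, gen_code), (src_id, src_code) in product(
        enumerate(generated, start=1), enumerate(extracted, start=1)
    ):
        rows.append({
            "IDgensnippet": gen_id,
            "generated_snippet": gen_code,
            "LOCgensnippet": count_loc(gen_code),
            "IDsourcesnippet": src_id,
            "source_snippet": src_code,
            "LOCsourcesnippet": count_loc(src_code),
        })
    return rows


# Hypothetical data: two generated snippets and three source snippets.
pairs = pair_snippets(
    ["print('a')\n", "x = 1\n\ny = 2\n"],
    ["print('a')", "z = 3", "pass"],
)
print(len(pairs))
```

Two generated snippets and three source snippets yield six rows, one per pair.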
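
The column descriptions suggest that cloning_ratio is cloned_lines divided by LOCgensnippet, and that merged_analysis.csv keeps, for each source, the pair with the maximum ratio together with its cosine similarity. The sketch below works under that assumption; the function names and the row values are made up for illustration.

```python
def cloning_ratio(cloned_lines, loc_gen_snippet):
    """Fraction of the generated snippet's lines detected as cloned."""
    if loc_gen_snippet <= 0:
        return 0.0
    return cloned_lines / loc_gen_snippet


def max_ratio_per_source(rows):
    """Keep, for each (ID_query, ID_source), the row with the highest ratio."""
    best = {}
    for row in rows:
        key = (row["ID_query"], row["ID_source"])
        ratio = cloning_ratio(row["cloned_lines"], row["LOCgensnippet"])
        if key not in best or ratio > best[key]["cloning_ratio"]:
            best[key] = {**row, "cloning_ratio": ratio}
    return best


# Hypothetical rows in the shape of merged_clone_cosine.csv.
rows = [
    {"ID_query": 1, "ID_source": 1, "LOCgensnippet": 10,
     "cloned_lines": 4, "cosine_similarity": 0.61},
    {"ID_query": 1, "ID_source": 1, "LOCgensnippet": 8,
     "cloned_lines": 6, "cosine_similarity": 0.83},
    {"ID_query": 1, "ID_source": 2, "LOCgensnippet": 10,
     "cloned_lines": 0, "cosine_similarity": 0.12},
]
summary = max_ratio_per_source(rows)
print(summary[(1, 1)]["cloning_ratio"], summary[(1, 1)]["cosine_similarity"])
```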
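
The cosine_similarity column compares a generated snippet with a snippet extracted from a source. How 10_cosine_sim_check.py builds its vectors from the tokenized files is not specified here, so the following sketch assumes plain token-frequency vectors. The token lists are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token-frequency vectors."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # Only tokens present in both snippets contribute to the dot product.
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


generated = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
source = ["def", "add", "(", "x", ",", "y", ")", ":", "return", "x", "+", "y"]
print(round(cosine_similarity(generated, source), 3))
```

The two token lists differ only in identifier names, which is enough to halve the score under this representation. This illustrates why a token-level cosine measure and a clone detector can disagree on the same pair.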

Created:

2025-01-17