Replication package of the paper "Do LLMs Provide Links to Code Similar to what they Generate? A Study with Gemini and Bing CoPilot"
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/13151631
Resource description:
Replication Package
This replication package contains the necessary tools, data, and scripts for reproducing the results of our paper: "Do LLMs Provide Links to Code Similar to what they Generate? A Study with Gemini and Bing CoPilot". Below is a detailed description of the directory structure and the contents of this package.
Contents
The replication package is organized into two main directories:
assets: This directory contains all .csv files used as input for the scripts and the output .csv files used to perform the manual and automated analyses for RQ1 and RQ2.
script: This directory contains all scripts for RQ1 and RQ2.
In the following, we describe the content of each directory:
assets
This directory contains the tools and resources required for our study.
dataset: Contains the main datasets used in the study.
annotationStore.csv: Input dataset for our analyses, originating from the CODESEARCHNET dataset.
queries.csv: .csv file containing the queries used for the experiments, filtered from the CODESEARCHNET dataset. This file contains the following columns:
Language: Programming language of the query
Query: Query used for the experiment
GitHubUrl: GitHub URL related to a snippet that addresses the query
Relevance: Relevance of the linked GitHub snippet to the query
data: Contains the datasets and results of all analyses.
queries.csv: General input queries. This file contains the following columns:
Language: Programming language of the query
Query: Query used for the snippet generation
Prompt: LLM prompt generated for the query as: You are a Senior developer. Then give me a code snippet about:
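As a minimal sketch of how the Prompt column could be derived from the Query column, the snippet below appends each query to the prefix quoted above. The helper names `build_prompt` and `add_prompts` are hypothetical, and the template actually used by 2_prompt_generation.py may contain fields not shown in this description.

```python
# Prefix quoted verbatim from this description; the real template used by
# 2_prompt_generation.py may differ (assumption).
PROMPT_PREFIX = "You are a Senior developer. Then give me a code snippet about:"


def build_prompt(query: str) -> str:
    """Append the query to the fixed prefix (hypothetical helper)."""
    return f"{PROMPT_PREFIX} {query.strip()}"


def add_prompts(rows):
    """Return copies of the input rows with a 'Prompt' column added."""
    return [{**row, "Prompt": build_prompt(row["Query"])} for row in rows]


if __name__ == "__main__":
    example = [{"Language": "Python", "Query": "read a csv file"}]
    for row in add_prompts(example):
        print(row["Prompt"])
```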
queries_filled.csv: Similar to the previous file, but also containing the output produced by the LLM-based assistants. This file contains the following columns:
Language: Programming language of the query
Query: Query used for the snippet generation
Prompt: LLM prompt generated for the query as: You are a Senior developer. Then give me a code snippet about:
Notes: General notes that provide additional context or information about the query or prompt.
Gemini_Answer(n): The generated code snippets by Gemini.
Gemini(n): The external links provided by Gemini.
Prompt (repeated)
Note: Notes that provide additional context or information about the query or prompt.
Copilot_Answer(n): The generated code snippets by Bing-Copilot.
Copilot_Bing(n): The external links provided by Bing-Copilot.
copilot || gemini: Contains the data related to the specific LLM. These two subdirectories have the same internal structure.
queries.csv: The queries_filled.csv file, filtered for the specific LLM.
queries_noTrivial.csv: Contains only the queries with at least one nontrivial generated snippet.
external_links.csv: External links extracted from the LLMs output.
external_links_filled.csv: Snippets extracted from the external links.
index: Query ID
source: Snippet ID
url: Link URL
note: Notes that provide additional context or information about the query or prompt
code(n): The n-th code snippet extracted from the source
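The step that turns the link columns of queries_filled.csv into one row per link can be sketched as follows. This is an illustration only: the column names passed in (`Gemini1`, `Gemini2`) are hypothetical, and it assumes a cell may hold several URLs separated by whitespace, which the package does not state.

```python
import re

# Simple URL pattern; the extraction rule of the real script is not documented.
URL_RE = re.compile(r"https?://\S+")


def split_links(rows, link_columns):
    """Emit one output row per URL found in the given link columns.

    'index' is the position of the query row and 'source' numbers the
    links of that query, mirroring the columns of external_links.csv.
    """
    out = []
    for index, row in enumerate(rows):
        source = 0
        for col in link_columns:
            for url in URL_RE.findall(row.get(col) or ""):
                # Drop trailing punctuation picked up from surrounding prose.
                out.append({"index": index, "source": source,
                            "url": url.rstrip(".,;)")})
                source += 1
    return out


if __name__ == "__main__":
    rows = [{"Gemini1": "https://example.com/a https://example.org/b."}]
    for link in split_links(rows, ["Gemini1"]):
        print(link)
```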
manual_analysis: Manual analysis results.
manual_analysis.csv:
index: Query ID
query: Query used for the snippet generation
generatedsnippet(n): The n-th code snippet generated by the LLM-based assistant
trivial_1: Manual analysis of whether or not the snippet was trivial (validator 1)
trivial_2: Manual analysis of whether or not the snippet was trivial (validator 2)
trivial_final: Manual analysis of whether or not the snippet was trivial (final classification if there is a disagreement)
source: URL to analyze
sourcetype1: Type of the source (validator 1)
sourcetype2: Type of the source (validator 2)
sourcetypefinal: Type of the source (final classification if there is a disagreement)
relatedtoquery_1: Relevance of the link to the query (validator 1)
relatedtoquery_2: Relevance of the link to the query (validator 2)
relatedtoquery_final: Relevance of the link to the query (final classification if there is a disagreement)
relatedtosnippets_1: Relevance of the generated snippet to those in the link (validator 1)
relatedtosnippets_2: Relevance of the generated snippet to those in the link (validator 2)
relatedtosnippets_final: Relevance of the generated snippet to those in the link (final classification if there is a disagreement)
manual_analysis_noTrivial.csv: As in the previous file, but only the queries with at least one nontrivial generated code snippet.
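The "at least one nontrivial generated snippet" rule can be illustrated with the sketch below. It assumes the final triviality labels have already been parsed into booleans (True = trivial) with None for a missing snippet; the encoding actually used in the .csv files is not specified here, and the function names are hypothetical.

```python
def has_nontrivial_snippet(trivial_flags):
    """True if at least one snippet was explicitly labeled as not trivial."""
    return any(flag is False for flag in trivial_flags)


def filter_no_trivial(rows):
    """Keep only the rows whose query has a nontrivial generated snippet.

    Each row is a dict with a 'trivial_final' list holding one label per
    generated snippet (assumed layout).
    """
    return [row for row in rows if has_nontrivial_snippet(row["trivial_final"])]


if __name__ == "__main__":
    rows = [
        {"index": 0, "trivial_final": [True, True]},
        {"index": 1, "trivial_final": [True, False]},
        {"index": 2, "trivial_final": [None]},
    ]
    print([row["index"] for row in filter_no_trivial(rows)])
```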
clone_detector: Output and intermediate files for clone detection with Copilot data.
copilot_tokens || gemini_tokens: Contains the output of the tokenization of the generated code snippets and of the code snippets extracted from the external links.
merged_llm_ext_link.csv: All possible pairs (Cartesian product) of the form (code snippet extracted from the external links, generated code snippet). This file is the input of the clone detection tool.
ID_query: Query ID
query: Query used for the snippet generation
language: Programming language of the query
generated_snippet: The generated code snippet by the LLM-based assistant
IDgensnippet: The index of the generated code snippet
LOCgensnippet: The number of lines of code of the generated code snippet
ID_source: Source ID
source: Source URL
source_snippet: Code snippet extracted from the source
IDsourcesnippet: ID of the code snippet extracted from the source
LOCsourcesnippet: The number of lines of code of the code snippet extracted from the source
note: Notes that provide additional context or information about the query or prompt
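A minimal sketch of how such a Cartesian product could be built for one query is shown below. The function names and the input layout are hypothetical, and counting non-blank lines is only one possible definition of LOC; the definition used by 8_merge_gen_source_snippets.py is not documented here.

```python
from itertools import product


def count_loc(snippet: str) -> int:
    """Count non-blank lines (assumed LOC definition)."""
    return sum(1 for line in snippet.splitlines() if line.strip())


def pair_snippets(query_id, generated, sources):
    """Pair every generated snippet with every snippet of every source.

    generated: list of generated snippets (strings).
    sources:   list of (url, [snippets extracted from that url]).
    """
    source_snippets = [
        (source_id, url, j, text)
        for source_id, (url, snippets) in enumerate(sources)
        for j, text in enumerate(snippets)
    ]
    rows = []
    for (i, gen), (source_id, url, j, src) in product(
        list(enumerate(generated)), source_snippets
    ):
        rows.append({
            "ID_query": query_id,
            "generated_snippet": gen,
            "IDgensnippet": i,
            "LOCgensnippet": count_loc(gen),
            "ID_source": source_id,
            "source": url,
            "source_snippet": src,
            "IDsourcesnippet": j,
            "LOCsourcesnippet": count_loc(src),
        })
    return rows
```

With two generated snippets and three extracted snippets, the function returns six rows, one per pair.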
clone_detection_output.csv: Contains the clone detection results.
ID_query: The index of the query
query: Query used for the snippet generation
language: The programming language of the query
generated_snippet: The generated code snippet by the LLM-based assistant
IDgensnippet: The index of the generated code snippet
LOCgensnippet: The number of lines of code of the generated code snippet
ID_source: Source ID
source: Source URL
source_snippet: Code snippet extracted from the source
IDsourcesnippet: ID of the code snippet extracted from the source
LOCsourcesnippet: The number of lines of code of the code snippet extracted from the source
note: Notes that provide additional context or information about the query or prompt
clone_detected: Boolean value that indicates whether a clone has been detected (1 = detected, 0 = not detected)
cloning_ratio: Ratio of the lines of code of the generated code snippet that have been detected as a clone in the code snippet extracted from the source
cloned_lines: The number of lines of code of the generated code snippet that have been detected as a clone in the code snippet extracted from the source
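The relationship between the three clone columns can be sketched as below, reading cloning_ratio as cloned_lines divided by LOCgensnippet. This reading follows the column descriptions but is an assumption; the exact computation in 9_clone_detection.py may differ, and `cloning_stats` is a hypothetical helper.

```python
def cloning_stats(cloned_lines: int, loc_gen_snippet: int) -> dict:
    """Derive clone_detected and cloning_ratio from raw line counts."""
    if loc_gen_snippet <= 0:
        raise ValueError("the generated snippet must have at least one line")
    if not 0 <= cloned_lines <= loc_gen_snippet:
        raise ValueError("cloned_lines must lie between 0 and LOCgensnippet")
    return {
        "clone_detected": 1 if cloned_lines > 0 else 0,
        "cloning_ratio": cloned_lines / loc_gen_snippet,
        "cloned_lines": cloned_lines,
    }


if __name__ == "__main__":
    print(cloning_stats(cloned_lines=5, loc_gen_snippet=20))
```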
cosine_sim: Cosine similarity results.
cosine_sim_output.csv: Contains the cosine similarity results
query_id: Query ID
snippet_id: ID of the generated code snippet
source_id: ID of the source
sourcesnippetid: ID of the code snippet extracted from the source
cosine_similarity: The cosine similarity between the generated code snippet and the code snippet extracted from the source
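For readers unfamiliar with the metric, the sketch below computes cosine similarity over token-frequency vectors of two tokenized snippets. The bag-of-tokens representation is an assumption made for illustration; the vectorization used by 10_cosine_sim_check.py is not described in this package overview.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b) -> float:
    """Cosine similarity between two token lists (bag-of-tokens vectors)."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the tokens that occur in both snippets.
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty snippet is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    generated = ["def", "f", "(", "x", ")", ":", "return", "x"]
    extracted = ["def", "g", "(", "y", ")", ":", "return", "y"]
    print(round(cosine_similarity(generated, extracted), 3))
```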
quant_analysis: Quantitative analysis results.
topN_links_se.csv: Contains the top-N links extracted from the search engine.
id: Query ID
query: The query
url: Link URL
merged_clone_cosine.csv: Contains the merged results of the clone detection and cosine similarity.
ID_query: Query ID
query: The query
language: The programming language of the query
generated_snippet: The generated code snippet by the LLM-based assistant
IDgensnippet: The ID of the generated code snippet
LOCgensnippet: The number of lines of code of the generated code snippet
ID_source: The index of the source
source: The source URL
source_snippet: The code snippet extracted from the source
IDsourcesnippet: The index of the code snippet extracted from the source
LOCsourcesnippet: The number of lines of code of the code snippet extracted from the source
note: Notes that provide additional context or information about the query or prompt
clone_detected: Boolean value that indicates whether a clone has been detected (1 = detected, 0 = not detected)
cloning_ratio: Ratio of the lines of code of the generated code snippet that have been detected as a clone in the code snippet extracted from the source
cloned_lines: The number of lines of code of the generated code snippet that have been detected as a clone in the code snippet extracted from the source
cosine_similarity: The cosine similarity between the generated code snippet and the code snippet extracted from the source
other_analysis: Contains the additional analyses we performed.
sample_queries.csv: Contains a sample of five queries per language, used to perform the chain-of-thought experiment.
Language: Programming language of the query
Query: Query used for the snippet generation
Prompt: LLM prompt generated for the query as: You are a Senior developer. Then give me a code snippet about:
chain_of_thought.csv:
Language: Programming language of the query
Query: Query used for the snippet generation
Prompt: LLM prompt generated for the query as: You are a Senior developer. Then give me a code snippet about:
Clone_fonud: Boolean value that indicates if a clone has been detected (Yes = detected, No = not detected)
Note: Notes that provide additional context or information about the performed analysis
data_check.csv:
Link: URL of the source provided by the LLM
Post_date: Indicates if the date of the post is before/after the date of training of the LLM (before 2023, after 2023, not provided)
Note: Notes that provide additional context or information about the performed analysis
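The three Post_date labels can be illustrated with the sketch below. The helper `classify_post_date` is hypothetical, and treating any date in 2023 or later as "after 2023" is an assumption, since the description does not say how posts dated within 2023 were labeled.

```python
from datetime import date
from typing import Optional

CUTOFF_YEAR = 2023  # year named in the Post_date labels


def classify_post_date(post_date: Optional[date]) -> str:
    """Map a post date to one of the three Post_date labels."""
    if post_date is None:
        return "not provided"
    # Assumption: posts dated in 2023 itself count as "after 2023".
    return "before 2023" if post_date.year < CUTOFF_YEAR else "after 2023"


if __name__ == "__main__":
    for d in (date(2019, 5, 1), date(2024, 1, 15), None):
        print(d, "->", classify_post_date(d))
```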
results: Final analysis results.
jaccard_analysis.csv: Contains the results of the Jaccard analysis comparing the external links provided by the LLMs with the top-N links extracted from the corresponding search engine.
id: Query ID
language: The programming language of the query
llm_link: The external links provided by the LLM
llmlinksize: The number of external links provided by the LLM
overlap_links: The overlapping links between the LLM and the search engine
overlap_size: The number of overlapping links between the LLM and the search engine
nonoverlaplinks: The non-overlapping links between the LLM and the search engine
union_size: The size of the union of the link sets of the LLM and the search engine
jaccard: The Jaccard similarity between the LLM and the search engine
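The columns above follow from standard set operations, as in the sketch below. It compares URLs as exact strings and reads nonoverlaplinks as the symmetric difference of the two sets; both choices are assumptions, since any URL normalization applied by 6_se_vs_llm.py is not described here.

```python
def jaccard_row(llm_links, se_links) -> dict:
    """Compute the overlap, union, and Jaccard similarity of two link lists."""
    llm, se = set(llm_links), set(se_links)
    overlap = llm & se
    union = llm | se
    return {
        "llmlinksize": len(llm),
        "overlap_links": sorted(overlap),
        "overlap_size": len(overlap),
        # Assumption: links that appear on only one of the two sides.
        "nonoverlaplinks": sorted(llm ^ se),
        "union_size": len(union),
        # Jaccard = |intersection| / |union|; defined as 0 when both are empty.
        "jaccard": len(overlap) / len(union) if union else 0.0,
    }


if __name__ == "__main__":
    llm = ["https://example.com/a", "https://example.com/b"]
    se = ["https://example.com/b", "https://example.com/c"]
    print(jaccard_row(llm, se)["jaccard"])
```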
merged_analysis.csv: Contains the merged results of the manual and quantitative analyses.
id: The index of the query
query: The query used for the experiment
trivial_final(n): The final assignment for the triviality of the n-th generated code snippet
source: The URL of the source
sourcetypefinal: The final assignment for the type of the source
relatedtoquery_final: The final assignment for the relevance of the link to the query
relatedtosnippets_final: The final assignment for the relevance of the generated code snippet to the source
cloning_ratio: The maximum cloning ratio between the generated code snippet and all the code snippets extracted from the source
cosine_similarity: The cosine similarity related to the snippets with maximum cloning ratio
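The selection of the maximum cloning ratio per source, together with the cosine similarity of that same pair, could look like the sketch below. Grouping by (ID_query, ID_source) is an assumption about the granularity used by 12_merge_manual_quantitative_analysis.py, and ties are resolved here by keeping the first pair seen.

```python
def best_pair_per_source(rows) -> dict:
    """For each (query, source), keep the pair with the highest cloning ratio.

    rows: dicts with the keys ID_query, ID_source, cloning_ratio, and
    cosine_similarity, as in merged_clone_cosine.csv.
    """
    best = {}
    for row in rows:
        key = (row["ID_query"], row["ID_source"])
        if key not in best or row["cloning_ratio"] > best[key]["cloning_ratio"]:
            best[key] = row
    return {
        key: {
            "cloning_ratio": row["cloning_ratio"],
            # Cosine similarity of the pair that achieved the maximum ratio.
            "cosine_similarity": row["cosine_similarity"],
        }
        for key, row in best.items()
    }


if __name__ == "__main__":
    pairs = [
        {"ID_query": 1, "ID_source": 0, "cloning_ratio": 0.2, "cosine_similarity": 0.9},
        {"ID_query": 1, "ID_source": 0, "cloning_ratio": 0.6, "cosine_similarity": 0.7},
    ]
    print(best_pair_per_source(pairs))
```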
cccfindersw-configuration-files:
Contains additional configuration files for the CCFinderSW clone detection tool. The files are javascript_comment.txt and javascript_reserved.txt. They must be placed in the tool's comment/ and reserved/ directories.
appendix.tex: The appendix of the paper containing:
Table 1: Number of links of different types provided by Gemini and Bing CoPilot
appendix.pdf: The appendix of the paper in PDF format.
script
This directory contains our scripts (mostly Python, plus one R script and one AppleScript) to preprocess data and run the clone detection analyses.
1_dateset_filtering.py: Script to filter the dataset. The input of this script is the annotationStore.csv file, and the output is the queries.csv file.
2_prompt_generation.py: Script to generate prompts for the LLM-based assistants. The input of this script is the queries.csv file, and the output is the queries_filled.csv file.
3_gen_sheet_sources_extraction.py: Script to split the external links provided by the LLM, one per row. The input of this script is the queries_filled.csv file, and the output is the external_links.csv file.
4_ext_link_snippet_extraction.py: Script to extract the snippets from Web URLs. It only works for the most popular domains. The input of this script is the external_links.csv file, and the output is the external_links_filled.csv file.
5_top_n_link_SearchEngine.py: Script to perform top-N link search using the corresponding search engines (Google Search and Bing). The input of this script is the queries_filled.csv file. It executes the browser_bot.scpt. The output is the topN_links_se.csv file.
browser_bot.scpt: Script for browser automation (AppleScript).
6_se_vs_llm.py: Script to compare (using the Jaccard metric) the links returned by the corresponding search engines with those provided by the LLM-based assistants. The inputs of this script are the external_links.csv file and the topN_links_se.csv file. The output is the jaccard_analysis.csv file.
7_results_manual_analysis.py: Script to extract results and statistical analyses performed on the manual analysis and reported in the tables in the paper.
8_merge_gen_source_snippets.py: Script to merge {llm}/queries.csv and {llm}/external_links_filled.csv into an expanded file in which each row holds one snippet extracted from a source. The output is the merged_llm_ext_link.csv file, which is the input of the clone detection script.
9_clone_detection.py: Script to perform clone detection. The input of this script is the merged_llm_ext_link.csv file, and the output is the clone_detection_output.csv file.
10_cosine_sim_check.py: Script to compute the cosine similarity between code snippets. The script takes as input the tokenized files from the {llm}_tokens directory. The output is the cosine_sim_output.csv file.
11_merger_clone_cosine.py: Script to merge clone detection and cosine similarity results. The input of this script is the clone_detection_output.csv and the cosine_sim_output.csv files, and the output is the merged_clone_cosine.csv file.
12_merge_manual_quantitative_analysis.py: Script to merge manual and quantitative analysis results. The inputs of this script are the manual_analysis.csv and the merged_clone_cosine.csv files, and the output is the merged_analysis.csv file.
13_sample_for_COT_analysis.py: Script to collect the sample of queries on which we perform the chain-of-thought analysis. The output is the sample_queries.csv file.
14_llm_vs_csn.py: Script to check the overlap between the links provided by the LLM and the ones associated with the related query in the CodeSearchNet dataset. The inputs of this script are the queries.csv and queries_noTrivial.csv files.
cloningGraph.R: R script to generate the cloning graph. The input of this script is the merged_analysis.csv file.
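Because each numbered script consumes the output of an earlier one, the numeric prefixes appear to encode the execution order. The sketch below orders file names by that prefix, which avoids the pitfall of plain alphabetical sorting placing 10_ before 2_. The helper names are hypothetical, and the sketch only orders names; it does not run the scripts, whose command-line arguments are not documented here.

```python
import re


def numeric_prefix(filename: str):
    """Return the leading step number of a script name, or None if absent."""
    match = re.match(r"(\d+)_", filename)
    return int(match.group(1)) if match else None


def pipeline_order(filenames):
    """Sort the numbered scripts by step number, skipping unnumbered files."""
    numbered = [name for name in filenames if numeric_prefix(name) is not None]
    return sorted(numbered, key=numeric_prefix)


if __name__ == "__main__":
    names = ["10_cosine_sim_check.py", "2_prompt_generation.py",
             "cloningGraph.R", "1_dateset_filtering.py"]
    for name in pipeline_order(names):
        print(name)
```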
Created:
2025-01-17



