A CSV dataset for software code clones testing and the code to test them
Indexed by NIAID Data Ecosystem on 2026-05-02
Download link:
https://zenodo.org/record/14926555
Description:
This is a benchmark dataset based on BigCloneBench as distributed with BigCloneEval (https://github.com/jeffsvajlenko/BigCloneEval/), which itself includes https://github.com/clonebench/BigCloneBench and the IJaDataset source code files.
BigCloneBench
The original license of https://github.com/clonebench/BigCloneBench is CC-BY-NC-ND-4.0.
The data was originally distributed as an H2 database. The CSV files here were created by using the H2 database engine to export the original H2 tables verbatim. All CSVs are compressed with zstandard.
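Reading one of the compressed CSVs could look like the following minimal sketch. It assumes the third-party `zstandard` Python package is installed and that each CSV starts with a header row of column names; neither is stated in the dataset description. The helper names `open_zst_text`, `iter_rows`, and `count_rows` are hypothetical.

```python
import csv
import io


def open_zst_text(path, encoding="utf-8"):
    """Open a .csv.zst file as a decompressed text stream.

    Assumption: the third-party `zstandard` package is available.
    """
    import zstandard  # imported lazily so the CSV helpers work without it

    raw = open(path, "rb")
    reader = zstandard.ZstdDecompressor().stream_reader(raw)
    # newline="" lets the csv module handle line endings inside quoted fields
    return io.TextIOWrapper(reader, encoding=encoding, newline="")


def iter_rows(text_stream):
    """Yield each CSV row as a dict keyed by the header row (assumed present)."""
    yield from csv.DictReader(text_stream)


def count_rows(text_stream):
    """Count data rows without loading the whole file into memory."""
    return sum(1 for _ in iter_rows(text_stream))
```

Streaming the rows this way avoids decompressing a multi-million-row file to disk first; for example, `count_rows(open_zst_text("CLONES.csv.zst"))` would count the rows of a table export, where the file name `CLONES.csv.zst` is itself an assumption.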
In addition, CLONED_FUNCTIONS_SIM_ALL.csv.zst was created with this query:
SELECT * FROM CLONES AS C, FUNCTIONS AS F1, FUNCTIONS AS F2 WHERE FUNCTION_ID_ONE = F1.ID AND FUNCTION_ID_TWO = F2.ID
8584153 rows
In addition, CLONED_FUNCTIONS_SIM_05.csv.zst was created with this query:
SELECT * FROM CLONES AS C, FUNCTIONS AS F1, FUNCTIONS AS F2 WHERE FUNCTION_ID_ONE = F1.ID AND FUNCTION_ID_TWO = F2.ID AND SIMILARITY_TOKEN > 0.5
2720242 rows
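To illustrate what these joins produce, here is a small sketch that runs the same kind of query against a toy in-memory SQLite database rather than the real H2 data. Only the names CLONES, FUNCTIONS, FUNCTION_ID_ONE, FUNCTION_ID_TWO, ID, and SIMILARITY_TOKEN come from the queries above; the NAME column, the sample rows, and the helper functions are made up for illustration.

```python
import sqlite3


def build_toy_db():
    """Create a tiny stand-in for the CLONES and FUNCTIONS tables (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE FUNCTIONS (ID INTEGER PRIMARY KEY, NAME TEXT);
        CREATE TABLE CLONES (
            FUNCTION_ID_ONE INTEGER,
            FUNCTION_ID_TWO INTEGER,
            SIMILARITY_TOKEN REAL
        );
        INSERT INTO FUNCTIONS VALUES (1, 'a.java'), (2, 'b.java'), (3, 'c.java');
        INSERT INTO CLONES VALUES (1, 2, 0.9), (1, 3, 0.3), (2, 3, 0.5);
    """)
    return conn


# Each clone pair is joined to FUNCTIONS twice, once per side of the pair.
JOIN_SQL = """
    SELECT C.FUNCTION_ID_ONE, C.FUNCTION_ID_TWO, C.SIMILARITY_TOKEN,
           F1.NAME AS NAME_ONE, F2.NAME AS NAME_TWO
    FROM CLONES AS C, FUNCTIONS AS F1, FUNCTIONS AS F2
    WHERE C.FUNCTION_ID_ONE = F1.ID AND C.FUNCTION_ID_TWO = F2.ID
"""


def cloned_functions(conn, min_similarity=None):
    """Return joined clone pairs, optionally keeping only SIMILARITY_TOKEN > min_similarity."""
    sql, params = JOIN_SQL, ()
    if min_similarity is not None:
        sql += " AND C.SIMILARITY_TOKEN > ?"
        params = (min_similarity,)
    sql += " ORDER BY C.FUNCTION_ID_ONE, C.FUNCTION_ID_TWO"
    return conn.execute(sql, params).fetchall()
```

Because the comparison is strict, a pair with a token similarity of exactly 0.5 is excluded from the filtered result.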
The original notice and credits from https://github.com/clonebench/BigCloneBench follow:
Benchmark:
The benchmark is distributed under the Creative Commons, Attribution-NonCommercial-NoDerivatives. This license includes the benchmark database and its derivatives. For attribution, please cite this page, and our publications below. This data is provided free of charge for non-commercial and academic benchmarking and experimentation use. If you would like to contribute to the benchmark, please contact us. If you believe your intended usage may be restricted by the license, please contact us and we can discuss the possibilities.
The credits:
Publications
[1] Jeffrey Svajlenko, Judith F. Islam, Iman Keivanloo, Chanchal K. Roy and Mohammad Mamun Mia, "Towards a Big Data Curated Benchmark of Inter-Project Code Clones", In Proceedings of the Early Research Achievements track of the 30th International Conference on Software Maintenance and Evolution (ICSME 2014), 5 pp., Victoria, Canada, September 2014.
[2] Jeffrey Svajlenko and Chanchal K. Roy, "Evaluating Clone Detection Tools with BigCloneBench", In Proceedings of the 31st International Conference on Software Maintenance and Evolution (ICSME 2015), 10 pp., Bremen, Germany, September 2015.
[3] Jeffrey Svajlenko and Chanchal K. Roy, "BigCloneEval: A Clone Detection Tool Evaluation Framework with BigCloneBench", In Proceedings of the 32nd International Conference on Software Maintenance and Evolution (ICSME 2016).
Contact
Benchmark Maintainer: Jeffrey Svajlenko: jeff.svajlenko@gmail.com
Judith F. Islam: judithfran@gmail.com
Iman Keivanloo: iman.keivanloo@queensu.ca
Chanchal K. Roy: chanchal.roy@usask.ca
Acknowledgements
The following people have provided clone oracling efforts (in no particular order): Judith F. Islam, Mohammad Mamun Mia, Graeme Daly, Jeffrey Svajlenko, Chanchal Roy, Muhammad Asaduzzamn, Shamima Yeasmin, Manishankar Mondal, Mike Hoffert
IJADataSet
The IJA dataset consists of the Java files used in BigCloneBench, distributed as the zstd-compressed archive ijadataset-2016.tar.zst.
The license is assumed to be the same as that of https://github.com/clonebench/BigCloneBench (originally CC-BY-NC-ND-4.0), but individual files may carry other open source licenses.
The notice from https://github.com/clonebench/BigCloneBench is:
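The Java sources can be read straight out of the archive without unpacking it to disk. The following is a minimal sketch, assuming the third-party `zstandard` Python package is installed and that the Java members use a `.java` suffix; the helper names `open_tar_zst` and `iter_java_sources` are hypothetical.

```python
import io
import tarfile


def open_tar_zst(path):
    """Open a .tar.zst archive as a streaming tarfile.

    Assumption: the third-party `zstandard` package is available.
    """
    import zstandard  # imported lazily so the tar helper works without it

    raw = open(path, "rb")
    reader = zstandard.ZstdDecompressor().stream_reader(raw)
    # "r|" reads the tar as a forward-only stream, which suits a decompressor
    return tarfile.open(fileobj=reader, mode="r|")


def iter_java_sources(tar):
    """Yield (member_name, source_text) for each regular .java file in an open tarfile."""
    for member in tar:
        if member.isfile() and member.name.endswith(".java"):
            fh = tar.extractfile(member)
            # Crawled sources may not all be valid UTF-8, so undecodable bytes are replaced
            yield member.name, fh.read().decode("utf-8", errors="replace")
```

In streaming mode each member has to be read before moving on to the next one, which is why the generator decodes the file contents immediately instead of returning file handles.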
IJaDataset: We distribute here IJaDataset 2.0 with additions and modifications for the benchmark. The files contained within were crawled from open-source projects. Their in-file licenses are maintained as-is. Additionally, the benchmark database lists the source of each file, and their detected licensing. IJaDataset 2.0 is from the SECold Project: http://www.secold.org/projects/seclone.
http://www.secold.org/projects/seclone is no longer online. It is only available through the web archive at https://web.archive.org/web/20161231055842/http://www.secold.org/projects/seclone and carries no explicit license.
To address this issue, we also extracted an older, full version of the BigCloneBench dataset that contains license and origin information. It is provided as bcb-functions-with-licenses.sql.zst, an SQL dump of the Postgres database provided by https://github.com/clonebench/BigCloneBench
The data has not been verified. Therefore we consider this data usable for testing, but not for any training or deployment.
Created:
2025-02-26



