
A CSV dataset for software code clones testing and the code to test them

Source: NIAID Data Ecosystem, indexed 2026-05-02
Download link:
https://zenodo.org/record/14926555
Description:
This is a benchmark dataset based on BigCloneBench, taken from https://github.com/jeffsvajlenko/BigCloneEval/, which itself includes https://github.com/clonebench/BigCloneBench and the IJaDataset source code files.

BigCloneBench

The original license of https://github.com/clonebench/BigCloneBench is CC-BY-NC-ND-4.0. The data is originally distributed as an h2 database. The CSV files here were created by using the h2 database engine to export the original h2 tables verbatim. All CSVs are compressed with zstandard.

In addition, CLONED_FUNCTIONS_SIM_ALL.csv.zst (8584153 rows) was created with this query:

    SELECT * FROM CLONES AS C, FUNCTIONS AS F1, FUNCTIONS AS F2 WHERE FUNCTION_ID_ONE = F1.ID AND FUNCTION_ID_TWO = F2.ID

CLONED_FUNCTIONS_SIM_05.csv.zst (2720242 rows) was created with the same join restricted to pairs whose token similarity exceeds 0.5:

    SELECT * FROM CLONES AS C, FUNCTIONS AS F1, FUNCTIONS AS F2 WHERE FUNCTION_ID_ONE = F1.ID AND FUNCTION_ID_TWO = F2.ID AND SIMILARITY_TOKEN > 0.5

The original notice and credits from https://github.com/clonebench/BigCloneBench follow.

Benchmark: The benchmark is distributed under the Creative Commons, Attribution-NonCommercial-NoDerivatives. This license includes the benchmark database and its derivatives. For attribution, please cite this page, and our publications below. This data is provided free of charge for non-commercial and academic benchmarking and experimentation use. If you would like to contribute to the benchmark, please contact us. If you believe your intended usage may be restricted by the license, please contact us and we can discuss the possibilities.

Publications

[1] Jeffrey Svajlenko, Judith F. Islam, Iman Keivanloo, Chanchal K. Roy and Mohammad Mamun Mia, "Towards a Big Data Curated Benchmark of Inter-Project Code Clones", In Proceedings of the Early Research Achievements track of the 30th International Conference on Software Maintenance and Evolution (ICSME 2014), 5 pp., Victoria, Canada, September 2014.
[2] Jeffrey Svajlenko and Chanchal K. Roy, "Evaluating Clone Detection Tools with BigCloneBench", In Proceedings of the 31st International Conference on Software Maintenance and Evolution (ICSME 2015), 10 pp., Bremen, Germany, September 2015.
[3] Jeffrey Svajlenko and Chanchal K. Roy, "BigCloneEval: A Clone Detection Tool Evaluation Framework with BigCloneBench", In Proceedings of the 32nd International Conference on Software Maintenance and Evolution (ICSME 2016).

Contact

Benchmark Maintainer: Jeffrey Svajlenko: jeff.svajlenko@gmail.com
Judith F. Islam: judithfran@gmail.com
Iman Keivanloo: iman.keivanloo@queensu.ca
Chanchal K. Roy: chanchal.roy@usask.ca

Acknowledgements

The following people have provided clone oracling efforts (in no particular order):
    Judith F. Islam
    Mohammad Mamun Mia
    Graeme Daly
    Jeffrey Svajlenko
    Chanchal Roy
    Muhammad Asaduzzaman
    Shamima Yeasmin
    Manishankar Mondal
    Mike Hoffert

IJADataSet

The IJA dataset is composed of the Java files used in BigCloneBench, zstd-compressed in ijadataset-2016.tar.zst. The license is assumed to be the same as that of https://github.com/clonebench/BigCloneBench, originally CC-BY-NC-ND-4.0, but each file may carry other open source licenses. The notice from https://github.com/clonebench/BigCloneBench is:

IJaDataset: We distribute here IJaDataset 2.0 with additions and modifications for the benchmark. The files contained within were crawled from open-source projects. Their in-file licenses are maintained as-is. Additionally, the benchmark database lists the source of each file, and their detected licensing. IJaDataset 2.0 is from the SECold Project: http://www.secold.org/projects/seclone.

http://www.secold.org/projects/seclone is only visible in the archive, at https://web.archive.org/web/20161231055842/http://www.secold.org/projects/seclone, and carries no explicit license. To address this issue, we also extracted an older, full version of the BigCloneBench dataset that contains license and origin information. This is the bcb-functions-with-licenses.sql.zst file, which contains an SQL dump from the Postgres database provided by https://github.com/clonebench/BigCloneBench.

The data has not been verified. Therefore we consider this data usable for testing, but not for any training or deployment.
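
To illustrate how the two joined CSV files relate to the CLONES and FUNCTIONS tables, here is a minimal sketch that runs the same kind of self-join on toy data. It makes several assumptions: it uses Python's built-in SQLite rather than the h2 engine used for the actual export, the NAME column and every row value are hypothetical, and the similarity filter is appended to the join condition with AND. Only the table names and the FUNCTION_ID_ONE, FUNCTION_ID_TWO, SIMILARITY_TOKEN and ID columns come from the description.

```python
import sqlite3

# Toy stand-ins for the exported tables. Only the columns named in the
# queries are modelled; NAME and all row values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FUNCTIONS (ID INTEGER PRIMARY KEY, NAME TEXT);
    CREATE TABLE CLONES (
        FUNCTION_ID_ONE INTEGER,
        FUNCTION_ID_TWO INTEGER,
        SIMILARITY_TOKEN REAL
    );
    INSERT INTO FUNCTIONS VALUES (1, 'a.java'), (2, 'b.java'), (3, 'c.java');
    INSERT INTO CLONES VALUES (1, 2, 0.9), (1, 3, 0.4), (2, 3, 0.5);
""")

# Join every clone pair with the records of both functions it refers to.
JOIN_ALL = """
    SELECT * FROM CLONES AS C, FUNCTIONS AS F1, FUNCTIONS AS F2
    WHERE C.FUNCTION_ID_ONE = F1.ID AND C.FUNCTION_ID_TWO = F2.ID
"""
# Same join, keeping only pairs with token similarity strictly above 0.5.
JOIN_SIM_05 = JOIN_ALL + " AND C.SIMILARITY_TOKEN > 0.5"

rows_all = conn.execute(JOIN_ALL).fetchall()
rows_05 = conn.execute(JOIN_SIM_05).fetchall()

print(len(rows_all), "pairs in total,", len(rows_05), "above the threshold")
```

Each result row holds the clone pair's columns followed by the columns of the first and then the second function, which is why the joined files are much wider than the CLONES table alone. Because the comparison is strict, a pair with similarity exactly 0.5 is left out of the filtered result.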
Created:
2025-02-26