Deduplication Index for Big Code Datasets
Source: IEEE | Updated: 2019-06-27 | Indexed: 2026-04-17
Download link:
https://ieee-dataport.org/open-access/deduplication-index-big-code-datasets
Description:
Code duplicates in large code corpora adversely affect the evaluation and use of machine learning models that rely on them, and most existing corpora suffer from this problem to some extent. This dataset contains a duplication index for some of the existing corpora used in Big Code research. The method used to collect it is described in "The Adverse Effects of Code Duplication in Machine Learning Models of Code" by Allamanis [arXiv, to appear in SPLASH 2019].
Provided by:
Microsoft Research
Created:
2019-06-27
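
Usage sketch:
The description above says the dataset is a duplication index over existing corpora, but this listing does not specify the file format. The sketch below is only an illustration of how such an index might be applied. It assumes the index can be loaded as a list of clusters, each cluster being a list of file paths considered duplicates of one another. The function name `keep_one_per_cluster` and all file paths are hypothetical and not part of the dataset.

```python
def keep_one_per_cluster(files, duplicate_clusters):
    """Drop all but one member of each duplicate cluster from `files`.

    Assumption: `duplicate_clusters` is a list of lists of file paths,
    which may not match the dataset's actual on-disk format.
    """
    to_drop = set()
    for cluster in duplicate_clusters:
        # Keep the lexicographically smallest path so the result is deterministic.
        members = sorted(cluster)
        to_drop.update(members[1:])
    # Preserve the original ordering of the surviving files.
    return [path for path in files if path not in to_drop]


# Hypothetical corpus and index, for illustration only.
files = ["a/Foo.java", "b/Foo.java", "c/Bar.java", "d/Baz.java"]
clusters = [["b/Foo.java", "a/Foo.java"]]

deduplicated = keep_one_per_cluster(files, clusters)
print(deduplicated)
```

Filtering before splitting into training and test sets is one way to keep duplicates of the same file from ending up on both sides of the split. Which cluster member to keep is a design choice; the sketch picks the smallest path purely for reproducibility.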



