Online Appendix of "Toward Interactive Optimization of Source Code Differences: An Empirical Study of Its Performance"
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/13618978
Resource description:
Abstract
This is the dataset of the paper entitled "Toward Interactive Optimization of Source Code Differences: An Empirical Study of Its Performance", presented at SCAM 2024. It contains information related to the target commits and their attributes, as well as simulation results pertaining to the research questions in the paper.
The target projects and commits in the dataset are based on a prior study: Nugroho, et al.: "How different are different diff algorithms in Git?: Use --histogram for code changes", Empirical Software Engineering, 2020, https://doi.org/10.1007/s10664-019-09772-z
Survey Overview
1. Filtration
We collected the following attributes of each change, both for filtering the commits and for answering the research questions (RQs).
Number of lines
Number of changed lines
Similarity distance
Number of mismatch diff areas
2. RQ1
We investigated the minimum number of feedback actions needed to correct the initial diffs to the target diffs. To reduce the cost of the simulation, we used two types of heuristic functions: a non-admissible one and an admissible one. When the search from the initial state with the non-admissible heuristic did not match the ideal optimal result, i.e., when there was room for improvement in the number of feedback actions, we ran another A* search with the admissible heuristic.
We obtained the following results through search:
Number of feedback actions (A* search with non-admissible heuristic)
Number of feedback actions (A* search with admissible heuristic)
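The two-phase procedure described above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: the graph, both heuristic tables, the function names, and the `ideal` bound are all hypothetical and chosen only to show how a non-admissible (overestimating) heuristic can return more steps than necessary, and how a second search with an admissible heuristic recovers the minimum.

```python
import heapq
from typing import Callable, Dict, Hashable, List, Optional

Graph = Dict[Hashable, List[Hashable]]
Heuristic = Callable[[Hashable], float]


def a_star_steps(graph: Graph, start: Hashable, goal: Hashable,
                 heuristic: Heuristic) -> Optional[int]:
    """Return the number of unit-cost steps A* finds from start to goal."""
    counter = 0  # tie-breaker so nodes themselves are never compared
    frontier = [(heuristic(start), counter, 0, start)]  # (f, tie, g, node)
    best_g = {start: 0}
    while frontier:
        _, _, g, node = heapq.heappop(frontier)
        if node == goal:
            return g
        if g > best_g.get(node, float("inf")):
            continue  # stale queue entry
        for nxt in graph.get(node, []):
            new_g = g + 1  # every action costs one step in this toy model
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                counter += 1
                heapq.heappush(
                    frontier, (new_g + heuristic(nxt), counter, new_g, nxt))
    return None  # goal unreachable


def two_phase_search(graph: Graph, start: Hashable, goal: Hashable,
                     fast_h: Heuristic, admissible_h: Heuristic,
                     ideal: int) -> Optional[int]:
    """Try the non-admissible heuristic first; rerun only if it misses `ideal`.

    `ideal` is a hypothetical stand-in for the "ideal optimal result".
    """
    quick = a_star_steps(graph, start, goal, fast_h)
    if quick is not None and quick <= ideal:
        return quick
    return a_star_steps(graph, start, goal, admissible_h)


# Hypothetical toy graph: S-A-G takes 2 steps, S-B-C-G takes 3 steps.
GRAPH: Graph = {"S": ["A", "B"], "A": ["G"], "B": ["C"], "C": ["G"], "G": []}
INFLATED = {"S": 2, "A": 10, "B": 1, "C": 1, "G": 0}    # overestimates at A
ADMISSIBLE = {"S": 2, "A": 1, "B": 2, "C": 1, "G": 0}   # never overestimates

steps_fast = a_star_steps(GRAPH, "S", "G", INFLATED.__getitem__)
steps_exact = a_star_steps(GRAPH, "S", "G", ADMISSIBLE.__getitem__)
steps_two_phase = two_phase_search(
    GRAPH, "S", "G", INFLATED.__getitem__, ADMISSIBLE.__getitem__, ideal=2)
```

In this toy setting the first search is steered away from the shorter route by the inflated estimate at `A`, which is the kind of gap that triggers the second, admissible search.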
3. RQ2
We investigated the effects that individual feedback actions have on the diffs by examining the diffs at depth 1 of the search tree. For each search problem, the dataset records the maximum, minimum, median, mean, and standard deviation of the following measures:
Similarity distance
Number of mismatch diff areas
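As an illustration of the per-problem summary described above, the sketch below computes the five statistics for one list of values using Python's standard `statistics` module. The function name and the sample values are hypothetical, and the use of the population standard deviation is an assumption: the dataset description does not say whether the population or the sample formula was used.

```python
import statistics
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Summarize one measure over the depth-1 diffs of a search problem."""
    return {
        "min": min(values),
        "max": max(values),
        "ave": statistics.mean(values),
        "median": statistics.median(values),
        # Assumption: population standard deviation (pstdev), not sample (stdev).
        "sd": statistics.pstdev(values),
    }


# Hypothetical similarity distances of the diffs at depth 1.
similarity_distances = [2, 4, 4, 4, 5, 5, 7, 9]
summary = summarize(similarity_distances)
```

The keys mirror the suffixes of the `RQ2#candidate_*` and `RQ2#area_*` columns, so the same helper could be applied to either measure.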
Dataset Columns
The columns of the CSV file and their descriptions are as follows.
| Column name | Description |
| --- | --- |
| project_name | Name of the project associated with the data. |
| filename | Name of the file being analyzed. |
| filepath | Path to the file within the project. |
| commit_id | Commit hash representing the new version of the file. |
| parent_commit | Commit hash representing the old version of the file. |
| error_commit | Error that occurred when retrieving the commit from the repository. |
| error_setup | Error that occurred when generating the new and old versions of the file. |
| error_analyze | Error that occurred when collecting information for filtering. |
| new_loc | Number of lines in the new version of the source code. |
| old_loc | Number of lines in the old version of the source code. |
| histogram_len | Path length of the diff when using the Histogram algorithm. |
| myers_len | Path length of the diff when using the Myers algorithm. |
| histogram-myers#edge | Number of differing edges between the Histogram and Myers diffs. |
| histogram-dp#edge | Number of differing edges between the Histogram diff and the initial diff. |
| myers-dp#edge | Number of differing edges between the Myers diff and the initial diff. |
| histogram-myers#area | Number of mismatch diff areas between the Histogram and Myers diffs. |
| histogram-dp#area | Number of mismatch diff areas between the Histogram diff and the initial diff. |
| myers-dp#area | Number of mismatch diff areas between the Myers diff and the initial diff. |
| dp#candidate | Number of feedback candidates for the initial diff (similarity distance). |
| #insert | Number of lines added in the change. |
| #delete | Number of lines deleted in the change. |
| #change | #insert + #delete. |
| error_Asearch | Error that occurred during A* search with a non-admissible heuristic. |
| #feedback_A | Number of feedback actions for A* search with a non-admissible heuristic. |
| time_A | Time taken for A* search with a non-admissible heuristic (ms). |
| RQ1_error_iteration | Error raised when the number of iterations exceeds the limit (10,000,000) during A* search with an admissible heuristic. |
| RQ1_error_timeout | Error raised when the search time exceeds the limit (1,800 seconds). |
| RQ1_error_other | Other errors encountered during A* search with an admissible heuristic. |
| RQ1#feedback | Number of feedback actions for A* search with an admissible heuristic. |
| RQ1#iter | Number of iterations for A* search with an admissible heuristic. |
| RQ1_time | Time taken for A* search with an admissible heuristic. |
| RQ2_error_exceed | Error raised when the time or iteration limit is exceeded in RQ2. |
| RQ2_error_other | Other errors encountered during RQ2. |
| RQ2#children | Number of child nodes of the initial state (= similarity distance). |
| RQ2#candidate_min | Minimum similarity distance among the generated diffs. |
| RQ2#candidate_max | Maximum similarity distance among the generated diffs. |
| RQ2#candidate_ave | Average similarity distance among the generated diffs. |
| RQ2#candidate_median | Median similarity distance among the generated diffs. |
| RQ2#candidate_sd | Standard deviation of the similarity distance among the generated diffs. |
| RQ2#area_min | Minimum number of mismatch diff areas among the generated diffs. |
| RQ2#area_max | Maximum number of mismatch diff areas among the generated diffs. |
| RQ2#area_ave | Average number of mismatch diff areas among the generated diffs. |
| RQ2#area_median | Median number of mismatch diff areas among the generated diffs. |
| RQ2#area_sd | Standard deviation of the number of mismatch diff areas among the generated diffs. |
| is_used | Indicates whether this data is used in the results of RQ1 and RQ2. |
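To show how these columns might be consumed, here is a minimal sketch that reads rows with Python's `csv` module, keeps only those flagged by `is_used`, and checks the documented relation `#change` = `#insert` + `#delete`. Several things are assumptions: the CSV file name is not stated here, so the sketch reads from an in-memory sample; the sample rows are invented for illustration and are not taken from the dataset; and the textual encoding of `is_used` (here `True`/`False`) is a guess that may need adjusting.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO

# Invented sample rows for illustration only; not real dataset values.
SAMPLE_CSV = """project_name,#insert,#delete,#change,is_used
demo-a,3,2,5,True
demo-b,0,4,4,False
demo-c,1,1,2,True
"""


def load_used_rows(handle: TextIO,
                   truthy: Iterable[str] = ("true", "1", "yes")) -> List[Dict[str, str]]:
    """Return the rows whose is_used value looks truthy (encoding is assumed)."""
    accepted = {t.lower() for t in truthy}
    return [row for row in csv.DictReader(handle)
            if row["is_used"].strip().lower() in accepted]


def change_is_consistent(row: Dict[str, str]) -> bool:
    """Check the documented relation #change = #insert + #delete."""
    return int(row["#change"]) == int(row["#insert"]) + int(row["#delete"])


used = load_used_rows(io.StringIO(SAMPLE_CSV))
used_projects = [row["project_name"] for row in used]
all_consistent = all(change_is_consistent(row) for row in used)
```

For the actual file, the same functions could be called with a handle from `open(path, newline="")`, where `path` is wherever the downloaded CSV was saved.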
Created:
2024-08-31



