Artifact for TOSEM Submission: GiantRepair
- I. Introduction
- II. Project Structure
- III. Environment
- IV. How to Run
- V. Ablation Results
- VI. Discussion Results
Automated Program Repair (APR) has garnered significant attention due to its potential to streamline the bug repair process for human developers. Recently, LLM-based APR methods have shown promise in repairing real-world bugs. However, existing APR methods often utilize patches generated by LLMs without further optimization, resulting in reduced effectiveness due to the lack of program-specific knowledge. Furthermore, the evaluations of these APR methods have typically been conducted under the assumption of perfect fault localization, which may not accurately reflect their real-world effectiveness. To address these limitations, this paper introduces an innovative APR approach called GIANTREPAIR. Our approach leverages the insight that LLM-generated patches, although not necessarily correct, offer valuable guidance for the patch generation process. Based on this insight, GIANTREPAIR first constructs patch skeletons from LLM-generated patches to confine the patch space, and then generates high-quality patches tailored to specific programs through context-aware patch generation by instantiating the skeletons. To evaluate the performance of our approach, we conduct two large-scale experiments. The results demonstrate that GIANTREPAIR not only effectively repairs more bugs (an average of 27.78% on Defects4J v1.2 and 23.40% on Defects4J v2.0) than using LLM-generated patches directly, but also outperforms state-of-the-art APR methods by repairing at least 42 and 7 more bugs under perfect and automated fault localization scenarios, respectively.
├── GiantRepair: GiantRepair's Java implementation
├── LLM_Inference: Code to apply LLMs to APR task
│ ├── Models
│ ├── run_apr.py
│ ├── script_runapr.sh
│ ├── test_llm.py
│ └── utils
├── README.md
├── doc
├── results: Specific results used in the paper.
└── d4j-info: Analysis results of the Defects4J and GrowingBugs datasets
├── filelist.json
├── growing_bugs_filelist.json
├── growing_bugs_single_function.json
├── growing_bugs_single_function_expand.json
├── linelist.json
└── single_function_repair.json
- OS: Linux (Tested on Ubuntu 20.04.6 LTS)
- OpenJDK 1.8.0_382 and OpenJDK 11.0.20.1
- Download and configure Defects4J and ExpressAPR.
- More runtime configurations can be found in the config-file.
- Python==3.9
- transformers==4.33.3
- Defects4J Setting:

```shell
defects4j checkout -p Chart -v 1b -w ${buggy_program_path}/chart/chart_1_buggy
```
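To check out several buggy programs at once, a loop along these lines can be used (the bug list and root directory are placeholders; the sketch only prints the commands, so it runs even without Defects4J on the PATH):

```shell
# Print defects4j checkout commands for a few sample bugs,
# following the <proj>/<proj>_<num>_buggy directory layout above.
buggy_program_path="/tmp/d4j"   # placeholder root directory
for bug in Chart:1 Lang:57 Math:27; do
  proj="${bug%%:*}"
  num="${bug##*:}"
  lower="$(printf '%s' "$proj" | tr '[:upper:]' '[:lower:]')"
  echo "defects4j checkout -p ${proj} -v ${num}b -w ${buggy_program_path}/${lower}/${lower}_${num}_buggy"
done
```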
- ExpressAPR Setting: shown in Link.
- Modify GiantRepair's settings in the config file, then run:

```shell
java -jar GiantRepair repair -d4j {bugid} -d4jhome {buggy_program_path} -modelname {modelName}
```

`bugid` should be of the form `proj_idnum`, all in lowercase (e.g., `chart_1`).
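A minimal sketch of deriving a well-formed `bugid` (the project, bug number, paths, and model name below are placeholders, not values prescribed by the artifact):

```shell
# Derive the lowercase proj_idnum bug id from a Defects4J project name
# and bug number (placeholder values: Chart, bug 1).
proj="Chart"
num=1
bugid="$(printf '%s' "$proj" | tr '[:upper:]' '[:lower:]')_${num}"
echo "$bugid"   # chart_1

# The resulting repair command (printed only; adjust paths/model first):
echo "java -jar GiantRepair repair -d4j ${bugid} -d4jhome /path/to/buggy_programs -modelname starcoder"
```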
In order to study the contribution of various components in GIANTREPAIR to the overall performance, we set up the following three variants:
- GiantRepair<sub>selection</sub> randomly selects code elements from the project to fill the code skeletons, rather than being constrained by syntactic rules.
- GiantRepair<sub>context</sub> tests the generated patches in the order of generation, rather than ranking them by similarity.
- GiantRepair<sub>adaptive</sub> randomly selects modifications from LLM patches, rather than applying coarse-grained modifications.
We conduct the experiment on Defects4J v1.2 single-function bugs; the results are shown in the following table:
Variant | #Plausible Fixes | #Correct Fixes | %Precision |
---|---|---|---|
GiantRepair<sub>selection</sub> | 123 | 46 | 37.40% |
GiantRepair<sub>context</sub> | 129 | 51 | 39.53% |
GiantRepair<sub>adaptive</sub> | 125 | 49 | 39.20% |
GiantRepair<sub>ori</sub> | 135 | 55 | 40.74% |
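The %Precision column is simply #Correct Fixes divided by #Plausible Fixes; e.g., for the first row (46 correct out of 123 plausible):

```shell
# Precision = correct fixes / plausible fixes * 100 (first row of the table).
awk 'BEGIN { printf "%.2f%%\n", 46 / 123 * 100 }'   # prints 37.40%
```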
This table shows the number of plausible fixes, correct fixes, and the precision value for each of the three variants. We first observe that randomly filling code skeletons yields the lowest number of plausible fixes and the lowest precision. Disabling the context similarity ranking or the adaptive application likewise reduces the number of plausible and correct fixes. As a result, all the components contribute to the overall effectiveness of GiantRepair, and GiantRepair can effectively produce more plausible/correct fixes by utilizing LLM-generated patches.
To investigate whether GiantRepair is still effective at repairing unique bugs when compared to the most advanced LLMs, we conducted another experiment with GPT-4. Specifically, we randomly selected ten bugs that were correctly repaired by GIANTREPAIR but not by the studied LLMs, and then invoked GPT-4 via API requests to generate 20 patches for each bug. The outcome is shown in the table below:
Bug ids | Closure-19 | Closure-36 | Closure-113 | Lang-57 | Math-27 | Math-85 | Cli-32 | Codec-4 | Compress-1 | Jsoup-33 |
---|---|---|---|---|---|---|---|---|---|---|
GPT-4-1106-preview | | | | | | | | | | |
GiantRepair | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
In the Data Leakage part of the Discussion, we not only showcase GiantRepair's effectiveness in addressing data leakage concerns by examining the StarCoder training dataset, but also seek to further substantiate this conclusion. To achieve this, we employed the GrowingBugs dataset for additional experimentation. Remarkably, GiantRepair managed to successfully rectify 10 of the 51 identified bugs. The detailed data are presented in the table below:
Project | #Single-Function Bugs | GiantRepair |
---|---|---|
Canvas_api | 2 | 1 |
Dosgi_common | 1 | 1 |
Hono_client | 2 | 0 |
Tika_app | 1 | 1 |
HttpClient5 | 2 | 0 |
JacksonDatatypeJsr310 | 1 | 0 |
JacksonModuleAfterburner | 1 | 1 |
Switchyard_admin | 1 | 1 |
Qpidjms_client | 1 | 0 |
Tiles_api | 1 | 0 |
Tiles_core | 2 | 0 |
Wicket_request | 5 | 0 |
Wicket_util | 4 | 1 |
Wicket_spring | 1 | 0 |
Struts1_core | 2 | 0 |
Wicket_core | 10 | 2 |
Cargo_container | 3 | 0 |
Jcodemodel | 1 | 1 |
Vectorz | 2 | 0 |
Restfixture | 2 | 0 |
Xades4j | 1 | 0 |
Render_app | 1 | 0 |
Leshan_core | 4 | 1 |
Total | 51 | 10 |