Background: Generative AI has gained rapid and widespread acclaim since the introduction of recent models that have opened up opportunities not previously possible. Large Language Models (LLMs), a subset of generative AI, have become an essential part of code generation for software development. However, there is always a risk that the generated code does not fulfill the programmer's intent and contains faults or bugs that can go unnoticed. To address this, we propose that verification of generated code should increase its quality and trustworthiness.
Objectives: This thesis aims to investigate the generation of code that is both functionally correct and verifiable by implementing and evaluating four prompting approaches and a reinforcement learning solution that uses unit-test and verification rewards to increase robustness in code generation.
Methods: We used a Rapid Literature Review (RLR) and the Design Science methodology to obtain an overview of the current state of robust code generation. Based on the RLR and related work, we evaluated four prompting approaches: Base prompt, Documentation prompting, In-context learning, and Documentation + In-context learning, on two datasets, MBPP and HumanEval. Moreover, we fine-tuned one model using Proximal Policy Optimization (PPO) for this novel task.
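As an illustration only (the exact reward design is specified in the thesis body; the function and weights below are assumptions, not the actual implementation), the PPO reward signal can be thought of as combining a unit-test outcome with a Nagini static-verification outcome:

    def combined_reward(passes_tests: bool, verifies: bool,
                        w_test: float = 1.0, w_verify: float = 1.0) -> float:
        # Illustrative sketch: reward the policy for code that passes the
        # dataset's unit tests and for code that the Nagini verifier accepts.
        # The weights are hypothetical placeholders, not the thesis's values.
        return w_test * float(passes_tests) + w_verify * float(verifies)

A reward of this shape lets PPO optimize for functional correctness and verifiability jointly rather than for either signal alone.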
Results: We measured functional correctness and static verification success rates, among other metrics, for the four proposed approaches across eight model configurations, including the PPO fine-tuned LLM. Our results show that, on average for the MBPP dataset, In-context learning achieved the highest functional correctness at 29.4% pass@1, Documentation prompting achieved the highest verifiability at 8.48% verifiable@1, and In-context learning achieved the highest rate of functionally correct and verifiable code at 3.2% pass@1 & verifiable@1. Moreover, the PPO fine-tuned model showed an overall increase in performance across all approaches compared to the pre-trained base model.
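For context, pass@1 and verifiable@1 are assumed here to follow the unbiased pass@k estimator commonly used with HumanEval, with $c$ counting, per problem, the generated samples (out of $n$) that pass the unit tests or the Nagini verifier, respectively:

    \text{pass@}k = \mathbb{E}_{\text{problems}}\left[ 1 - \binom{n-c}{k} \Big/ \binom{n}{k} \right], \qquad \text{pass@}1 = \mathbb{E}_{\text{problems}}\left[ c/n \right]

Under this reading, pass@1 & verifiable@1 counts only samples that satisfy both criteria.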
Conclusions: We found that In-context learning on the PPO fine-tuned model yielded the best overall results across most metrics compared to the other approaches. The PPO fine-tuned model with In-context learning achieved 32.0% pass@1, 12.8% verifiable@1, and 5.0% pass@1 & verifiable@1. Documentation prompting performed better for verifiable@1 on MBPP but did not perform as well on the other metrics. Documentation prompting + In-context learning performed between Documentation prompting and In-context learning, while Base prompt performed the worst overall. For future work, we envision several improvements to PPO training, including but not limited to training on Nagini documentation and utilizing expert iteration to create supervised fine-tuning datasets to improve the model iteratively.