Please note: This PhD defence will take place online.
Murray Dunne, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Sebastian Fischmeister
Vibe coding is the practice of using a Large Language Model (LLM) to iteratively generate software code. It is widespread: 36% of workers at technology companies reported adopting generative artificial intelligence for software engineering in 2024 [1]. At this rate of adoption, LLM-generated code is quickly becoming part of everyday cyber-physical infrastructure. However, LLM-generated code poses threats to dependability, exhibiting faults such as buffer overflows, out-of-bounds writes, and integer overflows. In this work, we contribute methods for improving the dependability of these systems in three parts: preventing faults by mitigating weaknesses in LLM-generated C code, protecting LLM code generation from poisoning attacks, and detecting faults in production embedded systems through power side-channel analysis.
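For illustration, consider a minimal (hypothetical) example of the kind of memory weakness this work targets: an out-of-bounds write in C, where generated code copies an input into a fixed-size buffer without checking its length.

    #include <string.h>

    /* Illustrative sketch of a common weakness in generated C code:
     * an out-of-bounds write. The buffer holds 8 bytes, but strcpy
     * performs no bounds check, so any id of 8 or more characters
     * (plus its terminator) overflows the buffer. */
    void store_id(const char *id)
    {
        char buf[8];
        strcpy(buf, id);   /* unchecked copy: overflows for long ids */
        /* ... use buf ... */
    }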
This work begins with an examination and categorization of weaknesses in LLM-generated code for embedded systems. Our findings suggest that LLMs perform poorly at programming tasks involving direct interactions with memory. Scores on existing benchmarks for LLM-generated C code do not adequately express this difficulty, as those benchmarks do not include sufficiently complex interactions with memory. To support future testing of LLMs, we introduce EmbedEval, a dataset of C coding challenge prompts and tests that benchmarks LLMs on memory-dependent tasks.
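As an illustrative sketch of what a memory-dependent challenge can look like (the task, names, and format here are ours, not necessarily EmbedEval's exact schema), a prompt asks for a function with byte-level memory behaviour, and an accompanying test checks that behaviour directly:

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Prompt given to the LLM (paraphrased): "Write pack_u16(dst, values, n)
     * that writes n 16-bit values into dst in little-endian byte order." */

    /* A correct reference implementation, included so the test below runs;
     * in the benchmark, this body would come from the model under test. */
    void pack_u16(uint8_t *dst, const uint16_t *values, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            dst[2 * i]     = (uint8_t)(values[i] & 0xFF);
            dst[2 * i + 1] = (uint8_t)(values[i] >> 8);
        }
    }

    /* The test passes only if the generated code handles byte-level
     * memory layout correctly, regardless of host endianness. */
    int main(void)
    {
        uint8_t out[4] = {0};
        const uint16_t in[2] = {0x1234, 0xABCD};
        pack_u16(out, in, 2);
        assert(out[0] == 0x34 && out[1] == 0x12);
        assert(out[2] == 0xCD && out[3] == 0xAB);
        return 0;
    }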
Retrieval-Augmented Generation (RAG) is an essential tool for vibe coding, but it presents new threats to dependability that we address with canary functions. If an attacker can cause the RAG system to retrieve documents they have crafted, they can induce the LLM to generate code with specific weaknesses. To mitigate this attack, we introduce canary functions, a process by which specific functions in the codebase are regenerated and re-tested to determine whether newly added documents induce new weaknesses. We show this approach is effective in detecting poisoned documents and suggest metrics for selecting appropriate canary functions. Canary functions provide a fault prevention mechanism for protecting RAG systems.
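A minimal sketch of the idea (the function, test, and guard-byte technique here are our own illustration; the real selection and regeneration pipeline is external to the code): a canary function is periodically regenerated by the LLM against the current RAG document set, then re-run against fixed tests whose failure signals an induced weakness.

    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* Canary function: routinely regenerated by the LLM and re-tested.
     * This reference version truncates rather than overflowing. */
    size_t copy_id(char *dst, size_t cap, const char *src)
    {
        size_t n = strlen(src);
        if (cap == 0)
            return 0;
        if (n >= cap)
            n = cap - 1;       /* truncate instead of overflowing */
        memcpy(dst, src, n);
        dst[n] = '\0';
        return n;
    }

    /* Canary test: a guard byte placed after the buffer detects
     * out-of-bounds writes. If a poisoned document steers regeneration
     * toward an unsafe pattern (e.g., plain strcpy), the guard is
     * clobbered and the test fails. */
    int main(void)
    {
        char mem[9];
        mem[8] = 0x7F;                           /* guard byte */
        copy_id(mem, 8, "this-input-is-too-long");
        assert(mem[8] == 0x7F);                  /* guard intact */
        assert(mem[7] == '\0');                  /* NUL-terminated */
        return 0;
    }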
Finally, we approach dependability from the detection perspective using fuzz testing. Fuzz testing is a powerful vulnerability detection technique, but the unavailability of source code makes it infeasible in most embedded settings. We suggest using power side-channel analysis as a feedback mechanism for a fuzzer, determining whether a fuzzing input has elicited a new response from the system. We show that responses involving five or more memory-interacting instructions are consistently detectable, and that these techniques can be refined to allow the fuzzer to explore new execution paths.
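The following sketch is our own simplification of such a feedback signal, not the thesis's implementation: assuming power traces have already been captured and aligned as fixed-length arrays (real capture is hardware-specific), an input is deemed interesting when its trace is sufficiently far from every trace seen so far.

    #include <math.h>
    #include <stddef.h>

    #define TRACE_LEN 1024

    /* Euclidean distance between two aligned power traces. */
    static double trace_dist(const float *a, const float *b)
    {
        double sum = 0.0;
        for (size_t i = 0; i < TRACE_LEN; i++) {
            double d = (double)a[i] - (double)b[i];
            sum += d * d;
        }
        return sqrt(sum);
    }

    /* Feedback signal for the fuzzer: returns 1 if the trace is far
     * from all known responses (keep this input and mutate it further),
     * 0 otherwise. The threshold would be calibrated per target. */
    int is_new_response(const float *trace,
                        const float known[][TRACE_LEN], size_t n_known,
                        double threshold)
    {
        for (size_t i = 0; i < n_known; i++)
            if (trace_dist(trace, known[i]) < threshold)
                return 0;
        return 1;
    }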
Abstract Citation
[1] Alex Singla, Alexander Sukharevsky, Lareina Yee, Michael Chui, and Bryce Hall. "The state of AI: How organizations are rewiring to capture value", McKinsey & Company, March 2025.