A team of software engineering researchers from the Cheriton School of Computer Science has received an ACM SIGSOFT Distinguished Paper Award at MSR ’24, the 21st International Conference on Mining Software Repositories, held in Lisbon, Portugal. The prestigious award was conferred for their paper titled “Whodunit: Classifying Code as Human Authored or GPT-4 generated — A Case Study on CodeChef Problems.”
Led by recent master’s graduate Joy Idialu with her colleagues Noble Saji Mathews and Rungroj Maipradit, under the direction of Professors Jo Atlee and Mei Nagappan, the research focused on the use of code stylometry features — distinctive patterns and characteristics of programming code — to differentiate between human-authored code and code generated by artificial intelligence.
“Congratulations to Joy and her colleagues on winning an ACM SIGSOFT Distinguished Paper Award,” said University Professor Raouf Boutaba, Director of the Cheriton School of Computer Science. “While using AI assistants to generate code can increase developer productivity significantly, it is crucial for educators to assess if students have used generative AI in their assignments. This research provides important groundwork to help uphold academic integrity in programming courses.”
More about this award-winning research
Artificially intelligent coding assistants like GitHub Copilot, ChatGPT and CodeWhisperer, built on large language models, are changing how programming tasks are performed. These tools can boost developer productivity by suggesting code snippets, bug fixes, refactorings, and test cases. While undoubtedly helpful, the use of coding assistants in introductory programming courses raises concerns about academic integrity. Just as with written assignments, students might pass off AI-generated code as their own.
Programming courses already suffer from plagiarism and contract cheating. Current methods to detect plagiarism in student-submitted programs rely on automated similarity comparison tools. However, these tools are unlikely to detect AI-generated code because of its low similarity to student-authored code. Consequently, the goal of the team’s research was to build a classifier that can reliably distinguish between human-authored and AI-generated code. They hypothesized that code stylometry and machine learning classification can be used to distinguish between the two.
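To make the idea concrete, the sketch below shows, in Python, what a stylometry-based classifier might look like in its simplest form: a handful of surface-level features (line length, blank-line ratio, comment density, identifier length) extracted from each program and fed to an off-the-shelf scikit-learn classifier. The feature set and the RandomForestClassifier here are illustrative assumptions, not the features or model reported in the paper.

```python
# Illustrative sketch only: extract a few simple stylometric features from
# Python source strings and train a scikit-learn classifier on them.
# The feature set and model choice are assumptions, not the paper's pipeline.
import ast
import io
import tokenize

import numpy as np
from sklearn.ensemble import RandomForestClassifier


def stylometry_features(source: str) -> list[float]:
    """Turn one program into a small numeric feature vector."""
    lines = source.splitlines() or [""]
    mean_line_len = np.mean([len(line) for line in lines])
    blank_ratio = sum(1 for line in lines if not line.strip()) / len(lines)

    # Comment density, taken from the token stream.
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    comment_ratio = sum(t.type == tokenize.COMMENT for t in tokens) / max(len(tokens), 1)

    # Mean identifier length, taken from names appearing in the AST.
    names = [n.id for n in ast.walk(ast.parse(source)) if isinstance(n, ast.Name)]
    mean_name_len = np.mean([len(n) for n in names]) if names else 0.0

    return [mean_line_len, blank_ratio, comment_ratio, mean_name_len]


# Toy corpus: feature vectors X and labels y (0 = human-authored, 1 = AI-generated).
programs = [
    "x=1\nprint(x)\n",
    "total_sum = 0\nfor value in range(10):\n    total_sum += value\nprint(total_sum)\n",
]
labels = [0, 1]
X = [stylometry_features(p) for p in programs]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)
print(clf.predict([stylometry_features("answer = 42\nprint(answer)\n")]))
```

A real pipeline would, of course, compute far more features and train on thousands of labelled submissions rather than the two toy programs shown here; the sketch only illustrates the general shape of the hypothesis.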
The study’s dataset consisted of human-authored code in Python from CodeChef, an online educational and competitive programming platform, and AI-authored code, also in Python, generated by GPT-4. Their classifier outperformed baselines with an F1-score and AUC-ROC score (evaluation metrics of a classification model’s performance) of 0.91, demonstrating its potential as a preliminary tool for identifying AI-generated code. When what are known as gameable features were excluded, that is, features of the code that can be easily and strategically changed or avoided with little effort to mask AI-generated code, the classifier still achieved an F1-score and AUC-ROC score of 0.89.
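For readers unfamiliar with the two metrics, the short snippet below shows how an F1-score and an AUC-ROC score are typically computed with scikit-learn; the labels and predicted scores are made-up values for illustration, not data from the study.

```python
# Illustration of the two evaluation metrics on made-up predictions;
# these numbers are not the study's data.
from sklearn.metrics import f1_score, roc_auc_score

y_true = [0, 0, 0, 1, 1, 1, 1, 0]                    # 0 = human-authored, 1 = GPT-4 generated
y_pred = [0, 0, 1, 1, 1, 1, 0, 0]                    # hard class predictions from a classifier
y_score = [0.1, 0.2, 0.6, 0.9, 0.8, 0.7, 0.4, 0.3]   # predicted probability of class 1

print("F1:     ", f1_score(y_true, y_pred))          # balances precision and recall
print("AUC-ROC:", roc_auc_score(y_true, y_score))    # ranking quality across all thresholds
```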
They also evaluated their classifier across programming tasks of varying difficulty. It performed almost identically on easy and intermediate problems and only slightly worse on harder ones, providing strong evidence that code stylometry is a promising approach for distinguishing GPT-4 generated code from human-authored code.
To learn more about the research on which this article is based, please see Oseremen Joy Idialu, Noble Saji Mathews, Rungroj Maipradit, Joanne M. Atlee, and Meiyappan Nagappan. Whodunit: Classifying Code as Human Authored or GPT-4 generated — A Case Study on CodeChef Problems. arXiv preprint arXiv:2403.04013, 2024.