A Preliminary Empirical Study of Large Language Models for Grading Debugging Problems in Programming Education
DOI:
https://doi.org/10.32473/flairs.39.1.141603
Abstract
Debugging problems are essential for assessing code semantic understanding, yet grading these heterogeneous responses is labor-intensive and prone to inconsistency. This poster presents a preliminary empirical study evaluating five Large Language Models (LLMs)—ChatGPT, Claude, Gemini, Grok, and DeepSeek—as automated grading assistants. Using authentic student submissions from two university Python courses, we compare LLM performance against rubric-based human benchmarks using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Pearson correlation. Results show all models achieve strong correlation (γ > 0.90), indicating reliable preservation of student rankings. While variance in partially correct solutions persists, the findings suggest LLMs are effective for preliminary scoring and triage, provided human oversight is maintained to mitigate occasional grading deviations.
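As a rough illustration of the three agreement metrics named in the abstract, the sketch below computes MAE, RMSE, and the Pearson correlation between human and LLM scores using only the Python standard library. The function names and the sample scores are hypothetical and are not taken from the study; the study's actual data, rubric scale, and tooling are not described here.

```python
import math


def _check(human, model):
    # Both score lists must be non-empty and aligned by submission.
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")


def mae(human, model):
    """Mean Absolute Error: average size of the grading deviation."""
    _check(human, model)
    return sum(abs(h - m) for h, m in zip(human, model)) / len(human)


def rmse(human, model):
    """Root Mean Squared Error: penalizes large deviations more than MAE."""
    _check(human, model)
    return math.sqrt(sum((h - m) ** 2 for h, m in zip(human, model)) / len(human))


def pearson(human, model):
    """Pearson correlation: strength of the linear relation between the two score sets."""
    _check(human, model)
    n = len(human)
    mean_h = sum(human) / n
    mean_m = sum(model) / n
    cov = sum((h - mean_h) * (m - mean_m) for h, m in zip(human, model))
    var_h = sum((h - mean_h) ** 2 for h in human)
    var_m = sum((m - mean_m) ** 2 for m in model)
    if var_h == 0 or var_m == 0:
        raise ValueError("correlation is undefined when one score list is constant")
    return cov / math.sqrt(var_h * var_m)


# Hypothetical scores for five submissions (illustrative only, not study data).
human_scores = [10, 8, 5, 2, 0]
llm_scores = [9, 8, 6, 2, 1]

print(f"MAE     = {mae(human_scores, llm_scores):.3f}")
print(f"RMSE    = {rmse(human_scores, llm_scores):.3f}")
print(f"Pearson = {pearson(human_scores, llm_scores):.3f}")
```

A high correlation can coexist with a nonzero MAE or RMSE: the model may track the human scores closely while still being off by a point on individual submissions, which is why the error metrics and the correlation are reported together.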
License
Copyright (c) 2026 Qixiang Pang, Linrui Zhang, Belinda Copus, Shan Du

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.