A Preliminary Empirical Study of Large Language Models for Grading Debugging Problems in Programming Education

Authors

Q. Pang, L. Zhang, B. Copus, S. Du

DOI:

https://doi.org/10.32473/flairs.39.1.141603

Abstract

Debugging problems are essential for assessing students' understanding of code semantics, yet grading the heterogeneous responses they elicit is labor-intensive and prone to inconsistency. This poster presents a preliminary empirical study evaluating five Large Language Models (LLMs), namely ChatGPT, Claude, Gemini, Grok, and DeepSeek, as automated grading assistants. Using authentic student submissions from two university Python courses, we compare LLM-assigned scores against rubric-based human benchmarks using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Pearson correlation. All five models correlate strongly with the human scores (r > 0.90), indicating that they reliably preserve student rankings. Scores for partially correct solutions still vary, but the findings suggest that LLMs are effective for preliminary scoring and triage, provided human oversight is maintained to catch occasional grading deviations.
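The three agreement metrics named in the abstract can be illustrated with a minimal sketch. The score lists below are hypothetical values chosen for illustration, not data from the study, and the function names are assumptions rather than the authors' code.

```python
import math


def mae(human, model):
    """Mean Absolute Error: average size of the grading deviation."""
    return sum(abs(h - m) for h, m in zip(human, model)) / len(human)


def rmse(human, model):
    """Root Mean Squared Error: like MAE, but penalizes large deviations more."""
    return math.sqrt(sum((h - m) ** 2 for h, m in zip(human, model)) / len(human))


def pearson(human, model):
    """Pearson correlation: how well the model tracks the human scores linearly."""
    n = len(human)
    mean_h = sum(human) / n
    mean_m = sum(model) / n
    cov = sum((h - mean_h) * (m - mean_m) for h, m in zip(human, model))
    var_h = sum((h - mean_h) ** 2 for h in human)
    var_m = sum((m - mean_m) ** 2 for m in model)
    return cov / math.sqrt(var_h * var_m)


# Hypothetical rubric scores (0-10) for five submissions; not from the study.
human_scores = [10, 8, 5, 7, 2]
model_scores = [9, 8, 6, 7, 3]

print(f"MAE     = {mae(human_scores, model_scores):.3f}")
print(f"RMSE    = {rmse(human_scores, model_scores):.3f}")
print(f"Pearson = {pearson(human_scores, model_scores):.3f}")
```

A high Pearson value alongside a nonzero MAE is the pattern the abstract describes: the model orders students much as the human grader does, even when individual scores differ by a point or so.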

Published

06-05-2026

How to Cite

Pang, Q., Zhang, L., Copus, B., & Du, S. (2026). A Preliminary Empirical Study of Large Language Models for Grading Debugging Problems in Programming Education. The International FLAIRS Conference Proceedings, 39(1). https://doi.org/10.32473/flairs.39.1.141603