Sudoku Sage
Evaluating Correctness of LLM-Generated Moves as a Constraint Satisfaction Task
DOI:
https://doi.org/10.32473/flairs.39.1.141841
Abstract
Large Language Models (LLMs) frequently fail on constraint satisfaction problems, where correctness is binary and violations are immediately detectable. Sudoku is a useful test case for studying such failures because every proposed move must obey strict row, column, and subgrid constraints. In this work, we evaluate the correctness of LLM-generated Sudoku moves across puzzles of varying difficulty, where difficulty is defined by the number of missing cells and their distribution across the grid. The model is prompted to propose a single candidate move given only a textual representation of the current board, with no solver-derived information, verification, or feedback provided at inference time. Model performance is measured as the fraction of proposed moves that match a verified solution. Our results show that move correctness depends strongly on puzzle sparsity: accuracy remains high for low-sparsity puzzles, where constraints are explicit and many moves are forced, but degrades sharply as sparsity increases and the space of plausible candidate moves expands. These findings characterize a clear limitation of ungrounded LLM prompting, in which the model sees only the current board state, and highlight the challenges posed by under-determined decision settings.
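The constraint check and the accuracy metric described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the board encoding (a 9x9 list of lists with 0 for an empty cell), the `(row, col, digit)` move format, and the function names `sparsity`, `violates_constraints`, `is_correct`, and `move_accuracy` are all assumptions made for illustration. In particular, `sparsity` covers only the count of missing cells, not their distribution across the grid.

```python
from typing import List, Sequence, Tuple

Board = List[List[int]]       # assumed encoding: 9x9 grid, 0 marks an empty cell
Move = Tuple[int, int, int]   # assumed format: (row, col, digit), zero-based indices


def sparsity(board: Board) -> float:
    """Fraction of the 81 cells that are empty (hypothetical difficulty proxy)."""
    return sum(cell == 0 for row in board for cell in row) / 81


def violates_constraints(board: Board, move: Move) -> bool:
    """Return True if the move breaks a row, column, or subgrid constraint."""
    r, c, d = move
    if not (0 <= r < 9 and 0 <= c < 9 and 1 <= d <= 9):
        return True                      # out-of-range cell or digit
    if board[r][c] != 0:
        return True                      # target cell is already filled
    if d in board[r]:
        return True                      # digit already present in the row
    if any(board[i][c] == d for i in range(9)):
        return True                      # digit already present in the column
    br, bc = 3 * (r // 3), 3 * (c // 3)  # top-left corner of the 3x3 subgrid
    return any(board[i][j] == d
               for i in range(br, br + 3)
               for j in range(bc, bc + 3))


def is_correct(board: Board, solution: Board, move: Move) -> bool:
    """A move counts as correct if it fills an empty cell with the solution digit."""
    r, c, d = move
    if not (0 <= r < 9 and 0 <= c < 9):
        return False
    return board[r][c] == 0 and solution[r][c] == d


def move_accuracy(board: Board, solution: Board, moves: Sequence[Move]) -> float:
    """Fraction of proposed moves that match the verified solution."""
    if not moves:
        return 0.0
    return sum(is_correct(board, solution, m) for m in moves) / len(moves)
```

Note that a move can satisfy all three constraints on the current board and still disagree with the verified solution. That gap grows as more cells are empty, which is consistent with the reported drop in accuracy at high sparsity.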
License
Copyright (c) 2026 Aidan Gillespie, Vladimir Serov, Drew Phelps, Amr Hilal

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.