The Judge Effect in Two-Round Legal Debate on LegalBench

Authors

  • David Johnson University of South Florida
  • Onur Bilgin University of South Florida
  • John Licato

DOI:

https://doi.org/10.32473/flairs.39.1.141705

Keywords:

multi-agent debate, legal reasoning, large language models, LLM-as-a-judge, LegalBench

Abstract

Large language models can produce fluent, plausible legal analysis while still misapplying rules and outputting incorrect labels. Such errors are especially problematic in legal reasoning tasks, where they can be persuasive to non-specialist observers. In this paper, we use the term "judge effect" for the paired difference between an advocates-only protocol (NoJ) and a judge-augmented protocol (Judge LLM), and we study this effect in two-round legal debate on LegalBench contracts. Compared to an advocates-only baseline, the Judge LLM improves accuracy on contracts in all three same-model setups. We also report token cost explicitly.
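As a rough illustration of the paired difference described above, the following minimal sketch compares the two protocols on the same items. It is not the authors' evaluation code: the function name `judge_effect`, the dictionary keys, the label strings, and the sign convention (Judge LLM accuracy minus NoJ accuracy) are all assumptions made for the example.

```python
from typing import Sequence


def judge_effect(gold: Sequence[str], noj: Sequence[str], judge: Sequence[str]) -> dict:
    """Paired accuracy difference over the same items.

    Hypothetical helper: the sign convention (Judge LLM minus NoJ)
    is an assumption, not taken from the paper.
    """
    if not (len(gold) == len(noj) == len(judge)) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(gold)
    # Per-item correctness under each protocol.
    noj_correct = [p == g for p, g in zip(noj, gold)]
    judge_correct = [p == g for p, g in zip(judge, gold)]
    # Items wrong under NoJ but right with the judge, and the reverse.
    fixed = sum(j and not a for a, j in zip(noj_correct, judge_correct))
    broken = sum(a and not j for a, j in zip(noj_correct, judge_correct))
    return {
        "noj_accuracy": sum(noj_correct) / n,
        "judge_accuracy": sum(judge_correct) / n,
        "judge_effect": (fixed - broken) / n,
        "fixed": fixed,
        "broken": broken,
    }


# Toy labels, invented purely for illustration.
example = judge_effect(
    gold=["Yes", "No", "Yes", "No"],
    noj=["Yes", "Yes", "No", "No"],
    judge=["Yes", "No", "No", "No"],
)
```

Because both protocols are scored on identical items, the difference in accuracy equals the number of items the judge corrects minus the number it breaks, divided by the item count; reporting those two counts separately shows whether a small net effect hides many offsetting flips.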

Published

06-05-2026

How to Cite

Johnson, D., Bilgin, O., & Licato, J. (2026). The Judge Effect in Two-Round Legal Debate on LegalBench. The International FLAIRS Conference Proceedings, 39(1). https://doi.org/10.32473/flairs.39.1.141705