The Judge Effect in Two-Round Legal Debate on LegalBench
DOI: https://doi.org/10.32473/flairs.39.1.141705
Keywords: multi-agent debate, legal reasoning, large language models, LLM-as-a-judge, LegalBench
Abstract
Large language models can produce fluent, plausible legal analysis while still misapplying rules and outputting incorrect labels. Such errors are especially problematic in legal reasoning tasks, where they can be persuasive to non-specialist observers. In this paper, we use the term "judge effect" for the paired difference between an advocates-only protocol (NoJ) and a judge-augmented protocol (Judge LLM), and we study this effect in two-round legal debate on LegalBench contracts. Compared to an advocates-only baseline, the Judge LLM improves accuracy on contracts in all three same-model setups. We also report token cost explicitly.
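To make the definition above concrete, the following is a minimal sketch of how a paired difference between the two protocols could be computed when both are scored on the same items. It is an illustration under stated assumptions, not the authors' implementation: the function name `judge_effect`, the sign convention (Judge LLM minus NoJ), the use of per-item correctness as the paired quantity, and the toy labels are all hypothetical.

```python
from typing import Sequence


def judge_effect(
    noj_preds: Sequence[str],
    judge_preds: Sequence[str],
    gold: Sequence[str],
) -> float:
    """Mean paired difference in per-item correctness.

    Assumption: the effect is taken as Judge LLM minus NoJ, so a positive
    value means the judge-augmented protocol was correct more often.
    """
    if not (len(noj_preds) == len(judge_preds) == len(gold)):
        raise ValueError("all three sequences must cover the same items")
    if len(gold) == 0:
        raise ValueError("at least one item is required")

    # Pair the two protocols item by item: +1 if only the judge protocol is
    # correct, -1 if only the advocates-only protocol is, 0 otherwise.
    diffs = [
        int(j == g) - int(n == g)
        for n, j, g in zip(noj_preds, judge_preds, gold)
    ]
    return sum(diffs) / len(diffs)


# Toy labels, invented purely for illustration (not results from the paper).
gold = ["Yes", "No", "Yes", "No"]
noj = ["Yes", "Yes", "No", "No"]      # 2 of 4 correct
judge = ["Yes", "No", "No", "No"]     # 3 of 4 correct

effect = judge_effect(noj, judge, gold)
```

On this toy input the judge protocol fixes one item and breaks none, so `effect` is 0.25. Pairing on the same items, rather than comparing two independently sampled accuracies, is what makes the difference attributable to the protocol change.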
License
Copyright (c) 2026 David Johnson, Onur Bilgin, John Licato

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.