Error Analysis of Agentic Tool-Augmented Reasoning in LLMs on NeurIPS CURE-Bench Challenge
DOI:
https://doi.org/10.32473/flairs.39.1.141859
Keywords:
AI in healthcare, Healthcare Informatics, Agentic AI, Clinical decision support systems, Large Language Models
Abstract
Agentic tool-augmented large language models are a common design for therapeutic decision support. During inference, the model can issue structured calls to external resources, such as drug label endpoints, and then cite the returned evidence. In a clinical workflow, accuracy is necessary but not sufficient. In this paper, we report an error analysis of NeurIPS CURE-Bench challenge submission artifacts. This audit builds on our prior plan-act-verify submission, which reached 0.696 accuracy on the hidden test set. We audited 2,079 test questions using the tool call logs, final answer letters, and free-text explanations.
Our audit highlights three reliability gaps that matter in practice. Tool calling fails at scale: missing required parameters account for 342,515 cases, more than 99 percent of all failures. Duplicated question stems showed extreme instability, with 154 of 155 repeats receiving different answer letters. The audit is quantitative and can be implemented with standard logs, without additional annotations. We translate these findings into a deployment audit protocol for healthcare AI systems. The protocol emphasizes interface validation for tool calls, consistency checks on repeated questions, and evidence-first monitoring in which tool outcomes are treated as first-class outputs.
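Two of the checks named in the protocol, interface validation for tool calls and consistency checks on repeated questions, can be sketched from standard logs as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the log field names (tool, arguments, question, answer), the tool name get_drug_label, and the required-parameter table REQUIRED are all hypothetical.

```python
from collections import defaultdict

# Hypothetical table of required parameters per tool (assumption, not from the paper).
REQUIRED = {"get_drug_label": ["drug_name"]}


def missing_required(call, required=REQUIRED):
    """Return the required parameter names that a logged tool call omitted."""
    args = call.get("arguments") or {}  # treat a null argument object as empty
    return [name for name in required.get(call["tool"], []) if name not in args]


def unstable_stems(records):
    """Map each repeated question stem to its answer letters when they disagree."""
    letters_by_stem = defaultdict(set)
    for record in records:
        # Normalize case and whitespace so trivially different copies match.
        stem = " ".join(record["question"].lower().split())
        letters_by_stem[stem].add(record["answer"])
    return {
        stem: sorted(letters)
        for stem, letters in letters_by_stem.items()
        if len(letters) > 1
    }


if __name__ == "__main__":
    # Illustrative log entries only; these are not data from the audit.
    calls = [
        {"tool": "get_drug_label", "arguments": {"drug_name": "warfarin"}},
        {"tool": "get_drug_label", "arguments": {}},
    ]
    failures = [c for c in calls if missing_required(c)]
    print(f"{len(failures)} of {len(calls)} calls are missing required parameters")

    answers = [
        {"question": "Which drug interacts with warfarin?", "answer": "A"},
        {"question": "which drug  interacts with warfarin?", "answer": "C"},
    ]
    print(unstable_stems(answers))
```

Both functions read only fields that a typical tool-call log and answer file would already contain, which matches the claim that the audit needs no additional annotations. Exact-match normalization of stems is a deliberately simple choice; a real audit might need a stricter or looser definition of a duplicate.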
License
Copyright (c) 2026 Hilmi Demirhan, Ruize Ruize Ma, Tianqi Wang, Wlodek Zadrozny

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.