Error Analysis of Agentic Tool-Augmented Reasoning in LLMs on NeurIPS CURE-Bench Challenge

Authors

  • Hilmi Demirhan, University of North Carolina Wilmington
  • Ruize Ma
  • Tianqi Wang
  • Wlodek Zadrozny

DOI:

https://doi.org/10.32473/flairs.39.1.141859

Keywords:

AI in healthcare, Healthcare Informatics, Agentic AI, Clinical decision support systems, Large Language Models

Abstract

Agentic tool-augmented large language models are a common design for therapeutic decision support. During inference, the model can issue structured calls to external resources, such as drug label endpoints, and then cite the returned evidence. In a clinical workflow, accuracy is necessary but not sufficient. In this paper, we report an error analysis of submission artifacts from the NeurIPS CURE-Bench challenge. This audit builds on our prior plan-act-verify submission, which reached 0.696 accuracy on the hidden test set. We audited 2,079 test questions using the tool call logs, final answer letters, and free-text explanations.

Our audit highlights three reliability gaps that matter in practice. Tool calling fails at scale: nearly all failures stem from missing required parameters, with 342,515 such cases accounting for more than 99 percent of all failures. Duplicated question stems were extremely unstable, with 154 out of 155 repeats receiving different answer letters. The audit is quantitative and can be implemented with standard logs, without additional annotations. We translate these findings into a deployment audit protocol for healthcare AI systems. The protocol emphasizes interface validation for tool calls, consistency checks on repeated questions, and evidence-first monitoring, in which tool outcomes are treated as first-class outputs.
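
The two log-based checks named in the abstract, interface validation for tool calls and consistency checks on repeated questions, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The tool name `get_drug_label`, the `REQUIRED_PARAMS` schema table, the record fields (`tool`, `arguments`, `stem`, `answer`), and the rule that a stem counts as unstable when its repeats receive more than one distinct letter are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical schema table: tool name -> names of its required parameters.
REQUIRED_PARAMS = {
    "get_drug_label": {"drug_name"},
}


def missing_required(call):
    """Return the sorted required parameters absent from one logged tool call."""
    required = REQUIRED_PARAMS.get(call["tool"], set())
    return sorted(required - set(call.get("arguments", {})))


def unstable_stems(records):
    """Return repeated stems whose repeats received more than one answer letter.

    Stems are compared after lowercasing and collapsing whitespace
    (an assumed normalization).
    """
    letters = defaultdict(set)
    counts = defaultdict(int)
    for record in records:
        stem = " ".join(record["stem"].lower().split())
        letters[stem].add(record["answer"])
        counts[stem] += 1
    return sorted(s for s in letters if counts[s] > 1 and len(letters[s]) > 1)
```

Both functions read only fields that an ordinary tool call log and answer file would plausibly contain, which matches the abstract's point that the audit needs no additional annotations.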

Published

06-05-2026

How to Cite

Demirhan, H., Ma, R., Wang, T., & Zadrozny, W. (2026). Error Analysis of Agentic Tool-Augmented Reasoning in LLMs on NeurIPS CURE-Bench Challenge. The International FLAIRS Conference Proceedings, 39(1). https://doi.org/10.32473/flairs.39.1.141859

Section

Special Track: AI in Healthcare Informatics