Have (A)I Seen this Before?

Exploring LLM Metacognition Using Self-Assessed Rankings and Scoring

Authors

  • Anura Deshpande, University of Southern California
  • Celine Cerezci
  • Benjamin Nye
  • Mark G. Core
  • Suvaditya Mukherjee
  • Joshua Shay Kricheli

DOI:

https://doi.org/10.32473/flairs.39.1.141862

Keywords:

Large Language Model, Metacognition, Self-Assessment, Evaluation, Learning, Biologically Inspired Computing

Abstract

Large Language Models (LLMs) commonly report high confidence, even in domains where their underlying knowledge or training data is limited. This mismatch can undermine model reliability, particularly in educational applications where users may not recognize errors. To detect these knowledge gaps, LLM knowledge must be assessed after training. In this work, we compare two ways of prompting an LLM to self-assess its knowledge of content: rank-ordering and direct confidence scoring (e.g., 1-5). For human metacognition, rankings or A/B comparisons are more reliable, so we hypothesize that LLMs’ rankings may also be more effective than scores. We compare LLM-generated Overall Rankings and confidence scores for 15 topics against two external estimates of LLM knowledge: expert human ratings and search result counts from Bing, Google, and Wikipedia. We also consider Anchored Rankings, in which each document to be rated by the LLM is compared to a set of documents with known expert scores. Across different document representations and different LLMs, Spearman correlations with expert ratings are generally positive and relatively high, with Anchored Rankings having the highest correlation (ρ ranging from 0.74 to 1.0), followed by Overall Rankings and then confidence scores. In contrast, search-based signals show weaker and more variable alignment, suggesting that web popularity is a noisy signal for estimating LLM familiarity with content. Overall, these findings suggest that relative self-assessment through rankings provides an interpretable signal of LLM self-knowledge. This signal can be used to select specialized prompts or workflows for topics where an LLM has less knowledge.
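
The comparison described in the abstract rests on Spearman rank correlation between an LLM's self-assessment and an external estimate of its knowledge. The following is a minimal sketch of that calculation, not the authors' implementation. The helper names (`average_ranks`, `spearman_rho`) and all numbers are hypothetical and chosen only for illustration; they are not data from the paper. The sketch assumes that rank 1 means "most familiar", so rank positions are negated before being compared with expert ratings, where higher means more knowledge.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values for five topics (illustration only, not from the paper).
expert_ratings = [5, 4, 4, 2, 1]        # higher = expert judges the LLM knows more
llm_rank_positions = [1, 2, 3, 5, 4]    # 1 = LLM ranks the topic as most familiar

# Negate rank positions so that larger values mean "more familiar" on both sides.
llm_familiarity = [-r for r in llm_rank_positions]
rho = spearman_rho(expert_ratings, llm_familiarity)
print(f"Spearman rho = {rho:.2f}")
```

The same function could be applied to confidence scores or to search result counts in place of `llm_familiarity`, which is how the different signals would be placed on a common scale for comparison under these assumptions.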

Published

06-05-2026

How to Cite

Deshpande, A., Cerezci, C., Nye, B., Core, M. G., Mukherjee, S., & Kricheli, J. S. (2026). Have (A)I Seen this Before? Exploring LLM Metacognition Using Self-Assessed Rankings and Scoring. The International FLAIRS Conference Proceedings, 39(1). https://doi.org/10.32473/flairs.39.1.141862