Beyond Accuracy

Performance and Behavioral Evaluation of Multimodal AI for Suspicious Aerial Traffic Monitoring

Authors

  • Gabriel Dietzsch, INPE / IEAv
  • Elcio Hideiti Shiguemori

DOI:

https://doi.org/10.32473/flairs.39.1.141390

Keywords:

Multimodal AI, Aerial Traffic Monitoring, Behavioral Evaluation, Confidence Calibration, Suspicious Flight Detection, Human-in-the-Loop

Abstract

This paper evaluates multimodal AI models (Gemini and ChatGPT) for visual-based aerial trajectory classification, comparing them against a neural network baseline. Beyond standard metrics, we introduce behavioral criteria to assess reliability in safety-critical contexts. Results show that multimodal AI significantly outperforms the baseline (96% vs. 75% accuracy). However, higher accuracy did not equate to safer behavior. The top-performing model displayed systematic overconfidence and conceptual hallucinations, while the second model exhibited "useful doubt," effectively communicating uncertainty in ambiguous cases. We conclude that while high-recall models suit automated pipelines, uncertainty-aware models are superior for human-in-the-loop scenarios. This work demonstrates that behavioral evaluation, including confidence calibration and semantic interpretation, is crucial for deploying multimodal AI in aerial surveillance.
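
The abstract names confidence calibration as one of the behavioral criteria but does not say how it is measured. As an illustration only, the sketch below computes expected calibration error (ECE), a common calibration metric. It is an assumption that a metric of this kind applies here; the function name, the bin count, and the sample inputs are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Return the expected calibration error (ECE) of a set of predictions.

    confidences: the model's stated confidence per prediction, in [0, 1].
    correct:     whether each prediction matched the ground-truth label.
    n_bins:      number of equal-width confidence bins (hypothetical default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    # ECE is the size-weighted mean gap between confidence and accuracy.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Hypothetical inputs: an overconfident model states 0.99 on every case
# but is right only half the time, so its ECE is large.
overconfident = expected_calibration_error(
    [0.99, 0.99, 0.99, 0.99], [True, False, True, False]
)
```

Under this metric, a model that is often wrong at high stated confidence scores a large ECE, while a model whose stated confidence tracks its actual hit rate scores near zero. That is one way to quantify the contrast the abstract draws between overconfidence and "useful doubt."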

Published

06-05-2026

How to Cite

Dietzsch, G., & Shiguemori, E. H. (2026). Beyond Accuracy: Performance and Behavioral Evaluation of Multimodal AI for Suspicious Aerial Traffic Monitoring. The International FLAIRS Conference Proceedings, 39(1). https://doi.org/10.32473/flairs.39.1.141390

Section

Special Track: Semantic, Logics, Information Extraction and AI