Beyond Accuracy
Performance and Behavioral Evaluation of Multimodal AI for Suspicious Aerial Traffic Monitoring
DOI:
https://doi.org/10.32473/flairs.39.1.141390
Keywords:
Multimodal AI, Aerial Traffic Monitoring, Behavioral Evaluation, Confidence Calibration, Suspicious Flight Detection, Human-in-the-Loop
Abstract
This paper evaluates multimodal AI models (Gemini and ChatGPT) for vision-based aerial trajectory classification, comparing them against a neural network baseline. Beyond standard metrics, we introduce behavioral criteria to assess reliability in safety-critical contexts. Results show that multimodal AI significantly outperforms the baseline (96% vs. 75% accuracy). However, higher accuracy did not translate into safer behavior. The top-performing model displayed systematic overconfidence and conceptual hallucinations, while the second model exhibited "useful doubt," effectively communicating uncertainty in ambiguous cases. We conclude that while high-recall models suit automated pipelines, uncertainty-aware models are better suited to human-in-the-loop scenarios. This work demonstrates that behavioral evaluation, including confidence calibration and semantic interpretation, is crucial for deploying multimodal AI in aerial surveillance.
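To make the notion of "systematic overconfidence" concrete, the following is a minimal sketch of two common calibration measures: the gap between mean stated confidence and accuracy, and the expected calibration error (ECE). It is an illustration only, not the evaluation procedure used in the paper. The function names, the ten-bin default, and the assumption that each model reports a confidence in [0, 1] per classification are all hypothetical.

```python
from typing import Sequence


def overconfidence_gap(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean stated confidence minus observed accuracy.

    A positive value means the model claims more certainty than its
    accuracy supports. (Hypothetical helper, for illustration only.)
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(1 for ok in correct if ok) / len(correct)
    return mean_conf - accuracy


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted average of |confidence - accuracy| over equal-width bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - acc)
    return ece
```

Under these assumptions, a model that reports 0.95 confidence on every trajectory but is correct on only three of four would show a positive gap of roughly 0.20, whereas a model whose stated confidence tracks its accuracy would score near zero on both measures.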
License
Copyright (c) 2026 GABRIEL DIETZSCH, SHIGUEMORI

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.