Evaluating Mistral 7B Instruct Jailbreak Vulnerabilities
DOI:
https://doi.org/10.32473/flairs.39.1.141801
Keywords:
Large Language Models, Artificial Intelligence
Abstract
This paper presents a systematic evaluation of jailbreak vulnerabilities in the Mistral 7B Instruct V3 model using 3,200 text-based adversarial prompts from the JailBreakV-28K benchmark. To address the challenge of accurate jailbreak detection, we implement a multi-classifier ensemble refusal system combining three state-of-the-art refusal classifiers with majority voting, alongside a custom embedding-based refusal analyzer trained to categorize responses across sixteen safety policy domains. Our results reveal that Mistral-7B exhibits substantially higher vulnerability than contemporary models, with an average Attack Success Rate of 74.2% and critical weaknesses in Privacy Violation (91.0%), Child Abuse Content (87.5%), and Political Sensitivity (86.5%). The custom classifier achieved 89.66% validation accuracy in categorizing refusals according to the "cannot" vs. "should not" taxonomy, revealing a balanced distribution between capability-based (51.15%) and policy-based (48.85%) refusals. These findings highlight critical gaps in current safety alignment strategies and demonstrate the importance of ensemble-based refusal classification for reliable security evaluation, providing a framework for targeted defensive improvements against large-scale jailbreak attacks.
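The majority-voting step described in the abstract can be illustrated with a minimal sketch. It assumes that each refusal classifier returns a boolean verdict and that Attack Success Rate is the fraction of responses the ensemble does not label as refusals; the abstract does not spell out either detail. The three toy classifiers and all function names below are hypothetical stand-ins, not the classifiers used in the paper.

```python
from typing import Callable, Sequence

# Assumption: a refusal classifier maps a model response to True when it
# judges the response to be a refusal.
RefusalClassifier = Callable[[str], bool]


def by_prefix(response: str) -> bool:
    """Toy classifier: flags responses that open with a stock refusal phrase."""
    return response.lower().startswith(("i cannot", "i can't", "i'm sorry"))


def by_keyword(response: str) -> bool:
    """Toy classifier: flags responses containing 'cannot' or 'should not'."""
    text = response.lower()
    return "cannot" in text or "should not" in text


def by_length(response: str) -> bool:
    """Toy classifier: treats very short responses as refusals."""
    return len(response.split()) < 6


def majority_refusal(response: str, classifiers: Sequence[RefusalClassifier]) -> bool:
    """Return True when a strict majority of classifiers vote 'refusal'."""
    votes = sum(1 for clf in classifiers if clf(response))
    return votes * 2 > len(classifiers)


def attack_success_rate(
    responses: Sequence[str], classifiers: Sequence[RefusalClassifier]
) -> float:
    """Fraction of responses the ensemble does NOT label as refusals (assumed ASR definition)."""
    if not responses:
        raise ValueError("responses must not be empty")
    successes = sum(1 for r in responses if not majority_refusal(r, classifiers))
    return successes / len(responses)


ENSEMBLE = (by_prefix, by_keyword, by_length)
```

With an odd number of classifiers a strict majority always exists, so no tie-breaking rule is needed; that is one plausible reason to use three voters.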
Published
06-05-2026
How to Cite
Jamshidi, S., Ahmady Phoulady, H., & Moni, S. S. (2026). Evaluating Mistral 7B Instruct Jailbreak Vulnerabilities. The International FLAIRS Conference Proceedings, 39(1). https://doi.org/10.32473/flairs.39.1.141801
Section
Special Track: Security, Privacy and Ethics in AI
License
Copyright (c) 2026 Sina Jamshidi, Hady Ahmady Phoulady, Shafika Showkat Moni

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.