Evaluating Mistral 7B Instruct Jailbreak Vulnerabilities

Authors

  • Sina Jamshidi Embry–Riddle Aeronautical University
  • Hady Ahmady Phoulady
  • Shafika Showkat Moni

DOI:

https://doi.org/10.32473/flairs.39.1.141801

Keywords:

Large Language Models, Artificial Intelligence

Abstract

This paper presents a systematic evaluation of jailbreak vulnerabilities in the Mistral 7B Instruct V3 model using 3,200 text-based adversarial prompts from the JailBreakV-28K benchmark. To address the challenge of accurate jailbreak detection, we implement a multi-classifier ensemble refusal system combining three state-of-the-art refusal classifiers with majority voting, alongside a custom embedding-based refusal analyzer trained to categorize responses across sixteen safety policy domains. Our results reveal that Mistral-7B exhibits substantially higher vulnerability than contemporary models, with an average Attack Success Rate of 74.2% and critical weaknesses in Privacy Violation (91.0%), Child Abuse Content (87.5%), and Political Sensitivity (86.5%). The custom classifier achieved 89.66% validation accuracy in categorizing refusals according to the "cannot" vs. "should not" taxonomy, revealing a balanced distribution between capability-based (51.15%) and policy-based (48.85%) refusals. These findings highlight critical gaps in current safety alignment strategies and demonstrate the importance of ensemble-based refusal classification for reliable security evaluation, providing a framework for targeted defensive improvements against large-scale jailbreak attacks.

Downloads

Published

06-05-2026

How to Cite

Jamshidi, S., Ahmady Phoulady, H., & Moni, S. S. (2026). Evaluating Mistral 7B Instruct Jailbreak Vulnerabilities. The International FLAIRS Conference Proceedings, 39(1). https://doi.org/10.32473/flairs.39.1.141801

Issue

Section

Special Track: Security, Privacy and Ethics in AI