Domain-Specificity of Refusal Representations in Large Language Models

Authors

C. Tucker, M. Leon

DOI:

https://doi.org/10.32473/flairs.39.1.141620

Keywords:

AI Alignment, Interpretability, Activation Engineering, Safety Fine-Tuning, Behavior Representation

Abstract

Modern large language models (LLMs) are trained with a variety of techniques to refuse prompts that would elicit harmful output. Recent work has shown that a model's likelihood of refusing a prompt is mediated by a single direction in its activation space. We investigate the domain specificity of refusal representations by extracting refusal directions from distinct knowledge domains, comparing them, and analyzing how activation magnitude along these directions varies with prompt category. These experiments give insight into how models learn to refuse during post-training. By demonstrating that the refusal direction is universal rather than domain-specific, we highlight a systemic vulnerability: removing a single geometric feature compromises safety guardrails across all knowledge domains.
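
The comparison described in the abstract can be illustrated with a minimal sketch. It assumes a difference-of-means estimate of the refusal direction, which is common in prior work on refusal directions but is not confirmed by the abstract as the authors' method. The activations are synthetic stand-ins for model activations, and the domain names, dimensions, and helper functions are hypothetical.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the mean harmless to the mean harmful activation.

    Assumption: a difference-of-means estimate; the paper may use another method.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def cosine_similarity(u, v):
    """Cosine of the angle between two direction vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def projection_magnitude(acts, direction):
    """Scalar projection of each activation row onto a unit direction."""
    return acts @ direction

def ablate(acts, direction):
    """Remove the component of each activation row along a unit direction."""
    return acts - np.outer(acts @ direction, direction)

# Synthetic data: every domain shares one underlying "refusal" axis plus noise.
rng = np.random.default_rng(0)
d_model = 64  # hypothetical activation width
shared_axis = rng.normal(size=d_model)
shared_axis /= np.linalg.norm(shared_axis)

def fake_activations(n, shift):
    """Gaussian noise shifted along the shared axis (stand-in for real activations)."""
    return rng.normal(size=(n, d_model)) + shift * shared_axis

domains = ["chemistry", "cybersecurity"]  # hypothetical domain labels
harmful = {name: fake_activations(200, 4.0) for name in domains}
harmless = {name: fake_activations(200, 0.0) for name in domains}
directions = {
    name: refusal_direction(harmful[name], harmless[name]) for name in domains
}

# Cross-domain agreement: a value near 1 suggests a shared direction.
similarity = cosine_similarity(directions["chemistry"], directions["cybersecurity"])

# Activation magnitude along one domain's direction, measured on another domain.
harmful_proj = projection_magnitude(harmful["cybersecurity"], directions["chemistry"])
harmless_proj = projection_magnitude(harmless["cybersecurity"], directions["chemistry"])

# Ablating the direction removes that component from every activation.
ablated = ablate(harmful["cybersecurity"], directions["chemistry"])

print(f"cross-domain cosine similarity: {similarity:.3f}")
print(f"mean projection (harmful):  {harmful_proj.mean():.3f}")
print(f"mean projection (harmless): {harmless_proj.mean():.3f}")
```

In this toy setup the two domain directions agree by construction, so the sketch only shows the mechanics of the comparison. With real model activations, a high cross-domain cosine similarity would be the evidence for a universal direction, while a low one would point toward domain-specific refusal representations.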

Published

06-05-2026

How to Cite

Tucker, C., & Leon, M. (2026). Domain-Specificity of Refusal Representations in Large Language Models. The International FLAIRS Conference Proceedings, 39(1). https://doi.org/10.32473/flairs.39.1.141620