Domain-Specificity of Refusal Representations in Large Language Models
DOI:
https://doi.org/10.32473/flairs.39.1.141620
Keywords:
AI Alignment, Interpretability, Activation Engineering, Safety Fine-Tuning, Behavior representation
Abstract
Modern Large Language Models are trained using a variety of techniques to reject prompts that lead to harmful output. Recent work has shown that a model's likelihood of refusing a prompt is mediated by a single direction in its activation space. We investigate the domain specificity of refusal representations by extracting and comparing refusal directions across distinct knowledge domains, and by analyzing how activation magnitude along these directions varies with prompt category. These experiments give insight into how models learn to refuse during post-training. By demonstrating the universality of this refusal direction, we highlight a systemic vulnerability: removing a single geometric feature compromises safety guardrails globally, across all distinct knowledge domains.
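The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes a difference-of-means estimate of the refusal direction (a common choice in prior work, not confirmed by this abstract), and it uses synthetic activations in place of real model activations. All function names, the dimension `d_model`, and the `shift` parameter are hypothetical.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the mean harmless to the mean harmful activation.

    Assumption: difference-of-means extraction; the paper may use another method.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def cosine_similarity(u, v):
    """Cosine of the angle between two direction vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def projection_magnitude(acts, direction):
    """Scalar projection of each activation row onto a unit direction."""
    return acts @ direction

def ablate_direction(acts, direction):
    """Remove the component along a unit direction from every activation row."""
    return acts - np.outer(acts @ direction, direction)

# Synthetic stand-in for model activations (hypothetical data, not real results).
rng = np.random.default_rng(0)
d_model = 64
shared = rng.normal(size=d_model)
shared /= np.linalg.norm(shared)

def synthetic_domain(n=200, shift=4.0):
    """Harmful rows are shifted along one shared axis; harmless rows are not."""
    harmless = rng.normal(size=(n, d_model))
    harmful = rng.normal(size=(n, d_model)) + shift * shared
    return harmful, harmless

harmful_a, harmless_a = synthetic_domain()  # stands in for domain A
harmful_b, harmless_b = synthetic_domain()  # stands in for domain B

dir_a = refusal_direction(harmful_a, harmless_a)
dir_b = refusal_direction(harmful_b, harmless_b)

# Cross-domain agreement between the two extracted directions.
similarity = cosine_similarity(dir_a, dir_b)

# Activation magnitude along domain A's direction, measured on domain B prompts.
mean_proj_harmful = projection_magnitude(harmful_b, dir_a).mean()
mean_proj_harmless = projection_magnitude(harmless_b, dir_a).mean()

# After ablation, no component along dir_a remains (up to float error).
ablated_b = ablate_direction(harmful_b, dir_a)
residual = np.abs(projection_magnitude(ablated_b, dir_a)).max()
```

In this toy setup the two domains share one axis by construction, so a high cosine similarity is expected; with real activations, the degree of cross-domain agreement is the empirical question the abstract addresses.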
License
Copyright (c) 2026 Charlie Tucker, Maikel Leon

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.