Quantifying Modality Contributions via Disentangling Multimodal Representations
DOI: https://doi.org/10.32473/flairs.39.1.141869
Abstract
Quantifying modality contributions in Vision-Language Models (VLMs) remains challenging. Existing approaches rely on perturbation or gradient-based methods, which conflate inherent modality informativeness with model-specific biases and fail to capture complex cross-modal interactions. We address this gap by introducing an information-theoretic framework based on Partial Information Decomposition (PID) that decomposes internal representations into unique, redundant, and synergistic components. Our method operates directly on internal embeddings and derives an inference-only modality contribution metric from unique information scores. Applying our framework to six modern VLMs across six benchmarks, we uncover a persistent imbalance in modality contributions driven by low cross-modal synergy. Analysis reveals that fusion architecture significantly impacts the distribution of unique, redundant, and synergistic information. Our framework provides a scalable diagnostic tool for understanding and improving multimodal integration in vision-language systems.
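As a rough illustration of what a decomposition into unique, redundant, and synergistic information looks like, the sketch below computes a two-source PID on small discrete toy distributions. It is not the paper's estimator: it assumes the simple minimum-mutual-information (MMI) redundancy, in which redundancy is the smaller of the two single-source mutual informations, and it operates on explicit probability tables rather than on VLM embeddings. The function names (`mutual_information`, `pid_mmi`, `contribution_share`) and the share formula U1 / (U1 + U2) are hypothetical, chosen only to show how a contribution score could be derived from unique information.

```python
import math
from collections import defaultdict


def mutual_information(joint, src_idx, tgt_idx):
    """Mutual information I(S;T) in bits.

    `joint` maps outcome tuples to probabilities; `src_idx` and `tgt_idx`
    select which tuple positions form S and T.
    """
    p_st = defaultdict(float)
    p_s = defaultdict(float)
    p_t = defaultdict(float)
    for outcome, p in joint.items():
        s = tuple(outcome[i] for i in src_idx)
        t = tuple(outcome[i] for i in tgt_idx)
        p_st[(s, t)] += p
        p_s[s] += p
        p_t[t] += p
    return sum(
        p * math.log2(p / (p_s[s] * p_t[t]))
        for (s, t), p in p_st.items()
        if p > 0
    )


def pid_mmi(joint):
    """Two-source PID over outcomes (x1, x2, y), assuming MMI redundancy."""
    i1 = mutual_information(joint, (0,), (2,))
    i2 = mutual_information(joint, (1,), (2,))
    i12 = mutual_information(joint, (0, 1), (2,))
    redundant = min(i1, i2)          # assumption: MMI redundancy measure
    unique1 = i1 - redundant
    unique2 = i2 - redundant
    # The four atoms sum to the joint mutual information I(X1, X2; Y).
    synergy = i12 - redundant - unique1 - unique2
    return {
        "redundant": redundant,
        "unique1": unique1,
        "unique2": unique2,
        "synergy": synergy,
    }


def contribution_share(pid):
    """Hypothetical contribution score for source 1: U1 / (U1 + U2)."""
    total = pid["unique1"] + pid["unique2"]
    return 0.5 if total == 0 else pid["unique1"] / total


# Toy distributions over (x1, x2, y) with uniform, independent binary inputs.
xor_joint = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}
copy_joint = {(a, b, a): 0.25 for a in (0, 1) for b in (0, 1)}

xor_pid = pid_mmi(xor_joint)    # y = x1 XOR x2: information is purely synergistic
copy_pid = pid_mmi(copy_joint)  # y = x1: information is unique to source 1
```

In the XOR case neither input alone says anything about the output, so all of the information lands in the synergy term. In the copy case the output depends on one input only, so that input holds all of the unique information and receives the full contribution share. An imbalance of this second kind, combined with low synergy, is the pattern the abstract reports for VLMs.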
License
Copyright (c) 2026 Padegal Amit, Omkar Mahesh Kahsyap, Namitha Rayasam, Nidhi Shekhar, Surabhi Narayan

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.