Quantifying Modality Contributions via Disentangling Multimodal Representations

Authors

  • Padegal Amit, PES University
  • Omkar Mahesh Kashyap
  • Namitha Rayasam
  • Nidhi Shekhar
  • Surabhi Narayan

DOI:

https://doi.org/10.32473/flairs.39.1.141869

Abstract

Quantifying modality contributions in Vision-Language Models (VLMs) remains challenging. Existing approaches rely on perturbation or gradient-based methods, which conflate inherent modality informativeness with model-specific biases and fail to capture complex cross-modal interactions. We address this gap by introducing an information-theoretic framework based on Partial Information Decomposition (PID) that decomposes internal representations into unique, redundant, and synergistic components. Our method operates directly on internal embeddings and derives an inference-only modality contribution metric from unique information scores. Applying our framework to six modern VLMs across six benchmarks, we uncover a persistent imbalance in modality contributions driven by low cross-modal synergy. Analysis reveals that fusion architecture significantly impacts the distribution of unique, redundant, and synergistic information. Our framework provides a scalable diagnostic tool for understanding and improving multimodal integration in vision-language systems.
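As a rough illustration of the unique, redundant, and synergistic components named in the abstract, the standard PID identities for two sources X1, X2 and a target Y are I(X1,X2;Y) = R + U1 + U2 + S, I(X1;Y) = R + U1, and I(X2;Y) = R + U2. The sketch below is not the authors' estimator, which works on VLM embeddings and is not described on this page. It is a toy discrete case, Y = X1 XOR X2, where x1 and x2 are hypothetical stand-ins for an image feature and a text feature, and the helper name mutual_information is an assumption made for illustration. XOR is chosen because both single-source informations are zero, so the non-negativity of the PID terms fixes the decomposition without needing a specific redundancy measure.

```python
import math
from collections import defaultdict
from itertools import product


def mutual_information(joint, source_idx, target_idx):
    """Return I(S;T) in bits for a joint distribution over outcome tuples.

    `joint` maps outcome tuples to probabilities; `source_idx` and
    `target_idx` select which tuple positions form S and T.
    (Hypothetical helper for this illustration only.)
    """
    p_st = defaultdict(float)
    p_s = defaultdict(float)
    p_t = defaultdict(float)
    for outcome, p in joint.items():
        s = tuple(outcome[i] for i in source_idx)
        t = tuple(outcome[i] for i in target_idx)
        p_st[(s, t)] += p
        p_s[s] += p
        p_t[t] += p
    return sum(
        p * math.log2(p / (p_s[s] * p_t[t]))
        for (s, t), p in p_st.items()
        if p > 0
    )


# Toy joint distribution over (x1, x2, y) with y = x1 XOR x2 and uniform inputs.
xor_joint = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}

i_x1 = mutual_information(xor_joint, (0,), (2,))       # I(X1;Y) = R + U1
i_x2 = mutual_information(xor_joint, (1,), (2,))       # I(X2;Y) = R + U2
i_joint = mutual_information(xor_joint, (0, 1), (2,))  # I(X1,X2;Y) = R + U1 + U2 + S

# Both single-source informations are zero here, so with non-negative PID
# terms R = U1 = U2 = 0 and all joint information is synergy.
# This shortcut holds only for this toy case; in general a redundancy
# measure is needed to separate the four terms.
synergy = i_joint - i_x1 - i_x2

print(f"I(X1;Y)={i_x1:.3f}  I(X2;Y)={i_x2:.3f}  I(X1,X2;Y)={i_joint:.3f}  S={synergy:.3f}")
```

In this toy case neither source alone says anything about the target, yet together they determine it fully. That is the opposite extreme from the low-synergy regime the abstract reports for the evaluated VLMs.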

Published

06-05-2026

How to Cite

Padegal Amit, Kashyap, O. M., Namitha Rayasam, Nidhi Shekhar, & Surabhi Narayan. (2026). Quantifying Modality Contributions via Disentangling Multimodal Representations. The International FLAIRS Conference Proceedings, 39(1). https://doi.org/10.32473/flairs.39.1.141869