Shop-The-Room
A Zero-Shot Foundation Model Framework for Visual Discovery in E-Commerce
DOI:
https://doi.org/10.32473/flairs.39.1.141898
Keywords:
E-commerce, Recommender System, Zero-Shot, Visual Product Discovery, Grounding-DINO, CLIP, Object Detection, Feature Extraction
Abstract
Visual product discovery systems have become integral to major e-commerce platforms, enabling customers to identify visually similar items from complex scene imagery. Traditionally, such systems have relied on a supervised pipeline comprising object detection, feature extraction, and nearest-neighbor retrieval. However, building these systems at scale requires frequent and extensive model training on large amounts of annotated data, which is both cost-prohibitive and labor-intensive, particularly for small and medium enterprises managing dynamic inventories. The advent of pre-trained foundation models, characterized by their capability for zero-shot transfer, presents a compelling alternative that eliminates the need for domain-specific model training and labeled annotations. In this work, we demonstrate the implementation of a scene-based visual shopping system called Shop-The-Room, built on state-of-the-art foundation models, at a major US online retailer. We detail the proposed framework, implementation details, pitfalls, and lessons learned. Finally, we present the results of both quantitative and qualitative evaluations to validate the system’s efficacy in a real-world setting at Bed Bath & Beyond.
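The last stage of the pipeline named in the abstract, nearest-neighbor retrieval over extracted features, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the three-dimensional vectors are toy stand-ins for real image embeddings, the product identifiers and the helper names `l2_normalize` and `cosine_top_k` are hypothetical, and a brute-force cosine-similarity scan is used here in place of whatever index a production system would use.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine_top_k(query, catalog, k=2):
    """Return the k catalog items most similar to the query embedding.

    `catalog` maps a (hypothetical) product id to its embedding.
    This is an exhaustive scan, suitable only for small catalogs.
    """
    q = l2_normalize(query)
    scored = []
    for product_id, embedding in catalog.items():
        e = l2_normalize(embedding)
        similarity = sum(a * b for a, b in zip(q, e))
        scored.append((similarity, product_id))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(product_id, similarity) for similarity, product_id in scored[:k]]

# Toy embeddings (assumed values, not model output).
catalog = {
    "sofa_grey": [0.9, 0.1, 0.0],
    "sofa_blue": [0.8, 0.3, 0.1],
    "lamp_brass": [0.0, 0.2, 0.9],
}

# Stand-in for the embedding of one object cropped from a room scene.
query_crop_embedding = [1.0, 0.1, 0.0]

results = cosine_top_k(query_crop_embedding, catalog, k=2)
for product_id, similarity in results:
    print(product_id, round(similarity, 3))
```

In a full system of the kind the abstract describes, the query vector would come from a feature extractor applied to each detected object crop, and the catalog vectors would be precomputed from product images; only the ranking step is shown here.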
License
Copyright (c) 2026 Vaidyanath Areyur Shanthakumar, Clark Barnett, Vipul Mehra, Komson Chanprapan, Ravi Shankar, Tathagata Mukherjee

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.