Specifically, we use a CLIP-based video understanding framework to demonstrate the proposed approach, which dynamically adjusts the model’s representation space using the knowledge distribution …
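The exact adjustment mechanism is not spelled out in this excerpt; below is a minimal sketch of one common realization, where each sample's distillation weight is derived from the entropy of the teacher's output distribution, so confident regions of the teacher's knowledge pull the student's representation space harder. The function name `entropy_weighted_kd_loss` and the temperature `T` are illustrative assumptions, not the authors' API.

```python
import torch
import torch.nn.functional as F

def entropy_weighted_kd_loss(student_logits, teacher_logits, T=4.0):
    """Hypothetical sketch: weight each sample's KD loss by teacher confidence."""
    p_t = F.softmax(teacher_logits / T, dim=-1)          # teacher distribution
    log_p_s = F.log_softmax(student_logits / T, dim=-1)  # student log-probs
    # Per-sample KL divergence (standard KD objective, scaled by T^2)
    kl = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=-1) * T * T
    # Teacher entropy, normalized by its maximum, gives a confidence weight in [0, 1]
    entropy = -(p_t * p_t.clamp_min(1e-8).log()).sum(dim=-1)
    weight = 1.0 - entropy / torch.log(torch.tensor(float(p_t.size(-1))))
    return (weight.detach() * kl).mean()

# Example: batch of 8 samples, 100 classes
s = torch.randn(8, 100, requires_grad=True)
t = torch.randn(8, 100)
entropy_weighted_kd_loss(s, t).backward()
```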
This project implements knowledge distillation from DINOv2 (a Vision Transformer teacher) to convolutional student networks, enabling efficient visual representation learning at reduced computational cost.
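The project's actual training code is not shown here; the following is a minimal sketch of feature-level distillation from a frozen DINOv2 teacher (loaded via its published `facebookresearch/dinov2` torch.hub entry point) into a ResNet-18 student. The projection head, cosine objective, and hyperparameters are illustrative choices, not necessarily those used by the project.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

# Frozen ViT teacher; dinov2_vits14 emits 384-dim global features
teacher = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# CNN student: ResNet-18 trunk whose classifier is replaced by a
# projection into the teacher's embedding space
student = resnet18(weights=None)
student.fc = nn.Linear(student.fc.in_features, 384)

optimizer = torch.optim.AdamW(student.parameters(), lr=3e-4)

def distill_step(images):
    """One step of feature distillation: align student features with
    the teacher embedding via a cosine objective (one common choice)."""
    with torch.no_grad():
        t_feat = teacher(images)              # (B, 384) teacher embedding
    s_feat = student(images)                  # (B, 384) student projection
    loss = 1.0 - F.cosine_similarity(s_feat, t_feat, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example: DINOv2 expects input sides divisible by the 14-pixel patch size
images = torch.randn(4, 3, 224, 224)
print(distill_step(images))
```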
Experimental results demonstrate that CIKD optimizes the distillation process and effectively exploits CLIP's multi-modal (image-text) information, resulting in enhanced knowledge transfer. Extensive …
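CIKD's exact formulation is not given in this excerpt; as a heavily hedged sketch, one plausible way to draw multi-modal supervision from CLIP is to combine (1) alignment of student features with CLIP's image embedding and (2) matching the student's image-to-text similarity distribution to the teacher's. The Hugging Face `CLIPModel` loading path, the function `multimodal_kd_loss`, and the temperature `tau` are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Frozen CLIP teacher; both the image and text towers supply targets
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
for p in clip.parameters():
    p.requires_grad_(False)

def multimodal_kd_loss(student_feat, images, captions, tau=0.07):
    """Hypothetical sketch of a CLIP-guided multi-modal KD objective."""
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        t_img = F.normalize(clip.get_image_features(
            pixel_values=inputs["pixel_values"]), dim=-1)
        t_txt = F.normalize(clip.get_text_features(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"]), dim=-1)
    s_img = F.normalize(student_feat, dim=-1)           # (B, 512) student features
    feat_loss = (1.0 - (s_img * t_img).sum(-1)).mean()  # image-feature alignment
    # Cross-modal relational term: student vs. teacher image-text similarities
    p_teacher = F.softmax(t_img @ t_txt.t() / tau, dim=-1)
    log_p_student = F.log_softmax(s_img @ t_txt.t() / tau, dim=-1)
    kd_loss = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return feat_loss + kd_loss

# Example usage with placeholder inputs
images = [Image.new("RGB", (224, 224)) for _ in range(2)]
captions = ["a dog running", "a cat sleeping"]
student_feat = torch.randn(2, 512, requires_grad=True)
multimodal_kd_loss(student_feat, images, captions).backward()
```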