MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Unifying Cross-Lingual and Cross-Modal Modeling Towards Weakly Supervised Multilingual Vision-Language Pre-training
Indeed, it would be great if the authors did so. I personally found some unofficial implementations: