Visual language models learn concepts about images using words.
They usually combine two pretrained models:
1. A large language model: a generative encoder-decoder text model, e.g. GPT
2. A CNN vision model pretrained on image labels, e.g. ResNet
Each model's features are fed through a separate set of learned weights (a projection) to produce a latent variable for each.
The difference between these latent variables is then used to further train both models, reducing the difference for matching pairs and thereby aligning the two models in a shared latent space.
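As a rough illustration, the alignment step might look like the sketch below (assuming PyTorch; the encoder feature sizes and the two projection layers are placeholder assumptions, and the loss is a CLIP-style symmetric contrastive loss, not any particular library's exact implementation):

```python
import torch
import torch.nn.functional as F

# Hypothetical feature sizes for the text and image towers, and the shared latent size.
d_text, d_image, d_latent = 512, 2048, 256

# Two separate learned projections ("two different weights"), one per pretrained model.
text_proj = torch.nn.Linear(d_text, d_latent)
image_proj = torch.nn.Linear(d_image, d_latent)

def clip_style_loss(text_features, image_features, temperature=0.07):
    """Contrastive loss that pulls matching text/image pairs together in the shared space."""
    t = F.normalize(text_proj(text_features), dim=-1)    # (batch, d_latent)
    v = F.normalize(image_proj(image_features), dim=-1)  # (batch, d_latent)
    logits = t @ v.T / temperature                       # pairwise text-image similarities
    targets = torch.arange(len(t))                       # i-th text matches i-th image
    # Symmetric cross-entropy: align text-to-image and image-to-text.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```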
The result is zero-shot prediction of image labels for unseen data, as the models have learnt joint visual-language concepts.
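A sketch of how that zero-shot prediction can work in practice: embed each candidate label as a text prompt and pick the one whose embedding is closest to the image's. Here `encode_text` and `encode_image` are hypothetical helpers that return the aligned latent vectors from the sketch above.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image, labels, encode_text, encode_image):
    """Pick the label whose prompt embedding best matches the image embedding."""
    prompts = [f"a photo of a {label}" for label in labels]
    text_emb = F.normalize(torch.stack([encode_text(p) for p in prompts]), dim=-1)
    img_emb = F.normalize(encode_image(image), dim=-1)
    scores = text_emb @ img_emb          # cosine similarity per candidate label
    return labels[scores.argmax().item()]
```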
This is very important for many fields of AI, especially robotics: models based on vision and action, such as those trained with reinforcement learning, can be paired with language models such as ChatGPT, enabling untrained (zero-shot) behaviours, actions and attention to objects in the visual field, because the language model can begin to conceptualise the world the agent inhabits.
When paired with a diffusion encoder-decoder vision model, this method can achieve startling generative images and video, as with Stable Diffusion.
For example, CLIP is trained on text-image pairs in this way, and an image generator such as a diffusion model is then trained using the difference between the CLIP latent variable of the image it creates and the CLIP latent variable of the accompanying text description. Effectively, feeding the output of an image generator into a visual language model like this enables you to create stunning images from prompts.
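A minimal sketch of that "difference between CLIP latents" idea, with hypothetical `clip_encode_image`, `clip_encode_text` and `generator` functions (the names are illustrative, and this shows CLIP-guided generation in general rather than Stable Diffusion's exact training recipe):

```python
import torch
import torch.nn.functional as F

def clip_guidance_loss(generator, prompt, clip_encode_image, clip_encode_text):
    """Loss comparing the CLIP latent of a generated image to the CLIP latent of its prompt."""
    generated = generator(prompt)                              # candidate image from the generator
    img_latent = F.normalize(clip_encode_image(generated), dim=-1)
    txt_latent = F.normalize(clip_encode_text(prompt), dim=-1)
    # Minimising this difference pushes the generated image toward the text description.
    return 1.0 - (img_latent * txt_latent).sum()
```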