In this article, we will implement text-to-image search (searching for an image with a text query) and image-to-image search (searching for an image based on a reference image) using a pre-trained model. The model we will use to calculate the similarity of images and text is inspired by Contrastive Language-Image Pre-training (CLIP), which I cover in another article.
Who is it useful for? Any developer looking to implement image search, data scientists interested in practical applications, and non-technical readers wanting to learn more about AI in practice.
How advanced is this article? It guides you through implementing image search as quickly and simply as possible.
Prerequisites: Basic coding experience.
This article is a companion to my article on “Contrastive Language-Image Pre-training”. Feel free to check it out if you want a more in-depth understanding of the theory:
CLIP models are trained to predict whether an arbitrary caption belongs to an arbitrary image. We will use this general capability to build our image search system. Specifically, we will use CLIP’s image and text encoders to condense each input into a vector, called an embedding, which can be thought of as a summary of that input.
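To make this concrete, here is a minimal sketch of how the two encoders can be used, assuming the Hugging Face transformers implementation of CLIP. The openai/clip-vit-base-patch32 checkpoint and the cat.jpg file path are placeholder choices for illustration, not something fixed by this article:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint; any CLIP-style model with the same interface works.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Condense a text query into an embedding vector with the text encoder.
text_inputs = processor(text=["a photo of a cat"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_embedding = model.get_text_features(**text_inputs)

# Condense an image into an embedding vector with the image encoder
# ("cat.jpg" is a hypothetical local file).
image_inputs = processor(images=Image.open("cat.jpg"), return_tensors="pt")
with torch.no_grad():
    image_embedding = model.get_image_features(**image_inputs)

print(text_embedding.shape, image_embedding.shape)  # e.g. torch.Size([1, 512]) each
```

Note that both encoders project into the same embedding space, which is what makes text and images directly comparable.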
The idea behind CLIP is that similar texts and images will have similar vector embeddings.
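Building on the sketch above, one common way to compare embeddings is cosine similarity: normalize each vector to unit length, then take the dot product. Ranking a collection of stored image embeddings by this score against a query embedding (whether it came from text or from a reference image) is what both search modes boil down to:

```python
import torch.nn.functional as F

# Normalize so the dot product equals cosine similarity.
text_embedding = F.normalize(text_embedding, dim=-1)
image_embedding = F.normalize(image_embedding, dim=-1)

similarity = (text_embedding @ image_embedding.T).item()
print(f"cosine similarity: {similarity:.3f}")  # higher means a closer text-image match
```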