An analysis of the intuition behind the notions of query, key and value in the Transformer architecture, and why they are used.
Recent years have seen the Transformer architecture make waves in the field of natural language processing (NLP), achieving state-of-the-art results in a variety of tasks including machine translation, language modeling and text summarization, as well as in other areas of AI such as vision, speech, and reinforcement learning.
Vaswani et al. (2017) introduced the Transformer in their paper “Attention Is All You Need”, in which they used the self-attention mechanism without any recurrent connections, allowing the model to selectively focus on specific parts of the input sequence.
In particular, previous sequence models, such as recurrent encoder-decoder models, were limited in their ability to capture long-range dependencies and to compute in parallel. In fact, just before the Transformer paper was published in 2017, peak performance in most NLP tasks was achieved by RNNs with an attention mechanism on top, so attention already existed before Transformers. By removing the recurrent part and relying on attention alone, the Transformer architecture addresses these problems: its multi-head attention runs several independent attention mechanisms in parallel.
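As a quick preview of what this attention mechanism computes, here is a minimal sketch (in NumPy, with made-up toy dimensions) of the scaled dot-product attention at the heart of each attention head; the rest of the article unpacks where the queries, keys and values actually come from.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q @ K.T / sqrt(d_k)) @ V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                        # weighted sum of the value vectors

# Toy example: a sequence of 4 tokens, each represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one output vector per token
```

Note that nothing in this computation is sequential: every token attends to every other token in one matrix multiplication, which is what makes the architecture so parallelizable compared to an RNN.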
In this article, we will zoom in on one detail of this architecture, namely the queries, keys and values, and try to build an intuition for why this part is designed the way it is.
Note that this article assumes that you already know some basic concepts of NLP and deep learning, such as embeddings, linear (dense) layers, and in general how a simple neural network works.
First, let’s understand what the attention mechanism is trying to accomplish. For the sake of simplicity, let’s start with a simple case of sequential data to see exactly what the problem is…