A Past, Present, and Future of Attention
Attention in deep learning is, in many ways, similar to attention as humans experience it: when you pay attention to something, you place more importance on the subject at hand. Similarly, attention mechanisms in deep learning, whether used for image processing, natural language processing, speech recognition, or something entirely different, place more importance on some inputs by non-uniformly weighting the contributions of input features.
Though all neural networks end up placing different weights on input features, whether through gradient descent or some other method, the mechanism of attention is marginally different. It started with the proposal of Sequence to Sequence models (Seq2Seq) and the encoder-decoder architecture, independently, by Sutskever et al. (2014)¹ and Cho et al. (2014)².
The Seq2Seq model was proposed by Sutskever et al. for data in the form of sequences, e.g., sentences and paragraphs, and was originally applied to the task of translating from English to French. The model consists of two deep Long Short-Term Memory (LSTM) networks, a type of recurrent neural network (RNN): one maps the input to a fixed-length vector and the other decodes that vector to the target, essentially an encoder-decoder model. Cho et al. proposed the RNN encoder-decoder, which functions in the same way, and applied it to the same task of English to French translation. Both models improved upon the translation architectures of the time. As Cho notes, though, problems arise with this model because a single fixed-length vector (the encoder's final hidden state) is not always adequate to capture the whole input. For some tasks this may be a nonissue, e.g., when the sentences provided are shorter than or equal in length to those seen during training; in an especially long sentence, however, gradients may vanish or blow up during training and the network may not be able to carry sufficient information through to the end of the sentence.
Align and Translate
Bahdanau et al. (2015)³ proposed extending this encoder-decoder architecture to align and translate jointly: each time the model generates a translated word, it searches for the set of positions in the source sentence that contain the most relevant information. A new target word is then predicted from the context vectors associated with those positions and the previously generated target words. This removes the need to compress the entire input sentence into a single fixed-length vector; instead, the input is encoded into a sequence of vectors, which the decoder draws on as needed during translation. Here, the network pays more attention to certain vectors at different points during the decoding process.
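To make the align-and-translate idea a little more concrete, here is a minimal NumPy sketch of a single decoding step with additive attention in the style of Bahdanau et al. It is only an illustration under my own assumptions: the function and parameter names (`additive_attention`, `W_s`, `W_h`, `v`) and the toy sizes are made up for this example, the weights are random rather than learned, and the RNNs that would produce the hidden states are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def additive_attention(decoder_state, encoder_states, W_s, W_h, v):
    """One decoding step of additive attention (hypothetical helper).

    decoder_state:  (d_dec,)    previous decoder hidden state
    encoder_states: (T, d_enc)  one vector per source position
    W_s: (d_att, d_dec), W_h: (d_att, d_enc), v: (d_att,)
    """
    # Alignment score for each source position j: v . tanh(W_s s + W_h h_j)
    scores = np.tanh(encoder_states @ W_h.T + W_s @ decoder_state) @ v  # (T,)
    # Softmax turns the scores into attention weights that sum to 1
    weights = softmax(scores)                                           # (T,)
    # Context vector: attention-weighted sum of the encoder vectors
    context = weights @ encoder_states                                  # (d_enc,)
    return context, weights

rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 5, 8, 6, 4  # toy sizes, chosen arbitrarily
context, weights = additive_attention(
    rng.normal(size=d_dec),
    rng.normal(size=(T, d_enc)),
    rng.normal(size=(d_att, d_dec)),
    rng.normal(size=(d_att, d_enc)),
    rng.normal(size=d_att),
)
print(weights)  # one weight per source position
```

The context vector is recomputed at every decoding step, which is how the decoder ends up attending to different source positions for different target words.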
Xu et al. (2015)⁴ moved away from neural machine translation and instead applied their attention mechanism to the task of captioning an image. Their model is shown below in Figure 2. Results from their attempt are shown below in Figure 3.
Vaswani et al. (2017)⁵ proposed the transformer, in which the LSTMs or other recurrent networks found in the encoders and decoders of earlier attention models are replaced with multi-head self attention, allowing for more computationally efficient processing, whether used for natural language processing or some other task.
Breaking this down, we have two main concepts here: multi-head attention and self attention. At a high level, self attention is how the mechanism relates words in a sequence to one another, in the context of natural language processing. Multi-head attention is essentially multiple attention layers jointly processing representations at different positions. To make things slightly easier to follow, let’s look at the sentence: “The dog was energetic and it jumped down a flight of stairs”. Self attention here would be used to associate the word “it” with “dog”. Multi-head attention gives the transformer the ability to apply the self attention concept to different words concurrently.
How Does The Transformer Work?
Now that multi-head self attention is a bit clearer, let’s review the architecture of the transformer proposed in Attention Is All You Need. Vaswani et al. implement a few key features: a query, a key, and a value; multi-head self attention; and the encoder and decoder.
Starting with the input, we’re given an (input) sequence, and for the sake of simplicity we can consider a sentence as a sequence of words. Self attention by itself has no notion of word order, so before the encoder-decoder system can make sense of the sequence, the sequence must be positionally encoded: the embedding of each token, x, at position, i, is given a unique positional encoding. Next we’ll implement the self attention layer. As mentioned above, self attention (also called intra-attention) aims to measure the encoding of a word in relation to the encoding of another word in the sequence (note that it can be the same word) and gives a new encoding. This is done through the aforementioned query, key, and value matrices. To understand this complex concept, let’s take a simple example: “The dog is black”. Let’s say we have x₁ through x₄ denoting the embeddings of each word in the input sequence. Self attention wants to know how much each embedding relates to the other embeddings. To know exactly how much any two embeddings are related, the mechanism follows the process listed below:
- x₁ will query x₂
- x₂ will then provide a key
- to score how related these two embeddings are, take the dot product of the key and query (which will result in a single number)
- x₁ will then repeat this process for all embeddings and the algorithm will perform a softmax over the resulting scores (this is done to guarantee all scores are bounded while also maintaining a relative difference)
- this process is then repeated for every word in the sequence
- now each embedding has a score relating it to every other word
- now each embedding creates a new representation for itself by aggregating the value vectors of all embeddings in the sequence, weighted by those scores.
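The steps above can be sketched in a few lines of NumPy. This is a minimal sketch rather than a reference implementation: the projection matrices `W_q`, `W_k`, and `W_v` are random stand-ins for learned weights, the sizes are arbitrary, and the division by the square root of the key dimension comes from the paper (it isn't covered in the list above).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (n, d_model); W_q, W_k: (d_model, d_k); W_v: (d_model, d_v)."""
    # Every embedding produces a query, a key, and a value
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    # Dot product of each query with each key: one score per pair of words
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n)
    # Softmax over each row: one set of weights per querying word
    weights = softmax(scores, axis=-1)   # rows sum to 1
    # New representation for each word: weighted sum of all value vectors
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8                # "The dog is black" -> 4 embeddings
X = rng.normal(size=(n, d_model))        # stand-in for x1..x4
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
new_X, weights = self_attention(X, W_q, W_k, W_v)
print(weights.shape, new_X.shape)
```

Row i of `weights` holds how much word i attends to every word in the sentence, itself included.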
Hopefully at this point, the self attention mechanism is slightly clearer. To clear up the multi-head attention part, one can imagine there are multiple sets of query, key, and value matrices. The multi-head self attention mechanism performs the steps outlined above on each set, creating a separate embedding for each. All the embeddings are then concatenated and linearly projected to create a single embedding.
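Here is a rough NumPy sketch of that idea, again with random matrices standing in for learned weights. I'm assuming the common trick of computing one large projection and slicing it into heads, which is equivalent to keeping a separate set of query, key, and value matrices per head; the function name and the sizes are my own choices for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    n, d_model = X.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head): one slice per head
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    # Self attention runs independently inside every head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (heads, n, n)
    heads = softmax(scores, axis=-1) @ V                    # (heads, n, d_head)
    # Concatenate the heads and project back to a single embedding per word
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # (n, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
n, d_model, num_heads = 4, 8, 2
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)
```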
Now with the basic principles clarified, we can dig a little further into the actual architecture.
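One piece I glossed over earlier is what the positional encoding actually looks like. In the paper it is built from sines and cosines of different frequencies and added to the token embeddings. A small sketch of that scheme, with a function name and toy sizes of my own choosing and assuming an even embedding dimension:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix; d_model is assumed to be even."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # (1, d_model / 2)
    # Each pair of dimensions shares a frequency: pos / 10000^(2i / d_model)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# Four positions, e.g. the four words of "The dog is black"
pe = sinusoidal_positional_encoding(seq_len=4, d_model=8)
print(pe.shape)
```

Each row is different, so once `pe` is added to the embeddings, the same word at two different positions no longer looks identical to the attention layers.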
We can break down the transformer into two smaller subfigures on the left and the right. On the left we have the encoder and on the right, the decoder.
We can see the input sequence embedded and positionally encoded as mentioned previously. Each layer of the encoder (the number of layers is specified by Nx on the left of the encoder) contains a multi-head self attention sub-layer, whose output is added to its input and layer normalized, and then fed into a feed-forward network. To break that down a little further, layer normalization normalizes each input across all of its features, as opposed to normalizing each feature across a batch (as done in batch normalization, found in other neural network architectures). The feed-forward neural network is fully connected and consists of two linear layers with a ReLU (rectified linear unit) activation function between them. The goal of using the feed-forward network is to map a set of n-dimensional embeddings to another set of n-dimensional embeddings in a latent space common to the language. Each sub-layer adopts a residual connection, is layer normalized, and produces an output of dimension 512.
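To show how these pieces fit together, here is a simplified sketch of the "add and normalize" pattern and the feed-forward network in NumPy. Several things here are assumptions made to keep it short: the learned scale and shift of layer normalization are left out, the attention output is a random stand-in, the weights are random, and the hidden width of 2048 is the value reported in the paper rather than something covered in this post.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position across its features (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between, applied to every position."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer_output)

rng = np.random.default_rng(0)
n, d_model, d_ff = 4, 512, 2048
x = rng.normal(size=(n, d_model))          # embedded, positionally encoded input
attn_out = rng.normal(size=(n, d_model))   # stand-in for the self attention output
W1, b1 = 0.02 * rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = 0.02 * rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

h = add_and_norm(x, attn_out)                             # first sub-layer
out = add_and_norm(h, feed_forward(h, W1, b1, W2, b2))    # second sub-layer
print(out.shape)
```

Note that the output has the same shape as the input, which is what lets the encoder stack Nx of these layers on top of each other.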
The decoder architecture is very similar: a stack of layers, as directed by Nx to the right of the decoder. Each layer has two multi-head attention sub-layers (a masked self attention sub-layer, followed by one that attends over the output of the encoder) and a fully connected feed-forward network sub-layer. Each sub-layer also employs a residual connection and layer normalization. The key difference between the encoder and decoder, aside from the number of multi-head attention sub-layers, is that the first multi-head sub-layer is “masked”. Masked self attention essentially hides the subsequent positions. This feature is present in the decoder, yet absent from the encoder, for a rather logical reason: when encoding a sequence you want to know what comes after the current position, but in a prediction task one shouldn't know what comes next. In the paper, the masking effect is achieved by setting the attention scores of all subsequent positions to −∞ before the softmax, so that they receive a weight of 0 and each prediction is based only on embeddings from prior positions.
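A small sketch of the masking step, under the assumption (taken from the paper) that the mask is applied to the attention scores before the softmax by setting the scores of subsequent positions to −∞. The helper names are my own, and the raw scores are all equal so the effect of the mask is easy to see.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_mask(n):
    """Boolean (n, n) matrix: True where position i may attend to position j."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_attention_weights(scores):
    """Hide subsequent positions, then normalize each row with a softmax."""
    n = scores.shape[-1]
    # exp(-inf) is 0, so masked positions end up with zero weight
    masked = np.where(causal_mask(n), scores, -np.inf)
    return softmax(masked, axis=-1)

scores = np.zeros((4, 4))   # equal raw scores for a 4-token sequence
weights = masked_attention_weights(scores)
print(weights)
```

The first position can only attend to itself, the second splits its attention over the first two positions, and so on; nothing above the diagonal receives any weight.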
If one wishes, an annotated version of the notebook showing the transformer architecture in a translation application can be found here, with an associated GitHub repository. I had some versioning issues when trying to replicate it, so for ease of replication I’ve included the versions of Python and the packages I used here:
- Python: 3.7.7
- Numpy: 1.18.5
- Pytorch: 1.2.0
- Torchtext: 0.4.0
- Matplotlib: 3.0.3
- Seaborn: 0.10.1
- Spacy: 2.0.16
This is by no means a complete or exhaustive guide to the transformer and attention, so here is an additional resource on the topics:
The Illustrated Transformer, a particularly insightful blog post by Jay Alammar that builds the attention mechanism found in the Transformer from the ground up.
The transformer was already a more computationally efficient way to utilize attention; however, the attention mechanism must compute similarity scores for each pair of input positions, which requires space and time that are quadratic in the sequence length. Methods such as memory caching, which reuses representations computed for the previous segment as extended context (essentially removing the need for a fixed maximum length), and sparse attention, which uses sparse matrix multiplication to make the process more computationally efficient, have been tried. While they do aid the process, neither is without its limitations. Google proposed the performer architecture in late September 2020, in which attention scales linearly, removing the need for caching for extended sentences or for the multiple attention layers that arise when utilizing sparse attention. The performer accomplishes this by approximating the attention mechanism using an algorithm the authors call FAVOR+ (Fast Attention Via positive Orthogonal Random features). The performer has been shown to increase efficiency in machine translation, pixel prediction, and protein sequence modeling. More information can be found in both Google’s blog post⁶ and the paper proposing the performer mechanism⁷.
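To give a feel for why this kind of approximation scales linearly, here is a heavily simplified NumPy sketch of attention with positive random features. It is not the FAVOR+ algorithm itself: I use plain independent Gaussian random features instead of the orthogonal ones the performer uses, and all names and sizes are my own. The idea is that the softmax similarity exp(q·k) is approximated by a dot product of feature vectors, so the key-value product can be computed once and the n × n score matrix is never formed.

```python
import numpy as np

def positive_random_features(X, W):
    """phi(x) = exp(W x - |x|^2 / 2) / sqrt(m), so E[phi(q) . phi(k)] = exp(q . k)."""
    m = W.shape[0]
    half_sq_norm = 0.5 * np.sum(X ** 2, axis=-1, keepdims=True)
    return np.exp(X @ W.T - half_sq_norm) / np.sqrt(m)

def linear_attention(Q, K, V, W):
    """Approximate softmax attention without building the (n, n) matrix."""
    d = Q.shape[-1]
    # Fold the usual 1/sqrt(d) scaling into the queries and keys
    Qp = positive_random_features(Q / d ** 0.25, W)   # (n, m)
    Kp = positive_random_features(K / d ** 0.25, W)   # (n, m)
    KV = Kp.T @ V                        # (m, d_v): cost grows linearly with n
    normalizer = Qp @ Kp.sum(axis=0)     # (n,): approximate softmax denominator
    return (Qp @ KV) / normalizer[:, None]

def softmax_attention(Q, K, V):
    """Standard quadratic attention, for comparison."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d, m = 6, 4, 8192                     # m = number of random features
Q = 0.3 * rng.normal(size=(n, d))
K = 0.3 * rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
W = rng.normal(size=(m, d))              # independent Gaussian projections
approx = linear_attention(Q, K, V, W)
exact = softmax_attention(Q, K, V)
print(np.max(np.abs(approx - exact)))    # approximation error on this toy input
```

The approximation is random, so its quality depends on the number of features and on the size of the queries and keys; the paper goes into how orthogonal features reduce that variance.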
Many, myself included, consider Attention Is All You Need to be one of the most impactful deep learning papers of 2017, if not of the 2010s as a whole. Work on attention methods in deep learning and their applications seems to be just getting started. As evidenced by Vaswani et al., using attention for the task of translation is more effective than the prior methods recapped earlier in this post. This paper on utilizing the transformer for image recognition, submitted to ICLR 2021, shows significant improvements when using the transformer compared to the current standard of convolutional networks. Given how well transformers seem to work on NLP and image recognition tasks, a logical follow-up question arises: how well do transformers and attention work on time series forecasting, given that it also involves processing sequential data? While several attempts have been made, I have not yet seen an implementation where the transformer architecture is significantly better than current models.
Given advances regarding attention with both the transformer and, most recently, the performer, it seems as though the application possibilities are endless, ranging from NLP tasks to protein sequencing and, hopefully, time series prediction.
As previously mentioned, this is by no means a complete overview of attention and its uses but rather an accumulation and summary of the works I’ve studied to understand the mechanism better. Please refer to the embedded links to the original papers and other sources for a more in-depth learning experience. I’ve also included them in the References section below.
References
¹ Sutskever, Ilya, et al. “Sequence to Sequence Learning With Neural Networks.” (2014), Advances in Neural Information Processing Systems.
² Cho, Kyunghyun, et al. “Learning Phrase Representations Using RNN Encoder-Decoder For Statistical Machine Translation.” (2014), Empirical Methods in Natural Language Processing.
³ Bahdanau, Dzmitry, et al. “Neural Machine Translation By Jointly Learning To Align and Translate.” (2015), International Conference on Learning Representations.
⁴ Xu, Kelvin, et al. “Show, Attend, and Tell: Neural Image Caption Generation with Attention.” (2015), Proceedings of Machine Learning Research.
⁵ Vaswani, Ashish, et al. “Attention Is All You Need.” (2017), Advances in Neural Information Processing Systems.
⁶ Choromanski, Krzysztof and Colwell, Lucy. “Rethinking Attention with Performers.” (2020), Google AI Blog. Retrieved from https://ai.googleblog.com/2020/10/rethinking-attention-with-performers.html.
⁷ Choromanski, Krzysztof, et al. “Rethinking Attention with Performers.”