# The Illustrated Transformer

Watch: MIT’s lecture referencing this post

In the previous post, we looked at Attention – a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model in specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. It is in fact Google Cloud's recommendation to use The Transformer as a reference model for using their offering. So let's try to break the model apart and look at how it functions.

The Transformer was proposed in the paper Attention Is All You Need. A TensorFlow implementation of it is available as a part of the Tensor2Tensor package. Harvard's NLP group created a guide annotating the paper with a PyTorch implementation. In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one to hopefully make it easier to understand for people without in-depth knowledge of the subject matter.

## A High-Level Look

Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.

The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:

The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.

The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).

## Bringing The Tensors Into The Picture

Now that we've seen the major components of the model, let's start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.

As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.

Each word is embedded into a vector of size 512. We'll represent those vectors with these simple boxes.
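
To make the shapes concrete, here is a tiny numpy sketch of that embedding step; the vocabulary and the embedding matrix are made-up stand-ins, since in the real model the embedding matrix is learned during training:

```python
import numpy as np

# Toy embedding lookup: map each input word to a 512-dimensional vector.
# The vocabulary and embedding matrix are made-up stand-ins for learned ones.
d_model = 512
vocab = {"je": 0, "suis": 1, "étudiant": 2}
embedding_matrix = np.random.randn(len(vocab), d_model) * 0.01

def embed(sentence):
    return np.array([embedding_matrix[vocab[word]] for word in sentence])

x = embed(["je", "suis", "étudiant"])
print(x.shape)  # (3, 512) -- one 512-dimensional vector per word
```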

## Now We’re Encoding!

As we've mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a 'self-attention' layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.

The word at each position passes through a self-attention process. Then, they each pass through a feed-forward neural network -- the exact same network with each vector flowing through it separately.

## Self-Attention at a High Level

Don't be fooled by me throwing around the word "self-attention" like it's a concept everyone should be familiar with. I had personally never come across the concept until reading the Attention is All You Need paper. Let us distill how it works.

The animal didn't cross the street because it was too tired

When the model is processing the word "it", self-attention allows it to associate "it" with "animal".

As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.

As we are encoding the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism was focusing on "The Animal", and baked a part of its representation into the encoding of "it".

Be sure to check out the Tensor2Tensor notebook where you can load a Transformer model, and examine it using this interactive visualization.

## Self-Attention in Detail

The first step in calculating self-attention is to create three vectors from each of the encoder's input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have dimensionality of 512. They don't have to be smaller; this is an architecture choice to make the computation of multi-headed attention (mostly) constant.

Multiplying x1 by the WQ weight matrix produces q1, the "query" vector associated with that word. We end up creating a "query", a "key", and a "value" projection of each word in the input sentence.
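
As a rough numpy sketch of that projection step (the 512 and 64 dimensions follow the post; WQ/WK/WV are random stand-ins for the trained matrices):

```python
import numpy as np

d_model, d_k = 512, 64          # embedding size and q/k/v size from the post

# Random stand-ins for the trained projection matrices.
WQ = np.random.randn(d_model, d_k) * 0.01
WK = np.random.randn(d_model, d_k) * 0.01
WV = np.random.randn(d_model, d_k) * 0.01

x1 = np.random.randn(d_model)   # embedding of the first word ("Thinking")
q1 = x1 @ WQ                    # 64-dim "query" vector for that word
k1 = x1 @ WK                    # 64-dim "key" vector
v1 = x1 @ WV                    # 64-dim "value" vector
```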

What are the “query”, “key”, and “value” vectors?

They’re abstractions that are useful for calculating and thinking about attention. Once you proceed with reading how attention is calculated below, you’ll know pretty much all you need to know about the role each of these vectors plays.

The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper – 64. This leads to having more stable gradients. There could be other possible values here, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they're all positive and add up to 1.

This softmax score determines how much each word will be expressed at this position. Clearly the word at this position will have the highest softmax score, but sometimes it's useful to attend to another word that is relevant to the current word.

The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown-out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
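
Putting steps two through six together for one position, a minimal numpy sketch (with made-up q/k/v vectors standing in for the real projections) might look like this:

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

d_k = 64
q1 = np.random.randn(d_k)             # query for position #1
keys = np.random.randn(2, d_k)        # k1, k2 for a two-word sentence
values = np.random.randn(2, d_k)      # v1, v2

scores = keys @ q1                    # step 2: q1·k1 and q1·k2
scores = scores / np.sqrt(d_k)        # step 3: divide by 8 (sqrt of 64)
weights = softmax(scores)             # step 4: softmax
weighted_values = weights[:, None] * values   # step 5: scale each value vector
z1 = weighted_values.sum(axis=0)      # step 6: sum -> self-attention output at #1
```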

## Matrix Calculation of Self-Attention

The first step is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X, and multiplying it by the weight matrices we’ve trained (WQ, WK, WV).

Every row in the X matrix corresponds to a word in the input sentence. We again see the difference in size of the embedding vector (512, or 4 boxes in the figure), and the q/k/v vectors (64, or 3 boxes in the figure)

Finally, since we’re dealing with matrices, we can condense steps two through six in one formula to calculate the outputs of the self-attention layer.

The self-attention calculation in matrix form
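
In symbols, that condensed calculation is the formula given in the paper:

$$Z = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$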

## The Beast With Many Heads

The paper further refines the self-attention layer by adding a mechanism called "multi-headed" attention. This improves the performance of the attention layer in two ways:

1. It expands the model's ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. If we're translating a sentence like "The animal didn't cross the street because it was too tired", it would be useful to know which word "it" refers to.

2. It gives the attention layer multiple "representation subspaces". As we'll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.

With multi-headed attention, we maintain separate Q/K/V weight matrices for each head resulting in different Q/K/V matrices. As we did before, we multiply X by the WQ/WK/WV matrices to produce Q/K/V matrices.

If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices.
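
A minimal numpy sketch of that multi-headed calculation might look like the following; the concatenation of the eight Z matrices and the extra projection matrix WO follow the paper, while the weights themselves are random stand-ins:

```python
import numpy as np

def softmax(x):
    exp = np.exp(x - x.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def attention_head(X, WQ, WK, WV):
    Q, K, V = X @ WQ, X @ WK, X @ WV
    return softmax(Q @ K.T / np.sqrt(WK.shape[1])) @ V   # one Z matrix

d_model, d_k, n_heads = 512, 64, 8
X = np.random.randn(3, d_model)        # three toy word embeddings

# Eight heads, each with its own (random stand-in) WQ/WK/WV matrices.
heads = [attention_head(X,
                        np.random.randn(d_model, d_k) * 0.01,
                        np.random.randn(d_model, d_k) * 0.01,
                        np.random.randn(d_model, d_k) * 0.01)
         for _ in range(n_heads)]

# Concatenate the eight Z matrices and project with WO (as in the paper)
# so the result is a single (3, 512) matrix for the feed-forward layer.
WO = np.random.randn(n_heads * d_k, d_model) * 0.01
Z = np.concatenate(heads, axis=-1) @ WO
```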

Now that we have touched upon attention heads, let's revisit our example from before to see where the different attention heads are focusing as we encode the word "it" in our example sentence:

As we encode the word "it", one attention head is focusing most on "the animal", while another is focusing on "tired" -- in a sense, the model's representation of the word "it" bakes in some of the representation of both "animal" and "tired".

## Representing The Order of The Sequence Using Positional Encoding

One thing that's missing from the model as we have described it so far is a way to account for the order of the words in the input sequence.

To address this, the Transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they're projected into Q/K/V vectors and during dot-product attention.

To give the model a sense of the order of the words, we add positional encoding vectors -- the values of which follow a specific pattern.

If we assumed the embedding has a dimensionality of 4, the actual positional encodings would look like this:

A real example of positional encoding with a toy embedding size of 4

In the following figure, each row corresponds to the positional encoding of a vector. So the first row would be the vector we'd add to the embedding of the first word in an input sequence. Each row contains 512 values – each with a value between 1 and -1. We've color-coded them so the pattern is visible.

A real example of positional encoding for 20 words (rows) with an embedding size of 512 (columns). You can see that it appears split in half down the center. That's because the values of the left half are generated by one function (which uses sine), and the right half is generated by another function (which uses cosine). They're then concatenated to form each of the positional encoding vectors.

The formula for positional encoding is described in the paper (section 3.5). You can see the code for generating positional encodings in the Tensor2Tensor implementation. This is not the only possible method for positional encoding. It, however, gives the advantage of being able to scale to unseen lengths of sequences (e.g. if our trained model is asked to translate a sentence longer than any of those in our training set).
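
As a rough numpy sketch, here is the concatenated sine/cosine variant pictured in the figure above (the paper's formula interleaves the sine and cosine terms per dimension instead, but the underlying pattern is the same):

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    positions = np.arange(n_positions)[:, None]          # (n_positions, 1)
    dims = np.arange(d_model // 2)[None, :]               # (1, d_model / 2)
    angle_rates = 1.0 / np.power(10000, 2 * dims / d_model)
    angles = positions * angle_rates                       # (n_positions, d_model / 2)
    # Left half of each row uses sine, right half uses cosine, then concatenate.
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

pe = positional_encoding(20, 512)       # 20 rows, like the figure above
print(pe.shape, pe.min(), pe.max())     # values fall between -1 and 1
```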

## The Residuals

One detail in the architecture of the encoder that we need to mention before moving on is that each sub-layer (self-attention, FFNN) in each encoder has a residual connection around it, and is followed by a layer-normalization step.

If we're to visualize the vectors and the layer-norm operation associated with self-attention, it would look like this:
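
In code, a bare-bones sketch of that residual-plus-layer-norm wrapper might look like this (the real layer normalization also has learned scale and bias parameters, omitted here):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Residual connection around the sub-layer, then layer normalization:
    # LayerNorm(x + Sublayer(x)) -- wrapped around both self-attention and the FFNN.
    return layer_norm(x + sublayer(x))

x = np.random.randn(3, 512)            # three toy word vectors
out = add_and_norm(x, lambda h: h)     # identity stands in for the self-attention sub-layer
```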

## The Decoder Side

After finishing the encoding phase, we begin the decoding phase. Each step in the decoding phase outputs an element from the output sequence (the English translation sentence in this case).

The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.

The self-attention layers in the decoder operate in a slightly different way than the ones in the encoder:

In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.
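
A small numpy sketch of that masking step, with a stand-in score matrix:

```python
import numpy as np

def softmax(x):
    exp = np.exp(x - x.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.random.randn(seq_len, seq_len)   # stand-in for Q·K^T / sqrt(d_k) scores

# Mask out future positions: row i may only look at columns 0..i.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[future] = -np.inf

weights = softmax(scores)   # each row sums to 1; future positions get weight 0
```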

## The Final Linear and Softmax Layer

Let's assume that our model knows 10,000 unique English words (our model's "output vocabulary") that it's learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer.

The Softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.

This figure starts from the bottom with the vector produced as the output of the decoder stack. It is then turned into an output word.
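
A minimal numpy sketch of that final Linear + Softmax step (the weights and sizes are stand-ins for the trained ones):

```python
import numpy as np

def softmax(x):
    exp = np.exp(x - x.max())
    return exp / exp.sum()

d_model, vocab_size = 512, 10000        # sizes from the running example
decoder_output = np.random.randn(d_model)               # vector from the decoder stack
W_linear = np.random.randn(d_model, vocab_size) * 0.01  # stand-in for the trained Linear layer

logits = decoder_output @ W_linear      # one score (logit) per vocabulary word
probs = softmax(logits)                 # all positive, sum to 1.0
predicted_word_id = int(np.argmax(probs))   # index of the word produced this time step
```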

## Recap Of Training

During training, an untrained model would go through the exact same forward pass. But since we are training it on a labeled training dataset, we can compare its output with the actual correct output.

To visualize this, let's assume our output vocabulary only contains six words ("a", "am", "i", "thanks", "student", and "<eos>" (short for 'end of sentence')).

The output vocabulary of our model is created in the preprocessing phase before we even begin training.

Example: one-hot encoding of our output vocabulary
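
For instance, a tiny sketch of that one-hot encoding for our six-word toy vocabulary:

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vector = np.zeros(len(vocab))
    vector[word_to_index[word]] = 1.0
    return vector

print(one_hot("am"))   # [0. 1. 0. 0. 0. 0.]
```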

## The Loss Function

Say we are training our model to translate "merci" into "thanks". What this means is that we want the output to be a probability distribution indicating the word "thanks". But since this model is not yet trained, that's unlikely to happen just yet.

Since the model's parameters (weights) are all initialized randomly, the (untrained) model produces a probability distribution with arbitrary values for each cell/word. We can compare it with the actual output, then tweak all the model's weights using backpropagation to make the output closer to the desired output.

But note that this is an oversimplified example. More realistically, we'll use a sentence longer than one word. For example – input: "je suis étudiant" and expected output: "i am a student". What this really means is that we want our model to successively output probability distributions where:

• Each probability distribution is represented by a vector of width vocab_size (6 in our toy example, but more realistically a number like 3,000 or 10,000)
• The first probability distribution has the highest probability at the cell associated with the word “i”
• The second probability distribution has the highest probability at the cell associated with the word “am”
• And so on, until the fifth output distribution indicates the '<end of sentence>' symbol, which also has a cell associated with it from the 10,000 element vocabulary.

The targeted probability distributions we'll train our model against in the training example for one sample sentence.

Hopefully upon training, the model would output the right translation we expect. Of course it's no real indication if this phrase was part of the training dataset (see: cross-validation). Notice that every position gets a little bit of probability even if it's unlikely to be the output of that time step -- that's a very useful property of softmax which helps the training process.

Now, because the model produces the outputs one at a time, we can assume that the model is selecting the word with the highest probability from that probability distribution and throwing away the rest. That's one way to do it (called greedy decoding). Another way to do it would be to hold on to, say, the top two words (say, 'i' and 'a' for example), then in the next step, run the model twice: once assuming the first output position was the word 'i', and another time assuming the first output position was the word 'a', and whichever version produced less error considering both positions #1 and #2 is kept. We repeat this for positions #2 and #3…etc. This method is called "beam search", where in our example, beam_size was two (because we compared the results after calculating the beams for positions #1 and #2), and top_beams is also two (since we kept two words). These are both hyperparameters that you can experiment with.
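
As a rough illustration, here is what greedy decoding might look like as a Python sketch; `model`, `bos_id`, and `eos_id` are hypothetical stand-ins rather than part of any real Transformer API:

```python
import numpy as np

def greedy_decode(model, encoder_output, max_len, eos_id, bos_id=0):
    # `model` is a hypothetical callable that, given the encoder output and the
    # words produced so far, returns a probability distribution over the vocabulary
    # for the next position. At each step we keep only the highest-probability word.
    output = [bos_id]
    for _ in range(max_len):
        probs = model(encoder_output, output)
        next_id = int(np.argmax(probs))
        output.append(next_id)
        if next_id == eos_id:            # stop once the end-of-sentence symbol appears
            break
    return output
```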

## Go Forth And Transform

• Read the Attention Is All You Need paper, the Transformer blog post (Transformer: A Novel Neural Network Architecture for Language Understanding), and the Tensor2Tensor announcement.
• Watch Łukasz Kaiser's talk walking through the model and its details
• Play with the Jupyter Notebook provided as part of the Tensor2Tensor repo
• Explore the Tensor2Tensor repo.

## Acknowledgements

Thanks to everyone who provided feedback on earlier versions of this post.

Please hit me up on Twitter for any corrections or feedback.

Written on June 27, 2018