# The Illustrated GPT-2 (Visualizing Transformer Language Models)


This year, we saw a dazzling application of machine learning. The GPT-2 exhibited an impressive ability to write coherent and passionate essays that exceed what we anticipated current language models are able to produce. The GPT-2 wasn’t a particularly novel architecture – its architecture is very similar to the decoder-only transformer. The GPT2 was, however, a very large, transformer-based language model trained on a massive dataset. In this post, we’ll look at the architecture that enabled the model to produce its results. We will go into the depths of its self-attention layer. And then we’ll look at applications for the decoder-only transformer beyond language modeling.

My goal here is to also supplement my earlier post, The Illustrated Transformer, with more visuals explaining the inner workings of transformers, and how they’ve evolved since the original paper. My hope is that this visual language will make it easier to explain later Transformer-based models as their inner workings continue to evolve.

Contents

• Part 1: GPT2 And Language Modeling
• What is a Language Model
• Transformers for Language Modeling
• One Difference From BERT
• The Evolution of The Transformer Block
• Crash Course in Brain Surgery: Looking Inside GPT-2
• A Deeper Look Inside
• End of part #1: The GPT-2, Ladies and Gentlemen
• Part 2: The Illustrated Self-Attention
• 1- Create Query, Key, and Value Vectors
• 2- Score
• 3- Sum
• Beyond Language modeling
• Part 3: Beyond Language Modeling
• Machine Translation
• Summarization
• Transfer Learning
• Music Generation

## Part #1: GPT2 And Language Modeling

So what exactly is a language model?

### What is a Language Model

In The Illustrated Word2vec, we’ve looked at what a language model is – basically a machine learning model that is able to look at part of a sentence and predict the next word. The most famous language models are smartphone keyboards that suggest the next word based on what you’ve currently typed.

One great way to experiment with GPT-2 is using the . It uses GPT-2 to display ten possible predictions for the next word (alongside their probability score). You can select a word then see the next list of predictions to continue writing the passage.

### Transformers for Language Modeling

As we’ve seen in The Illustrated Transformer, the original transformer model is made up of an encoder and decoder – each is a stack of what we can call transformer blocks. That architecture was appropriate because the model tackled machine translation – a problem where encoder-decoder architectures have been successful in the past.

How high can we stack up these blocks? It turns out that’s one of the main distinguishing factors between the different GPT2 model sizes:

### One Difference From BERT

First Law of Robotics
A robot may not injure a human being or, through inaction, allow a human being to come to harm.

The way these models actually work is that after each token is produced, that token is added to the sequence of inputs. And that new sequence becomes the input to the model in its next step. This is an idea called “auto-regression”. This is one of the ideas that .
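Here is a minimal sketch of that loop in Python. The `toy_next_token` function and its lookup table are hypothetical stand-ins for a trained model (a real model scores its whole vocabulary at every step); the only thing this is meant to show is how each output gets appended to the input before the next step.

```python
from typing import Callable, List

def generate(next_token: Callable[[List[str]], str],
             prompt: List[str],
             max_new_tokens: int,
             end_token: str = "<|endoftext|>") -> List[str]:
    """Auto-regressive loop: each produced token is appended to the input."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)   # the model sees everything produced so far
        if token == end_token:
            break
        tokens.append(token)         # the output becomes part of the next input
    return tokens

# Hypothetical stand-in for a trained model: a fixed lookup on the last token.
TOY_TABLE = {"<s>": "a", "a": "robot", "robot": "must", "must": "obey"}

def toy_next_token(tokens: List[str]) -> str:
    return TOY_TABLE.get(tokens[-1], "<|endoftext|>")

print(generate(toy_next_token, ["<s>"], max_new_tokens=10))
# ['<s>', 'a', 'robot', 'must', 'obey']
```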

The GPT2, and some later models like TransformerXL and XLNet are auto-regressive in nature. BERT is not. That is a trade-off. In losing auto-regression, BERT gained the ability to incorporate the context on both sides of a word to gain better results. XLNet brings back autoregression while finding an alternative way to incorporate the context on both sides.

### The Evolution of the Transformer Block

#### The Encoder Block

First is the encoder block:

An encoder block from the original transformer paper can take inputs up until a certain max sequence length (e.g. 512 tokens). It's okay if an input sequence is shorter than this limit; we can just pad the rest of the sequence.

#### The Decoder Block

Second, there’s the decoder block which has a small architectural variation from the encoder block – a layer to allow it to pay attention to specific segments from the encoder:

One key difference in the self-attention layer here is that it masks future tokens – not by changing the word to [MASK] like BERT, but by interfering in the self-attention calculation, blocking information from tokens that are to the right of the position being calculated.

#### The Decoder-Only Block

Subsequent to the original paper, proposed another arrangement of the transformer block that is capable of doing language modeling. This model threw away the transformer encoder. For that reason, let’s call the model the “Transformer-Decoder”. This early transformer-based language model was made up of a stack of six transformer decoder blocks:

The decoder blocks are identical. I have expanded the first one so you can see its self-attention layer is the masked variant. Notice that the model now can address up to 4,000 tokens in a certain segment -- a massive upgrade from the 512 in the original transformer.

These blocks were very similar to the original decoder blocks, except they did away with that second attention layer (the one that pays attention to the encoder). A similar architecture was examined in to create a language model that predicts one letter/character at a time.

### Crash Course in Brain Surgery: Looking Inside GPT-2

Look inside and you will see, The words are cutting deep inside my brain. Thunder burning, quickly burning, Knife of words is driving me insane, insane yeah. ~

Let’s lay a trained GPT-2 on our surgery table and look at how it works.

The GPT-2 can process 1024 tokens. Each token flows through all the decoder blocks along its own path.

The simplest way to run a trained GPT-2 is to allow it to ramble on its own (which is technically called generating unconditional samples) – alternatively, we can give it a prompt to have it speak about a certain topic (a.k.a generating interactive conditional samples). In the rambling case, we can simply hand it the start token and have it start generating words (the trained model uses <|endoftext|> as its start token. Let’s call it <s> instead).

The model only has one input token, so that path would be the only active one. The token is processed successively through all the layers, then a vector is produced along that path. That vector can be scored against the model’s vocabulary (all the words the model knows, 50,000 words in the case of GPT-2). In this case we selected the token with the highest probability, ‘the’. But we can certainly mix things up – you know how if you keep clicking the suggested word in your keyboard app, it can sometimes get stuck in repetitive loops where the only way out is to click the second or third suggested word. The same can happen here. GPT-2 has a parameter called top-k that we can use to have the model consider sampling words other than the top word (always picking the top word is what happens when top-k = 1).
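To make top-k a bit more concrete, here is a rough sketch of one way to sample from the k best-scoring words. The function name `sample_top_k` and the toy scores are made up for illustration, and this is not GPT-2's actual sampling code: it keeps the k highest scores, turns them into probabilities with a softmax, and draws one word.

```python
import math
import random
from typing import Dict, Optional

def sample_top_k(scores: Dict[str, float], k: int,
                 rng: Optional[random.Random] = None) -> str:
    """Keep the k highest-scoring tokens, softmax over them, sample one."""
    rng = rng or random.Random()
    top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    max_score = top[0][1]
    # Unnormalized softmax weights; random.choices normalizes them for us.
    weights = [math.exp(score - max_score) for _, score in top]
    tokens = [token for token, _ in top]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy scores for four candidate words (illustrative numbers only).
toy_scores = {"the": 3.1, "a": 2.7, "robot": 1.2, "must": -0.5}

print(sample_top_k(toy_scores, k=1))                        # always 'the'
print(sample_top_k(toy_scores, k=3, rng=random.Random(0)))  # 'the', 'a', or 'robot'
```

With k = 1 the sampling step has only one candidate, so the model always emits its top word; larger values of k let lower-ranked words through in proportion to their scores.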

In the next step, we add the output from the first step to our input sequence, and have the model make its next prediction:

Notice that the second path is the only one that’s active in this calculation. Each layer of GPT-2 has retained its own interpretation of the first token and will use it in processing the second token (we’ll get into more detail about this in the following section about self-attention). GPT-2 does not re-interpret the first token in light of the second token.

### A Deeper Look Inside

#### Input Encoding

Each row is a word embedding: a list of numbers representing a word and capturing some of its meaning. The size of that list is different in different GPT2 model sizes. The smallest model uses an embedding size of 768 per word/token.

So in the beginning, we look up the embedding of the start token <s> in the embedding matrix. Before handing that to the first block in the model, we need to incorporate positional encoding – a signal that indicates the order of the words in the sequence to the transformer blocks. Part of the trained model is a matrix that contains a positional encoding vector for each of the 1024 positions in the input.

Sending a word to the first transformer block means looking up its embedding and adding the positional encoding vector for position #1.
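As a small sketch of this lookup-and-add step, assuming NumPy and randomly initialized matrices in place of trained ones (the vocabulary and embedding sizes below are toy values chosen for the demo, and the variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for the demo; GPT-2 small uses 768-dimensional embeddings.
vocab_size, n_positions, d_model = 10, 1024, 8

token_embeddings = rng.normal(size=(vocab_size, d_model))        # one row per token
positional_encodings = rng.normal(size=(n_positions, d_model))   # one row per position

def encode_input(token_ids):
    """Look up each token's embedding and add the vector for its position."""
    token_ids = np.asarray(token_ids)
    positions = np.arange(len(token_ids))   # 0-based: position #1 is row 0
    return token_embeddings[token_ids] + positional_encodings[positions]

x = encode_input([3, 7])
print(x.shape)  # (2, 8)
```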

#### A journey up the Stack

The first block can now process the token by first passing it through the self-attention process, then passing it through its neural network layer. Once the first transformer block processes the token, it sends its resulting vector up the stack to be processed by the next block. The process is identical in each block, but each block has its own weights in both self-attention and the neural network sublayers.

#### Self-Attention Recap

Language heavily relies on context. For example, look at the second law:

Second Law of Robotics
A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.

I have highlighted three places in the sentence where the words are referring to other words. There is no way to understand or process these words without incorporating the context they are referring to. When a model processes this sentence, it has to be able to know that:

• it refers to the robot
• such orders refers to the earlier part of the law, namely “the orders given it by human beings”
• The First Law refers to the entire First Law

This is what self-attention does. It bakes in the model’s understanding of relevant and associated words that explain the context of a certain word before processing that word (passing it through a neural network). It does that by assigning scores to how relevant each word in the segment is, and adding up their vector representations weighted by those scores.

#### Self-Attention Process

Self-attention is processed along the path of each token in the segment. The significant components are three vectors:

• Query: The query is a representation of the current word used to score against all the other words (using their keys). We only care about the query of the token we’re currently processing.
• Key: Key vectors are like labels for all the words in the segment. They’re what we match against in our search for relevant words.
• Value: Value vectors are actual word representations. Once we’ve scored how relevant each word is, these are the values we add up to represent the current word.

A crude analogy is to think of it like searching through a filing cabinet. The query is like a sticky note with the topic you’re researching. The keys are like the labels of the folders inside the cabinet. When you match a label with the sticky note, you take out the contents of that folder; these contents are the value vector. Except you’re not only looking for one value, but a blend of values from a blend of folders.

Multiplying the query vector by each key vector produces a score for each folder (technically: dot product followed by softmax).

This weighted blend of value vectors results in a vector that paid 50% of its “attention” to the word robot, 30% to the word a, and 19% to the word it. Later in the post, we’ll go deeper into self-attention. But first, let’s continue our journey up the stack towards the output of the model.
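The filing-cabinet search can be sketched in a few lines of NumPy. The vectors below are tiny made-up numbers, not values from a trained model, and the sketch leaves out the scaling by the square root of the key size that the Transformer applies before the softmax.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(query, keys, values):
    """Score one query against every key, then blend the value vectors."""
    scores = keys @ query              # one dot product per token (folder)
    weights = softmax(scores)          # non-negative and summing to 1
    return weights, weights @ values   # weighted sum of the value vectors

# Three tokens in the segment, each with a 2-dimensional key and value.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
query = np.array([2.0, 0.0])           # query of the token being processed

weights, blended = attend(query, keys, values)
print(weights.round(2), blended.round(2))
```

The first and third keys match the query equally well, so their values dominate the blend, while the second token contributes only a little.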

#### Model Output

When the top block in the model produces its output vector (the result of its own self-attention followed by its own neural network), the model multiplies that vector by the embedding matrix.

Recall that each row in the embedding matrix corresponds to the embedding of a word in the model’s vocabulary. The result of this multiplication is interpreted as a score for each word in the model’s vocabulary.
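In code, this scoring step is a single matrix-vector product. The sketch below assumes NumPy, a random stand-in for the trained embedding matrix, and toy sizes; the helper names are hypothetical. The softmax at the end is one common way to turn the scores into probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8            # toy sizes chosen for the demo
token_embeddings = rng.normal(size=(vocab_size, d_model))

def score_vocabulary(output_vector):
    """Multiply the top block's output by the embedding matrix: one score per word."""
    return token_embeddings @ output_vector   # shape: (vocab_size,)

def to_probabilities(scores):
    """Softmax over the vocabulary scores."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

output_vector = rng.normal(size=d_model)      # stand-in for the top block's output
scores = score_vocabulary(output_vector)
probs = to_probabilities(scores)
print(scores.shape, int(np.argmax(probs)))
```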

With that, the model has completed an iteration resulting in outputting a single word. The model continues iterating until the entire context is generated (1024 tokens) or until an end-of-sequence token is produced.

### End of part #1: The GPT-2, Ladies and Gentlemen

And there we have it. A rundown of how the GPT2 works. If you’re curious to know exactly what happens inside the self-attention layer, then the following bonus section is for you. I created it to introduce more visual language to describe self-attention, in order to make later transformer models easier to examine and describe (looking at you, TransformerXL and XLNet).

• I used “words” and “tokens” interchangeably. But in reality, GPT2 uses Byte Pair Encoding to create the tokens in its vocabulary. This means the tokens are usually parts of words.
• The example we showed runs GPT2 in its inference/evaluation mode. That’s why it’s only processing one word at a time. At training time, the model would be trained against longer sequences of text and processing multiple tokens at once. Also at training time, the model would process larger batch sizes (512) vs. the batch size of one that evaluation uses.
• I took liberties in rotating/transposing vectors to better manage the spaces in the images. At implementation time, one has to be more precise.
• Transformers use a lot of layer normalization, which is pretty important. We’ve noted a few of these in the Illustrated Transformer, but focused more on self-attention in this post.
• There are times when I needed to show more boxes to represent a vector. I indicate those as “zooming in”. For example:

## Part #2: The Illustrated Self-Attention

Earlier in the post we showed this image to showcase self-attention being applied in a layer that is processing the word it:

In this section, we’ll look at the details of how that is done. Note that we’ll look at it in a way to try to make sense of what happens to individual words. That’s why we’ll be showing many single vectors. The actual implementations are done by multiplying giant matrices together. But I want to focus on the intuition of what happens on a word-level here.

Self-attention is applied through three main steps:

1. Create the Query, Key, and Value vectors for each path.
2. For each input token, use its query vector to score against all the other key vectors.
3. Sum up the value vectors after multiplying them by their associated scores.
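The three steps can be sketched for all paths at once with NumPy. The weight matrices here are random placeholders rather than trained weights, the sizes are toy values, and attention heads are ignored; the scores are normalized with a softmax so that each token's scores sum to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 4, 8                      # toy sizes for the demo

x = rng.normal(size=(n_tokens, d_model))      # one input vector per path
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# Step 1: create the query, key, and value vectors for each path.
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Step 2: score each token's query against every key, then softmax each row.
scores = q @ k.T                              # (n_tokens, n_tokens)
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)

# Step 3: sum up the value vectors after multiplying them by their scores.
out = weights @ v                             # one context-aware vector per path
print(out.shape)  # (4, 8)
```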

### 1- Create Query, Key, and Value Vectors

Let’s focus on the first path. We’ll take its query, and compare against all the keys. That produces a score for each key. The first step in self-attention is to calculate the three vectors for each token path (let’s ignore attention heads for now):

### 2- Score

Now that we have the vectors, we use the query and key vectors only for step #2. Since we’re focused on the first token, we multiply its query by all the key vectors (including its own), resulting in a score for each of the four tokens.

### 3- Sum

We can now multiply the scores by the value vectors. A value with a high score will constitute a large portion of the resulting vector after we sum them up.

The lower the score, the more transparent we're showing the value vector. That's to indicate how multiplying by a small number dilutes the values of the vector.

If we do the same operation for each path, we end up with a vector representing each token containing the appropriate context of that token. Those are then presented to the next sublayer in the transformer block (the feed-forward neural network):

Now that we’ve looked inside a transformer’s self-attention step, let’s proceed to look at masked self-attention. Masked self-attention is identical to self-attention except when it comes to step #2. Assume the model only has two tokens as input and we’re observing the second token. In this case, the last two tokens are masked, so the model interferes in the scoring step. It basically always scores the future tokens as 0 so the model can’t peek at future words:

In matrix form, we calculate the scores by multiplying a queries matrix by a keys matrix. Let’s visualize it as follows, except instead of the word, there would be the query (or key) vector associated with that word in that cell:

What this scores table means is the following:

• When the model processes the first example in the dataset (row #1), which contains only one word (“robot”), 100% of its attention will be on that word.
• When the model processes the second example in the dataset (row #2), which contains the words “robot must”, and it processes the word “must”, 48% of its attention will be on “robot”, and 52% of its attention will be on “must”.
• And so on.
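A scores table like this one can be produced with a few lines of NumPy. One common way to make the future tokens end up with a score of 0 (an implementation choice on my part, not something shown above) is to set their raw scores to negative infinity before the softmax. The function name and the random inputs are illustrative only.

```python
import numpy as np

def masked_attention_weights(q, k):
    """Causal attention weights: row i may only attend to positions <= i."""
    scores = q @ k.T                                      # (n, n) raw scores
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)    # True above the diagonal
    scores = np.where(future, -np.inf, scores)            # block tokens to the right
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                              # exp(-inf) is exactly 0
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))    # stand-in queries for a four-token segment
k = rng.normal(size=(4, 8))    # stand-in keys for the same four tokens

weights = masked_attention_weights(q, k)
print(weights.round(2))        # lower-triangular; every row sums to 1
```

Row #1 always comes out as 100% on the first token, since that is the only position it is allowed to see.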

Let’s get into more detail on GPT-2’s masked attention.

#### Evaluation Time: Processing One Token at a Time

We can make the GPT-2 operate exactly as masked self-attention works. But during evaluation, when our model is only adding one new word after each iteration, it would be inefficient to recalculate self-attention along earlier paths for tokens which have already been processed.

In this case, we process the first token (ignoring <s> for now).

GPT-2 holds on to the key and value vectors of the a token. Every self-attention layer holds on to its respective key and value vectors for that token:

Now in the next iteration, when the model processes the word robot, it does not need to generate query, key, and value vectors for the a token. It just reuses the key and value vectors it saved from the first iteration:
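Here is a rough single-head sketch of that caching idea, assuming NumPy and random placeholder weights. The class name `CachedSelfAttention` is hypothetical and this is not how GPT-2's code is organized; it only shows that each new token computes its own query, key, and value once, and that earlier keys and values are reused rather than recomputed.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

class CachedSelfAttention:
    """Single-head self-attention that saves the keys and values of past tokens."""

    def __init__(self, w_q, w_k, w_v):
        self.w_q, self.w_k, self.w_v = w_q, w_k, w_v
        self.keys, self.values = [], []       # grows by one entry per token

    def step(self, x):
        """Process one new token vector, reusing the cached keys and values."""
        q = x @ self.w_q
        self.keys.append(x @ self.w_k)        # computed once, then kept
        self.values.append(x @ self.w_v)
        keys, values = np.stack(self.keys), np.stack(self.values)
        weights = softmax(keys @ q)           # only past and current tokens exist
        return weights @ values

rng = np.random.default_rng(0)
d = 4
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
layer = CachedSelfAttention(w_q, w_k, w_v)

tokens = rng.normal(size=(3, d))              # three token vectors, fed one at a time
outputs = [layer.step(t) for t in tokens]
print(len(layer.keys), outputs[-1].shape)     # 3 (4,)
```

Because the cache only ever contains tokens that came earlier, the masking falls out for free: there are no future keys to score against.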

#### GPT-2 Self-attention: 1- Creating queries, keys, and values

Let’s assume the model is processing the word it. If we’re talking about the bottom block, then its input for that token would be the embedding of it + the positional encoding for slot #9:

Every block in a transformer has its own weights (broken down later in the post). The first we encounter is the weight matrix that we use to create the queries, keys, and values.

Self-attention multiplies its input by its weight matrix (and adds a bias vector, not illustrated here).

The multiplication results in a vector that’s basically a concatenation of the query, key, and value vectors for the word it.

Multiplying the input vector by the attention weight matrix (and adding a bias vector afterwards) results in the key, value, and query vectors for this token.
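As a sketch with NumPy, using the model dimension of the small GPT2 (768) and random numbers in place of the trained weight matrix and bias (the names `w_attn` and `b_attn` are my own labels):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 768                                   # model dimension of the small GPT2

# Random stand-ins for the trained attention weight matrix and bias vector.
w_attn = rng.normal(scale=0.02, size=(d_model, 3 * d_model))
b_attn = np.zeros(3 * d_model)

x = rng.normal(size=d_model)                    # stand-in for embedding of "it" + position

qkv = x @ w_attn + b_attn                       # one long vector of length 3 * 768 = 2304
query, key, value = np.split(qkv, 3)            # three vectors of length 768 each
print(qkv.shape, query.shape)                   # (2304,) (768,)
```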

#### GPT-2 Self-attention: 1.5- Splitting into attention heads

In the previous examples, we’ve looked at what happens inside one attention head. One way to think of multiple attention-heads is like this (if we’re to only visualize three of the twelve attention heads):
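One way to picture the split in code, assuming the heads are formed by simply reshaping the long vector (with 768 numbers and twelve heads, each head gets 768 / 12 = 64 of them). The query below is filled with counting numbers so it is easy to see which slice lands in which head.

```python
import numpy as np

d_model, n_heads = 768, 12
head_dim = d_model // n_heads                   # 64 numbers per head

query = np.arange(d_model, dtype=float)         # stand-in for a real query vector

def split_heads(vector):
    """Reshape a (768,) vector into (12, 64): one row per attention head."""
    return vector.reshape(n_heads, head_dim)

query_heads = split_heads(query)
print(query_heads.shape)      # (12, 64)
print(query_heads[1][:3])     # head #2 starts at element 64: [64. 65. 66.]
```

The same reshape would be applied to the key and value vectors, so that head #1 scores its slice of the query against its slice of each key.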

#### GPT-2 Self-attention: 2- Scoring

We can now proceed to scoring – knowing that we’re only looking at one attention head (and that all the others are conducting a similar operation):

Now the token can get scored against all the keys of the other tokens (that were calculated in attention head #1 in previous iterations):

#### GPT-2 Self-attention: 3- Sum

As we’ve seen before, we now multiply each value with its score, then sum them up, producing the result of self-attention for attention-head #1:

#### GPT-2 Self-attention: 3.5- Merge attention heads

But the vector isn’t ready to be sent to the next sublayer just yet. We need to first turn this Frankenstein’s-monster of hidden states into a homogeneous representation.

#### GPT-2 Self-attention: 4- Projecting

We’ll let the model learn how to best map concatenated self-attention results into a vector that the feed-forward neural network can deal with. Here comes our second large weight matrix that projects the results of the attention heads into the output vector of the self-attention sublayer:
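A minimal sketch of merging and projecting, assuming NumPy, twelve heads of 64 numbers each, and random stand-ins for the trained projection weights (`w_proj` and `b_proj` are names I picked for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads = 768, 12
head_dim = d_model // n_heads

head_outputs = rng.normal(size=(n_heads, head_dim))   # one result vector per head

# Random stand-ins for the trained projection matrix and bias.
w_proj = rng.normal(scale=0.02, size=(d_model, d_model))
b_proj = np.zeros(d_model)

merged = head_outputs.reshape(d_model)        # concatenate the 12 heads: (768,)
attn_output = merged @ w_proj + b_proj        # learned mix of all heads: (768,)
print(merged.shape, attn_output.shape)        # (768,) (768,)
```

The concatenation alone keeps each head's result in its own 64-number slot; the projection lets every output number draw on every head.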

#### GPT-2 Fully-Connected Neural Network: Layer #1

The fully-connected neural network is where the block processes its input token after self-attention has included the appropriate context in its representation. It is made up of two layers. The first layer is four times the size of the model (since GPT2 small is 768, this network would have 768*4 = 3072 units). Why four times? That’s just the size the original transformer rolled with (model dimension was 512 and layer #1 in that model was 2048). This seems to give transformer models enough representational capacity to handle the tasks that have been thrown at them so far.

(Not shown: A bias vector)

#### GPT-2 Fully-Connected Neural Network: Layer #2 - Projecting to model dimension

The second layer projects the result from the first layer back into model dimension (768 for the small GPT2). The result of this multiplication is the result of the transformer block for this token.
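Both layers together can be sketched like this, assuming NumPy and random weights in place of trained ones. The activation between the two layers is not covered in this post; the sketch uses the tanh approximation of GELU, which is what GPT-2 uses.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
d_model = 768
d_hidden = 4 * d_model                        # 768 * 4 = 3072 units

# Random stand-ins for the trained weights and biases of the two layers.
w1 = rng.normal(scale=0.02, size=(d_model, d_hidden))
b1 = np.zeros(d_hidden)
w2 = rng.normal(scale=0.02, size=(d_hidden, d_model))
b2 = np.zeros(d_model)

def feed_forward(x):
    hidden = gelu(x @ w1 + b1)                # layer #1: expand to 3072
    return hidden @ w2 + b2                   # layer #2: project back to 768

x = rng.normal(size=d_model)                  # stand-in for the self-attention output
print(feed_forward(x).shape)                  # (768,)
```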

(Not shown: A bias vector)

And each block has its own set of these weights. On the other hand, the model has only one token embedding matrix and one positional encoding matrix:

## Part 3: Beyond Language Modeling

### Summarization

This is the task that the first decoder-only transformer was trained on. Namely, it was trained to read a Wikipedia article (without the opening section before the table of contents), and to summarize it. The actual opening sections of the articles were used as the labels in the training dataset:

The paper trained the model against Wikipedia articles, and thus the trained model was able to summarize articles:

### Transfer Learning

In , a decoder-only transformer is first pre-trained on language modeling, then fine-tuned to do summarization. It turns out to achieve better results than a pre-trained encoder-decoder transformer in limited data settings.

### Music Generation

You might be curious as to how music is represented in this scenario. Remember that language modeling can be done through vector representations of either characters, words, or tokens that are parts of words. With a musical performance (let’s think about the piano for now), we have to represent the notes, but also velocity – a measure of how hard the piano key is pressed.

The one-hot vector representation for this input sequence would look like this:

I love a visual in the paper that showcases self-attention in the Music Transformer. I’ve added some annotations to it here:

"Figure 8: This piece has a recurring triangular contour. The query is at one of the latter peaks and it attends to all of the previous high notes on the peak, all the way to beginning of the piece." ... "[The] figure shows a query (the source of all the attention lines) and previous memories being attended to (the notes that are receiving more softmax probabiliy is highlighted in). The coloring of the attention lines correspond to different heads and the width to the weight of the softmax probability."

If you’re unclear on this representation of musical notes, .

## Resources

• The from OpenAI
• Check out the library from in addition to GPT2, it implements BERT, Transformer-XL, XLNet and other cutting-edge transformer models.