# The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)


The year 2018 has been an inflection point for machine learning models handling text (or more accurately, Natural Language Processing, or NLP for short). Our conceptual understanding of how best to represent words and sentences in a way that captures underlying meanings and relationships is rapidly evolving. Moreover, the NLP community has been putting forward incredibly powerful components that you can freely download and use in your own models and pipelines (this has been referred to as NLP's ImageNet moment, referencing how years ago similar developments accelerated the development of machine learning in Computer Vision tasks).

(ULM-FiT has nothing to do with Cookie Monster. But I couldn’t think of anything else..)

One of the latest milestones in this development is the release of BERT, an event described as marking the beginning of a new era in NLP. BERT is a model that broke several records for how well models can handle language-based tasks. Soon after the release of the paper describing the model, the team also open-sourced the code of the model, and made available for download versions of the model that were already pre-trained on massive datasets. This is a momentous development since it enables anyone building a machine learning model involving language processing to use this powerhouse as a readily-available component, saving the time, energy, knowledge, and resources that would have gone into training a language-processing model from scratch.

The two steps of how BERT is developed. You can download the model pre-trained in step 1 (trained on un-annotated data), and only worry about fine-tuning it for step 2.

BERT builds on top of a number of clever ideas that have been bubbling up in the NLP community recently – including but not limited to Semi-supervised Sequence Learning (by Andrew Dai and Quoc Le), ELMo (by Matthew Peters and researchers from AI2 and UW CSE), ULMFiT (by fast.ai founder Jeremy Howard and Sebastian Ruder), the OpenAI Transformer (by OpenAI researchers Radford, Narasimhan, Salimans, and Sutskever), and the Transformer (Vaswani et al).

## Example: Sentence Classification

The most straightforward way to use BERT is to use it to classify a single piece of text. This model would look like this:

To train such a model, you mainly have to train the classifier, with minimal changes happening to the BERT model during the training phase. This training process is called fine-tuning, and has roots in Semi-supervised Sequence Learning and ULMFiT.

For people not versed in the topic: since we're talking about classifiers, we are in the supervised-learning domain of machine learning, which means we need a labeled dataset to train such a model. For this spam classifier example, the labeled dataset would be a list of email messages and a label ("spam" or "not spam") for each message.

Other examples of such a use-case include:

• Sentiment analysis
• Input: Movie/Product review. Output: is the review positive or negative?
• Fact-checking
• Input: sentence. Output: "Claim" or "Not Claim"
• More ambitious/futuristic example:
• Input: Claim sentence. Output: "True" or "False"
• Full Fact is an organization building automatic fact-checking tools for the benefit of the public. Part of their pipeline is a classifier that reads news articles and detects claims (classifies text as either "claim" or "not claim") which can later be fact-checked (by humans now, with ML later, hopefully).

## Model Architecture

Now that you have an example use-case in your head for how BERT can be used, let's take a closer look at how it works.

The paper presents two model sizes for BERT:

• BERT BASE – Comparable in size to the OpenAI Transformer in order to compare performance
• BERT LARGE – A ridiculously huge model which achieved the state of the art results reported in the paper

BERT is basically a trained Transformer Encoder stack. This is a good time to direct you to read my earlier post The Illustrated Transformer which explains the Transformer model – a foundational concept for BERT and the concepts we’ll discuss next.

Both BERT model sizes have a large number of encoder layers (which the paper calls Transformer Blocks) – twelve for the Base version, and twenty-four for the Large version. These also have larger feedforward-networks (768 and 1024 hidden units respectively), and more attention heads (12 and 16 respectively) than the default configuration in the reference implementation of the Transformer in the initial paper (6 encoder layers, 512 hidden units, and 8 attention heads).
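The size comparison above can be sketched as plain configuration dictionaries (the key names below are illustrative, not taken from the BERT codebase):

```python
# Hypothetical sketch: the three configurations mentioned above.
transformer_base = {"layers": 6, "hidden_size": 512, "attention_heads": 8}
bert_base = {"layers": 12, "hidden_size": 768, "attention_heads": 12}
bert_large = {"layers": 24, "hidden_size": 1024, "attention_heads": 16}

def dims_per_head(cfg):
    """Each attention head works on hidden_size / attention_heads dimensions."""
    return cfg["hidden_size"] // cfg["attention_heads"]

print(dims_per_head(transformer_base))  # 64
print(dims_per_head(bert_base))         # 64
print(dims_per_head(bert_large))        # 64
```

Note that while the layer counts and hidden sizes grow, the per-head dimensionality stays constant across all three configurations.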

### Model Inputs

The first input token is supplied with a special [CLS] token for reasons that will become apparent later on. CLS here stands for classification.

Just like the vanilla encoder of the Transformer, BERT takes a sequence of words as input which keeps flowing up the stack. Each layer applies self-attention, passes its results through a feed-forward network, and then hands them off to the next encoder.

In terms of architecture, this has been identical to the Transformer up until this point (aside from size, which is just a configuration we can set). It is at the output that we first start seeing how things diverge.

### Model Outputs

Each position outputs a vector of size hidden_size (768 in BERT Base). For the sentence classification example we’ve looked at above, we focus on the output of only the first position (that we passed the special [CLS] token to).

That vector can now be used as the input for a classifier of our choosing. The paper achieves great results by just using a single-layer neural network as the classifier.

If you have more labels (for example, if you're an email service that tags emails with "spam", "not spam", "social", and "promotion"), you just tweak the classifier network to have more output neurons that then pass through softmax.
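The classification head described above can be sketched in a few lines of NumPy: a single dense layer over the 768-dimensional [CLS] output, followed by softmax. The weights below are random stand-ins purely for illustration, not trained parameters:

```python
import numpy as np

hidden_size, num_labels = 768, 4  # e.g. spam / not spam / social / promotion

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(hidden_size, num_labels))  # untrained stand-in
b = np.zeros(num_labels)

def classify(cls_vector):
    """Project the [CLS] vector to label logits, then normalize with softmax."""
    logits = cls_vector @ W + b
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()

probs = classify(rng.normal(size=hidden_size))
print(probs.shape)  # (4,) -- one probability per label, summing to 1
```

During fine-tuning, these weights (and, to a lesser degree, BERT's own weights) are what get updated.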

## Parallels with Convolutional Nets

For those with a background in computer vision, this vector hand-off should be reminiscent of what happens between the convolution part of a network like VGGNet and the fully-connected classification portion at the end of the network.

## A New Age of Embedding

These new developments carry with them a new shift in how words are encoded. Up until now, word-embeddings have been a major force in how leading NLP models deal with language. Methods like Word2Vec and GloVe have been widely used for such tasks. Let's recap how those are used before pointing to what has now changed.

### Word Embedding Recap

For words to be processed by machine learning models, they need some form of numeric representation that models can use in their calculation. Word2Vec showed that we can use a vector (a list of numbers) to properly represent words in a way that captures semantic or meaning-related relationships (e.g. the ability to tell if words are similar, or opposites, or that a pair of words like “Stockholm” and “Sweden” have the same relationship between them as “Cairo” and “Egypt” have between them) as well as syntactic, or grammar-based, relationships (e.g. the relationship between “had” and “has” is the same as that between “was” and “is”).

The field quickly realized it's a great idea to use embeddings that were pre-trained on vast amounts of text data instead of training them alongside the model on what was frequently a small dataset. It became possible to download a list of words and their embeddings generated by pre-training with Word2Vec or GloVe. This is an example of the GloVe embedding of the word "stick" (with an embedding vector size of 200):

The GloVe word embedding of the word "stick" - a vector of 200 floats (rounded to two decimals). It goes on for two hundred values.
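A toy sketch of how such static embeddings are used: each word maps to one fixed vector, and relatedness is measured with cosine similarity. The 4-dimensional vectors below are made up for illustration; real GloVe vectors have 50-300 dimensions (200 in the example above):

```python
import numpy as np

# Made-up vectors; real embeddings come from a downloaded GloVe/Word2Vec file.
embeddings = {
    "stick":  np.array([0.9, 0.1, 0.0, 0.3]),
    "branch": np.array([0.8, 0.2, 0.1, 0.4]),
    "sweden": np.array([0.0, 0.9, 0.8, 0.1]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "stick" lands closer to "branch" than to "sweden" -- but note it gets the
# SAME vector whether the sentence is about glue, hockey, or firewood.
print(cosine(embeddings["stick"], embeddings["branch"]))
print(cosine(embeddings["stick"], embeddings["sweden"]))
```

That one-vector-per-word limitation is exactly what the next section is about.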

### ELMo: Context Matters

If we're using this GloVe representation, then the word "stick" would be represented by this vector no matter what the context is. "Wait a minute," said a number of NLP researchers, "'stick' has multiple meanings depending on where it's used. Why not give it an embedding based on the context it's used in – to both capture the word meaning in that context as well as other contextual information?" And so, contextualized word-embeddings were born.

Contextualized word-embeddings can give words different embeddings based on the meaning they carry in the context of the sentence.

Instead of using a fixed embedding for each word, ELMo looks at the entire sentence before assigning each word in it an embedding. It uses a bi-directional LSTM trained on a specific task to be able to create those embeddings.

ELMo provided a significant step towards pre-training in the context of NLP. The ELMo LSTM would be trained on a massive dataset in the language of our dataset, and then we can use it as a component in other models that need to handle language.

What's ELMo's secret?

ELMo gained its language understanding from being trained to predict the next word in a sequence of words – a task called Language Modeling. This is convenient because we have vast amounts of text data that such a model can learn from without needing labels.

A step in the pre-training process of ELMo: Given “Let’s stick to” as input, predict the next most likely word – a language modeling task. When trained on a large dataset, the model starts to pick up on language patterns. It’s unlikely it’ll accurately guess the next word in this example. More realistically, after a word such as “hang”, it will assign a higher probability to a word like “out” (to spell “hang out”) than to “camera”.

ELMo actually goes a step further and trains a bi-directional LSTM – so that its language model doesn't only have a sense of the next word, but also the previous word.
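To make the training objective concrete, here is a toy count-based bigram model doing the same job in both directions. ELMo of course does this with LSTMs over massive corpora; the counting below is only a stand-in to show what "predict the next word" and "predict the previous word" mean:

```python
from collections import Counter, defaultdict

corpus = "let's stick to improvisation let's hang out let's hang out".split()

forward = defaultdict(Counter)   # word -> counts of the word that follows it
backward = defaultdict(Counter)  # word -> counts of the word that precedes it
for prev, nxt in zip(corpus, corpus[1:]):
    forward[prev][nxt] += 1
    backward[nxt][prev] += 1

# After "hang", "out" is the most likely continuation, as in the caption above.
print(forward["hang"].most_common(1))   # [('out', 2)]
# The backward model captures the other direction ELMo's second LSTM learns.
print(backward["out"].most_common(1))   # [('hang', 2)]
```

A real language model conditions on the full preceding (or following) sequence rather than a single neighboring word, but the supervision signal – raw text, no labels – is the same.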


## ULM-FiT: Nailing down Transfer Learning in NLP

ULM-FiT introduced a language model and a process to effectively fine-tune that language model for various tasks. NLP finally had a way to do transfer learning probably as well as Computer Vision could.

## The Transformer: Going beyond LSTMs

The Encoder-Decoder structure of the Transformer made it perfect for machine translation. But how would you use it for sentence classification? How would you use it to pre-train a language model that can be fine-tuned for other tasks? (Downstream tasks is what the field calls those supervised-learning tasks that utilize a pre-trained model or component.)

## OpenAI Transformer: Pre-training a Transformer Decoder for Language Modeling

It turns out we don't need an entire Transformer to adopt transfer learning and a fine-tunable language model for NLP tasks. We can do with just the decoder of the Transformer. The decoder is a good choice because it's a natural choice for language modeling (predicting the next word) since it's built to mask future tokens – a valuable feature when it's generating a translation word by word.
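The future-token masking mentioned above can be sketched as a small matrix: position i may only attend to positions up to and including i. Blocked positions get negative infinity, so that the subsequent softmax assigns them zero attention weight:

```python
import numpy as np

def causal_mask(seq_len):
    """0 where attention is allowed, -inf where the future must be hidden."""
    # np.triu with k=1 keeps only the strictly-upper triangle (future tokens).
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

m = causal_mask(4)
print(m)
# Row 0 can attend only to token 0; row 3 can attend to tokens 0-3.
```

This mask is what makes a decoder stack a natural next-word predictor, and, as the next section shows, it is also exactly what BERT gives up by choosing encoders instead.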

The OpenAI Transformer is made up of the decoder stack from the Transformer

The OpenAI Transformer is now ready to be trained to predict the next word on a dataset made up of 7,000 books.

## Transfer Learning to Downstream Tasks

How to use a pre-trained OpenAI Transformer to do sentence classification

## BERT: From Decoders to Encoders

"Hold my beer", said R-rated BERT.

"We'll use Transformer encoders", said BERT.

"We'll use masks", said BERT confidently.

BERT's clever language modeling task masks 15% of words in the input and asks the model to predict the missing word.

Finding the right task to train a Transformer stack of encoders is a complex hurdle that BERT resolves by adopting a "masked language model" concept from earlier literature (where it's called a Cloze task).
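A simplified sketch of that corruption step. Note the real BERT recipe is slightly subtler: of the chosen tokens, 80% become [MASK], 10% are replaced with a random word, and 10% are left unchanged; this version only shows the [MASK] substitution:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=1):
    """Replace roughly mask_rate of tokens with [MASK]; remember the originals."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok          # the model must predict this token
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

tokens = "the man went to the store to buy a gallon of milk".split()
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)  # positions the loss is computed on
```

The loss is computed only on the masked positions, which is why the model must use both left and right context to fill each gap – no causal mask needed.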

If you look back up at the input transformations the OpenAI Transformer does to handle different tasks, you'll notice that some tasks require the model to say something intelligent about two sentences (e.g. are they simply paraphrased versions of each other? Given a Wikipedia entry as input, and a question regarding that entry as another input, can we answer that question?).

The second task BERT is pre-trained on is a two-sentence classification task. The tokenization is oversimplified in this graphic as BERT actually uses WordPieces as tokens rather than words --- so some words are broken down into smaller chunks.
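The packing of a sentence pair into one input sequence can be sketched as follows. Real BERT would first split words into WordPieces and also add position embeddings; this sketch keeps whole words and shows only the token and segment layout:

```python
def build_pair_input(sent_a, sent_b):
    """Pack two sentences as: [CLS] sentence A [SEP] sentence B [SEP]."""
    tokens = ["[CLS]"] + sent_a.split() + ["[SEP]"] + sent_b.split() + ["[SEP]"]
    # Segment ids tell the model which sentence each token belongs to.
    split = tokens.index("[SEP]") + 1
    segment_ids = [0] * split + [1] * (len(tokens) - split)
    return tokens, segment_ids

pair_tokens, segments = build_pair_input("the man went to the store", "he bought milk")
print(pair_tokens)
print(segments)
```

The [CLS] position's output is then fed to a small classifier that predicts whether sentence B actually follows sentence A in the corpus.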

The BERT paper shows a number of ways to use BERT for different tasks.

### BERT for feature extraction

The fine-tuning approach isn't the only way to use BERT. Just like ELMo, you can use pre-trained BERT to create contextualized word embeddings. Then you can feed these embeddings to your existing model – a process the paper shows yields results not far behind fine-tuning BERT on a task such as named-entity recognition.

Which vector works best as a contextualized embedding? I would think it depends on the task. The paper examines six choices (compared to the fine-tuned model, which achieved a score of 96.4):
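A rough sketch of what those combination strategies look like in terms of array shapes (the values below are random stand-ins, not real BERT activations; in the paper's comparison, concatenating the last four layers performed best among the feature-based options):

```python
import numpy as np

num_layers, seq_len, hidden = 12, 5, 768  # BERT Base dimensions
rng = np.random.default_rng(0)
layer_outputs = rng.normal(size=(num_layers, seq_len, hidden))  # one per layer

last_layer = layer_outputs[-1]                                  # (5, 768)
sum_last_four = layer_outputs[-4:].sum(axis=0)                  # (5, 768)
concat_last_four = np.concatenate(layer_outputs[-4:], axis=-1)  # (5, 3072)

print(last_layer.shape, sum_last_four.shape, concat_last_four.shape)
```

Whichever strategy you pick, the per-token vectors can then be fed into your existing model (a BiLSTM-CRF for NER, say) in place of static embeddings.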

## Take BERT out for a spin

The best way to try out BERT is through the notebook hosted on Google Colab. If you've never used Cloud TPUs before, this is also a good starting point to try them, as the BERT code works on TPUs, CPUs, and GPUs.

The next step would be to look at the code in the BERT repo:

• The model is constructed in modeling.py (class BertModel) and is pretty much identical to a vanilla Transformer encoder.
• run_classifier.py is an example of the fine-tuning process. It also constructs the classification layer for the supervised model. If you want to construct your own classifier, check out the create_model() method in that file.
• Several pre-trained models are available for download. These span BERT Base and BERT Large, as well as languages such as English and Chinese, plus a multi-lingual model covering 102 languages, trained on Wikipedia.
• BERT doesn't look at words as tokens. Rather, it looks at WordPieces. tokenization.py is the tokenizer that turns your words into WordPieces appropriate for BERT.
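To make the WordPiece idea concrete, here is a minimal sketch of greedy longest-match-first subword tokenization, with a tiny made-up vocabulary (the real BERT vocab has about 30,000 entries, and the real tokenizer in the repo handles lowercasing, punctuation, and a max-characters-per-word cap as well):

```python
# Tiny illustrative vocabulary; "##" marks a piece that continues a word.
vocab = {"play", "##ing", "##ed", "the", "un", "##believ", "##able"}

def wordpiece(word, vocab):
    """Greedily take the longest vocabulary piece at each position."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces are ##-prefixed
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no known piece matched at this position
    return pieces

print(wordpiece("playing", vocab))       # ['play', '##ing']
print(wordpiece("unbelievable", vocab))  # ['un', '##believ', '##able']
```

This is why rare words don't blow up the vocabulary: they get broken into smaller known chunks rather than mapped to a single unknown token.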

## Acknowledgements

Written on December 3, 2018