How GPT3 Works - Visualizations and Animations

The tech world is abuzz with GPT3 hype. Massive language models (like GPT3) are starting to surprise us with their abilities. While not yet completely reliable for most businesses to put in front of their customers, these models are showing sparks of cleverness that are sure to accelerate the march of automation and the possibilities of intelligent computer systems. Let’s remove the aura of mystery around GPT3 and learn how it’s trained and how it works.

A trained language model generates text. We can optionally pass it some text as input, which influences its output.

The output is generated from what the model “learned” during its training period where it scanned vast amounts of text.

GPT3 Language Model Overview

Training is the process of exposing the model to lots of text. That process has been completed. All the experiments you see now are from that one trained model. It was estimated to cost 355 GPU years and cost $4.6m.

GPT3 Training Language Model

The dataset of 300 billion tokens of text is used to generate training examples for the model. For example, these are three training examples generated from the one sentence at the top.

GPT3 Training Examples Sliding Window

The model is presented with an example. We only show it the features and ask it to predict the next word.

Repeat millions of times

GPT3 Training Step Backpropagation

Now let’s look at these same steps with a bit more detail.

GPT3 actually generates output one token at a time (let’s assume a token is a word for now).

GPT3 Generate Tokens Output

Please note: This is a description of how GPT-3 works and not a discussion of what is novel about it (which is mainly the ridiculously large scale). The architecture is a transformer decoder model based on this paper https://arxiv.org/pdf/1801.10198.pdf.

GPT3 is MASSIVE. It encodes what it learns from training in 175 billion numbers (called parameters). These numbers are used to calculate which token to generate at each run.

GPT3 Parameters Weights

These numbers are part of hundreds of matrices inside the model. Prediction is mostly a lot of matrix multiplication.

In my Intro to AI on YouTube, I showed a simple ML model with one parameter. A good start to unpack this 175B monstrosity.

GPT3 Generate Output Context Window

High-level steps:

  1. Convert the word to a vector (list of numbers) representing the word
  2. Compute prediction
  3. Convert resulting vector to word
GPT3 Embedding

The important calculations of the GPT3 occur inside its stack of 96 transformer decoder layers.

See all these layers? This is the “depth” in “deep learning”.

GPT3 Processing Transformer Blocks

You can see a detailed explanation of everything inside the decoder in my blog post The Illustrated GPT2.

The difference with GPT3 is the alternating dense and sparse self-attention layers.

GPT3 Tokens Transformer Blocks

My assumption is that the priming examples and the description are appended as input, with specific tokens separating examples and the results. Then fed into the model.

GPT3 Generating React Code Example

Fine-tuning actually updates the model’s weights to make the model better at a certain task.

GPT3 Fine Tuning