How GPT3 Works - Visualizations and Animations

The tech world is abuzz with GPT3 hype. Massive language models (like GPT3) are starting to surprise us with their abilities. While not yet completely reliable for most businesses to put in front of their customers, these models are showing sparks of cleverness that are sure to accelerate the march of automation and the possibilities of intelligent computer systems. Let’s remove the aura of mystery around GPT3 and learn how it’s trained and how it works.

A trained language model generates text. We can optionally pass it some text as input, which influences its output.

The output is generated from what the model “learned” during its training period where it scanned vast amounts of text.

Training is the process of exposing the model to lots of text. That process has been completed. All the experiments you see now are from that one trained model. Training was estimated to take 355 GPU-years and cost $4.6m.

The dataset of 300 billion tokens of text is used to generate training examples for the model. For example, these are three training examples generated from the one sentence at the top.
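As a rough sketch of how one sentence can yield several training examples, the snippet below slides over the words and pairs each prefix (the features) with the word that follows it (the label). The sentence and the function name `make_training_examples` are made up for illustration, and splitting on whitespace is a simplification of real tokenization.

```python
def make_training_examples(sentence):
    """Turn one sentence into (features, label) pairs.

    Each example uses the words seen so far as features and the
    following word as the label to predict.
    """
    words = sentence.split()  # simplification: one token per word
    examples = []
    for i in range(1, len(words)):
        features = words[:i]   # everything before position i
        label = words[i]       # the word the model should predict
        examples.append((features, label))
    return examples


# Hypothetical sentence, chosen only for illustration.
sentence = "a robot must obey the orders given it"
examples = make_training_examples(sentence)

# Show the first three examples generated from the sentence.
for features, label in examples[:3]:
    print(" ".join(features), "->", label)
```

An eight-word sentence gives seven examples this way, which hints at how a large dataset can produce a very large number of training examples.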

The model is presented with an example. We only show it the features and ask it to predict the next word.

Repeat millions of times

Now let’s look at these same steps with a bit more detail.

GPT3 actually generates output one token at a time (let’s assume a token is a word for now).
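Generating one token at a time can be sketched as a loop in which each new token is appended to the input before the next prediction. The function `toy_next_token` below is a hypothetical stand-in for the model (a fixed lookup table, not a neural network), used only to show the shape of the loop.

```python
def toy_next_token(tokens):
    """Hypothetical stand-in for the model: looks up the last token
    in a small fixed table and returns the token that follows it."""
    table = {"robot": "must", "must": "obey", "obey": "orders"}
    return table.get(tokens[-1], "<end>")


def generate(prompt, max_new_tokens=5):
    """Generate text one token at a time, feeding each output back in."""
    tokens = prompt.split()  # simplification: one token per word
    for _ in range(max_new_tokens):
        next_token = toy_next_token(tokens)
        if next_token == "<end>":
            break
        tokens.append(next_token)  # the new token becomes part of the next input
    return " ".join(tokens)


print(generate("a robot"))
```

The important part is the feedback: the output of one step is part of the input to the next step.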

Please note: This is a description of how GPT-3 works and not a discussion of what is novel about it (which is mainly the ridiculously large scale). The architecture is a transformer decoder model based on this paper https://arxiv.org/pdf/1801.10198.pdf.

GPT3 is MASSIVE. It encodes what it learns from training in 175 billion numbers (called parameters). These numbers are used to calculate which token to generate at each run.

These numbers are part of hundreds of matrices inside the model. Prediction is mostly a lot of matrix multiplication.

In my Intro to AI on YouTube, I showed a simple ML model with one parameter. It is a good starting point for unpacking this 175B monstrosity.
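To make "one parameter" concrete, here is a minimal sketch of a model with a single number `w` that predicts `y = w * x`, trained by nudging `w` to reduce its error on each example. This is not necessarily the model from the video; the data, learning rate, and function name are assumptions chosen for illustration.

```python
def train_one_parameter(data, learning_rate=0.01, steps=1000):
    """Fit y = w * x with plain gradient descent on squared error."""
    w = 0.0  # the model's single parameter, starting from an arbitrary value
    for _ in range(steps):
        for x, y in data:
            prediction = w * x
            error = prediction - y
            # Gradient of 0.5 * error**2 with respect to w is error * x.
            w -= learning_rate * error * x
    return w


# Made-up data where the true relationship is y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train_one_parameter(data)
print(round(w, 3))
```

GPT3 follows the same idea of adjusting numbers to reduce prediction error, only with 175 billion parameters instead of one.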

High-level steps:

Convert the word to a vector (list of numbers) representing the word
Compute prediction
Convert resulting vector to word
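The three steps above can be sketched with a toy vocabulary. Everything here is made up for illustration: the two-dimensional vectors, the single 2x2 weight matrix, and the rule of picking the vocabulary word whose vector best matches the result. The real model uses far larger vectors and many more matrices.

```python
# Toy vocabulary with made-up 2-dimensional word vectors.
vocab = ["robot", "must", "obey"]
embeddings = {
    "robot": [1.0, 0.0],
    "must": [0.0, 1.0],
    "obey": [-1.0, 0.0],
}

# A single made-up weight matrix standing in for the whole model.
weights = [
    [0.0, -1.0],
    [1.0, 0.0],
]


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]


def predict_next_word(word):
    vector = embeddings[word]          # step 1: word -> vector
    result = matvec(weights, vector)   # step 2: compute prediction
    # Step 3: vector -> word, by scoring every vocabulary word
    # against the result and taking the best match.
    scores = {w: dot(embeddings[w], result) for w in vocab}
    return max(scores, key=scores.get)


print(predict_next_word("robot"))
print(predict_next_word("must"))
```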

The important calculations of GPT3 occur inside its stack of 96 transformer decoder layers.

See all these layers? This is the “depth” in “deep learning”.
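As a back-of-the-envelope check, the 96 layers can be tied to the 175 billion parameter figure. The hidden width of 12288 and the vocabulary size of 50257 are not stated in this post; they are assumptions taken from the published GPT-3 configuration. The estimate also ignores biases, layer normalization, and position embeddings, so it is only approximate.

```python
n_layers = 96        # from the text above
d_model = 12288      # assumption: hidden width reported for the largest GPT-3
vocab_size = 50257   # assumption: size of the token vocabulary

# Self-attention: query, key, value, and output projections,
# each a d_model x d_model matrix.
attention_params = 4 * d_model * d_model

# Feed-forward block: two matrices, d_model x 4*d_model and back.
feedforward_params = 2 * d_model * (4 * d_model)

per_layer = attention_params + feedforward_params
embedding_params = vocab_size * d_model

total = n_layers * per_layer + embedding_params
print(f"{total / 1e9:.1f} billion parameters (approximate)")
```

Under these assumptions the estimate lands close to 175 billion, and almost all of it sits in the weight matrices of the decoder layers.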

You can see a detailed explanation of everything inside the decoder in my blog post The Illustrated GPT2.

The difference with GPT3 is the alternating dense and sparse self-attention layers.

My assumption is that the priming examples and the description are appended as input, with specific tokens separating the examples and the results, and then fed into the model.
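That assumption can be sketched as simple string assembly. The separator `###`, the `=>` marker, and the function name `build_few_shot_prompt` are hypothetical choices for illustration; the actual tokens used to separate examples are not specified here.

```python
def build_few_shot_prompt(description, examples, query,
                          separator="\n###\n", arrow=" =>"):
    """Join a task description, priming examples, and a new query
    into one input string. Separator and arrow are hypothetical."""
    parts = [description]
    for source, target in examples:
        parts.append(f"{source}{arrow} {target}")
    # The final line is left open so the model generates the result.
    parts.append(f"{query}{arrow}")
    return separator.join(parts)


prompt = build_few_shot_prompt(
    description="Translate English to French:",
    examples=[("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    query="cheese",
)
print(prompt)
```

In this setup nothing in the model changes; the examples only shape the input that the model continues.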

Fine-tuning actually updates the model’s weights to make the model better at a certain task.
