
Andrej Karpathy's in-depth explanation of large language model (LLM) technology (Part 1) - [Pretraining and Inference]

Last week, Andrej Karpathy uploaded a video on YouTube with an in-depth explanation of the technology behind large language models (LLMs), exploring the AI training architecture behind ChatGPT and related products. Video link: https://www.youtube.com/watch?v=7xTGNNLPyMI.

The video not only systematically introduces the training process of LLMs but also analyzes from a cognitive perspective the "thinking methods" of these models and how to maximize their utility in practical applications. Andrej was one of the founding members of OpenAI (in 2015) and later served as Senior Director of AI at Tesla (2017-2022). He is currently the founder of Eureka Labs, dedicated to building AI-native schools. The goal of this video is to popularize the latest AI technology so that more people can efficiently utilize this cutting-edge tool.


Introduction

This video is aimed at a general audience and introduces LLMs (large language models) like ChatGPT. They are powerful tools that, in some ways, seem almost magical, but they also have their own limitations. This video will explore the tasks that LLMs excel at and those they struggle with, delve into how they work under the hood, and explain how these models are built, as well as touch on the cognitive psychology implications of these tools.


Pre-training phase

Step one: Data processing

The first step in pre-training is to obtain and process internet data, leveraging a large amount of publicly available text resources. These data sources are extensive, containing a large number of high-quality and diverse documents, forming the basis for LLM training.

The running example is the FineWeb dataset [dataset link🔗: https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1], a filtered web dataset that is highly representative of what is used in practice. In its production form, the final data volume of FineWeb is approximately 44TB. Despite the vast size of the internet, the text that ultimately survives filtering is strictly limited, so this is not an extremely large dataset — today, this volume of data can even be stored on a single hard drive.

Much of the raw data comes from Common Crawl, an organization that crawls and indexes web pages. It works by starting with a few seed pages and continuously following the links in those pages, thus accumulating a large amount of internet data. Since the quality of this raw data varies widely, multiple rounds of screening and processing are required to ensure the quality of the final dataset.

Data filtering process

  1. URL filtering: this stage primarily excludes web pages that are unsuitable as training data sources based on a blocklist, including malware sites, spam sites, marketing sites, racist content, adult content, etc. This ensures the quality of the training data and prevents the model from learning inappropriate information.

  2. Text extraction: the crawled raw data is usually the HTML of web pages, which contains a large amount of information unrelated to the text, such as HTML tags, CSS styles, navigation menus, etc. In order to extract the core text content, it is necessary to parse the HTML structure, remove redundant page elements, and retain only the main body text.

  3. Language filtering: FineWeb uses a language classifier to detect the language of each web page and only retains pages whose dominant language meets a threshold. For instance, a page is kept only if more than 65% of its content is in English. This threshold is a design decision that different institutions can set on their own. For example, if a dataset filters out all Spanish web pages, the final trained model will likely be weak at Spanish. Therefore, different companies may adopt different strategies for multi-language support based on their needs. FineWeb focuses mainly on English, so a model trained on it will perform better in English but may be weaker in other languages. (A toy sketch of such a filter appears after this list.)

  4. Deduplication and PII removal: this stage removes duplicated content to avoid the same text being learned multiple times. In addition, personally identifiable information (PII, such as addresses, social security numbers, etc.) is detected and filtered out to prevent the model from learning sensitive data.
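To make the filtering stages concrete, here is a minimal sketch in Python of what such a document filter might look like. The blocklist, the length cutoff, and the crude English-word heuristic are illustrative assumptions; a real pipeline like FineWeb's uses curated URL lists and trained language classifiers rather than these toy stand-ins.

```python
# A minimal, illustrative sketch of document filtering (not the actual FineWeb code).
# BLOCKLIST and english_score() are hypothetical stand-ins for the URL lists and
# fastText-style language classifiers used in real pipelines.

BLOCKLIST = {"malware.example", "spam.example"}   # hypothetical blocked domains

def english_score(text: str) -> float:
    """Rough stand-in for a language classifier: share of very common English words."""
    common = {"the", "and", "of", "to", "in", "is", "that", "for"}
    words = text.lower().split()
    return sum(w in common for w in words) / max(len(words), 1)

def keep_document(url: str, text: str) -> bool:
    domain = url.split("/")[2] if "://" in url else url   # crude hostname extraction
    if domain in BLOCKLIST:             # stage 1: URL / blocklist filtering
        return False
    if len(text.strip()) < 200:         # drop pages with almost no extracted text
        return False
    return english_score(text) > 0.05   # stage 3: keep predominantly-English pages
```

A production pipeline would apply these checks to billions of pages, followed by deduplication and PII scrubbing as described above.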

Final data example

On the Hugging Face website, anyone can download the FineWeb dataset and view the final data samples.

For example:

  • A news report about a tornado in 2012
  • A medical article introducing the function of the adrenal glands

These texts, after screening, represent high-quality content from different categories on the Internet.

As noted above, the final data volume processed by FineWeb is approximately 44TB. To give an intuitive sense of this scale, it can be viewed as a massive collection of web pages that have been cleaned, filtered, and processed to provide a strong pre-training data foundation.

Step two: Text representation and tokenization

Neural networks operate on one-dimensional sequences of data, and require the input data to be composed of a finite set of symbols. Therefore, we need to define these symbols and convert the text into a one-dimensional sequence of them.

From text to binary representation

Although we see two-dimensionally arranged text on the computer screen, the computer ultimately stores that text as a one-dimensional stream of bits, that is, 0s and 1s. For example, if we use UTF-8 encoding to convert the text, the computer stores the corresponding binary data for each character.

A raw bit sequence uses only two symbols but is extremely long; we would prefer a larger symbol set and shorter sequences. In practice, we trade sequence length against vocabulary size to find a balance between them.

From binary to bytes

Grouping the bits into bytes of 8 bits each gives 256 different combinations, so each byte can represent 256 different symbols (i.e., values between 0 and 255).

Instead of working with individual bits, we can therefore treat each byte as a symbol and represent text as sequences composed of them.

For example:

Hello → [72, 101, 108, 108, 111]

Thus, compared with the raw bit sequence, the length of the text is reduced by a factor of 8, while the number of distinct symbols increases from 2 to 256.
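To make this concrete, the conversion from text to byte values can be done with Python's built-in UTF-8 encoder:

```python
text = "Hello"
byte_values = list(text.encode("utf-8"))   # UTF-8 encoding of each character
print(byte_values)                         # [72, 101, 108, 108, 111]

# Each value lies between 0 and 255, so the "vocabulary" at this stage has 256 symbols,
# and the sequence is 8x shorter than the equivalent bit string.
decoded = bytes(byte_values).decode("utf-8")
print(decoded)                             # "Hello" -- the conversion is lossless
```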

Further compression: Byte Pair Encoding (BPE)

On top of the byte representation, the Byte Pair Encoding (BPE) algorithm is applied in order to further reduce the sequence length while increasing the size of the symbol set.

The basic principle of BPE

  1. Scan the sequence and find the pair of symbols that most frequently appear next to each other (for example: byte 116 is often followed by byte 32).

  2. Treat this high-frequency combination as a new symbol and assign it a unique ID (such as 256).
  3. Replace the original symbol pair with this new symbol to reduce the sequence length.
  4. Repeat this process until the preset upper limit of the number of symbols is reached.

Each symbol in the resulting vocabulary is called a token. GPT-4, for example, uses a vocabulary of roughly 100,000 tokens.
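The merge step can be sketched in a few lines of Python. This is a simplified illustration of the core BPE idea (find the most frequent adjacent pair, replace it with a new symbol), not a production tokenizer:

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count every adjacent pair of symbols and return the most common one."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with the single new symbol `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("hello hello world".encode("utf-8"))  # start from raw bytes (symbols 0-255)
pair = most_frequent_pair(ids)                   # e.g., the byte pair for "he"
ids = merge(ids, pair, 256)                      # the first new symbol gets ID 256
# Repeating this merge loop grows the vocabulary (GPT-4 uses ~100,000 symbols)
# while shrinking the sequence length.
```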

Tokenization process

Tokenization is the process of converting raw text into a sequence of tokens. For example, in the GPT-4 tokenizer:

"Hello world" → ["Hello"" world"]
  • "Hello" is encoded as Token ID
  • " world" is encoded as Token ID
  • , for example:
    •  might be encoded into 

    • is

Explore Tokenization

You can use an online tool such as Tiktokenizer (https://tiktokenizer.vercel.app) to explore the tokenization method of GPT-4:

  1. Select cl100k_base as the tokenizer (the base token vocabulary used by GPT-4).
  2. Type in any text and see how many tokens it is broken into, along with their corresponding IDs.
  3. Try different inputs, such as adding extra spaces or changing case, to observe changes in the tokenization results.

Running the tokenizer over the entire FineWeb text turns the dataset into one long sequence of roughly 15 trillion tokens. These tokens can be considered the smallest units of text; the numbers themselves have no inherent meaning. Each token is simply a unique ID, similar to an "atom of text."
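For programmatic exploration, the open-source tiktoken library exposes the same cl100k_base vocabulary (this assumes the package is installed, e.g. via pip install tiktoken):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the GPT-4 base tokenizer

ids = enc.encode("Hello world")
print(ids)                    # a short list of integer token IDs
print(enc.decode(ids))        # "Hello world" -- decoding is lossless

# Small surface changes produce different token sequences:
print(enc.encode("Hello  world"))   # extra space  -> different tokens
print(enc.encode("hello world"))    # lowercase    -> different tokens
```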

Step three: Neural network training

The training objective of the neural network

The training objective is language modeling: given a sequence of tokens from the dataset, predict the token that comes next.

  1. Context window

  • Training examples are formed by taking windows of consecutive tokens from the dataset and feeding them to the network as input.
  • The size of these windows can vary, typically ranging from 4,000 to 16,000 tokens. For example, GPT-4 might use 8,000 tokens as its context window.
  • In the running example below, a short sequence of just four tokens is used as the context window:

    [91, 860, 287, 11579]

  2. Input of the neural network

  • The input is the sequence of tokens in the context window, up to the maximum length (e.g., 8,000).
  • In the example, that is the four tokens 91, 860, 287, 11579.

  3. Output of the neural network

  • The goal of the neural network is to predict which token comes next after the given context.
  • The output is a probability distribution over every token in the vocabulary (roughly 100,000 possibilities for GPT-4's tokenizer).
  • For example: given the input [91, 860, 287, 11579], the network assigns a probability to each candidate next token; the token that actually follows in the training data is the correct answer.

    Error calculation and model updating

At the start of training, the network's parameters are initialized randomly, so the predicted probabilities are essentially random. To improve the accuracy of predictions, we adjust the network weights using the following method:

  1. Calculate the error

  • Since we have true labels (i.e., the actual next token in the data), we can calculate the model's error. For example, if the correct next token was assigned only a small probability, the loss is large.
  • Training should increase the probability assigned to the correct token, while at the same time decreasing the probabilities assigned to incorrect tokens.

  2. Backpropagation

  • Backpropagation calculates how to adjust the weights of the neural network so that its predictions better match the statistical patterns of the real data.
  • For example, after one training iteration, the probability assigned to the correct next token rises slightly.
  • This process iterates continuously, enabling the model to more accurately capture the relationships between tokens. (A minimal sketch of one such training step follows this list.)

  3. Parallel computing

  • During training, the neural network does not handle only one token window at a time; it processes large batches of windows in parallel.
  • Every token position contributes to training, updating the model weights so that the model moves closer to the statistical relationships between tokens in the dataset.
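Putting the pieces together, the forward pass, loss calculation against the true next tokens, and backpropagation step can be sketched in a few lines of PyTorch. The tiny embedding-plus-linear `model` below is a stand-in for the real Transformer, and all sizes are toy values chosen for illustration:

```python
import torch
import torch.nn.functional as F

vocab_size, context_len, batch_size = 50_000, 8, 4   # toy sizes for illustration

# Stand-in "language model": an embedding table followed by a linear layer.
# A real LLM would have a deep Transformer between these two pieces.
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 64),
    torch.nn.Linear(64, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# A batch of token windows: position t of `inputs` should predict position t of `targets`.
tokens = torch.randint(0, vocab_size, (batch_size, context_len + 1))
inputs, targets = tokens[:, :-1], tokens[:, 1:]

logits = model(inputs)                                  # (batch, context, vocab) scores
loss = F.cross_entropy(logits.reshape(-1, vocab_size),  # compare predictions against the
                       targets.reshape(-1))             # actual next tokens (the labels)
loss.backward()                                         # backpropagation computes the updates
optimizer.step()                                        # nudge the weights
optimizer.zero_grad()
```

In real training this step runs millions of times over huge batches sampled from the tokenized dataset.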

In short, training tunes the network so that, given a context of tokens, it can predict the most likely next token. Below, we delve deeper into the internal structure of the neural network, as well as how the Transformer architecture efficiently carries out this process.

    Internal structure of neural networks

    Input of neural networks

The input is a sequence of tokens up to the context-window length (e.g., 8,000 tokens). These inputs are converted into numerical form and serve as the basis for the network's calculations. In principle, the context window could be made arbitrarily large, but the computational cost of processing it would become extremely high.

    For example, suppose our input Token sequence is:

    [91, 860, 287, 11579] → Predict the next Token

These token IDs are mixed with the network's parameters to perform mathematical calculations and generate predicted values.

    Parameters (Weights) of the neural network

    In neural networks, there are a large number of parameters (weights) that determine how the model processes input data and makes predictions:

    • Modern LLMs (such as GPT-4) usually contain billions or even hundreds of billions of parameters.
    • At the start of training, these parameters are set to random values, so the initial predictions of the neural network are completely random.
    • Through iterative training, these parameters are gradually adjusted so that the model's output better conforms to the statistical patterns in the data.

    These parameters can be thought of as knobs: adjusting them changes how the neural network makes predictions. The core goal of training is to find an optimal setting of these knobs such that the output of the neural network matches the statistical patterns of the training data.

    Mathematical calculations of neural networks

    The internal computations of a large neural network may look complicated. But fundamentally, they are composed of basic mathematical operations (addition, multiplication, exponentiation, etc.). For example, a single simple layer can be expressed as:

    output = f(weights · inputs + bias)

    Among them:

    • inputs: numerical representation of the input tokens.
    • weights, bias: parameters of the neural network.
    • f: a simple non-linear function applied to the result.

    The core objective of the entire neural network is to continuously optimize these parameters so that they can better fit the statistical patterns of the training data.
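As a toy illustration of "inputs mixed with weights", here is a single layer of the expression above computed by hand in Python; the numbers and sizes are arbitrary, and a real model simply stacks vastly more of these same operations:

```python
import math

x = [0.2, -1.0, 0.5]                  # numerical representation of the input tokens
w = [[0.1, 0.4, -0.2],                # parameters (weights) of a single layer
     [0.7, -0.3, 0.9]]
b = [0.0, 0.1]                        # bias parameters

# One layer: multiply, add, then apply a simple non-linearity (here tanh).
h = [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + bi)
     for row, bi in zip(w, b)]
print(h)   # two output values; training adjusts w and b so the outputs fit the data
```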

You can also explore an interactive visualization of these networks at this website: https://bbycroft.net/llm

    Transformer architecture

The neural network architecture used by modern LLMs is the Transformer. This architecture is specifically designed for processing large-scale text data, capable of efficiently learning the relationships between tokens and generating new text.

    The core computational flow of the Transformer can be divided into the following stages:

    1. Input token sequence

    • The model receives the token IDs in the context window as input.
    • These IDs are converted into numerical form, as the basis for the network's computations.

    2. Token embedding

    • Each token ID is mapped to a vector of numbers; this vector is the representation of the token inside the neural network.
    • The geometry of these vectors captures relationships between different tokens.

    3. Computation through Transformer layers

    Information flows through the neural network and passes through multiple computational layers, each of which has a different function:

    • Layer normalization: standardizes the data to ensure stable training.
    • Linear (matrix-multiplication) layers: transform the data to extract features.
    • Self-attention: calculates the relationships between tokens so the model can understand context (a minimal single-head sketch appears after this list).
    • Feed-forward (MLP) layers: further process the information to enhance the model's expressive power.

    The attention mechanism in particular allows the model to:

    • draw on information from anywhere in the context window, not just the most recent token.
    • combine the relevant parts of the context when predicting, making the generated text more coherent.

    4. Output layer

    • The final layer converts the network's internal representation into a probability for every possible next token.
    • The model will select a token as the final output based on these probabilities.
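As one concrete example of these layers, a single head of self-attention can be written compactly in PyTorch. This is a bare-bones sketch with illustrative dimensions, no masking of future tokens, and only one head, so it is not a faithful reproduction of a production model:

```python
import torch
import torch.nn.functional as F

seq_len, d_model = 4, 16                       # 4 tokens, 16-dimensional embeddings
x = torch.randn(seq_len, d_model)              # embedded input tokens

# Learned projections that produce a query, key, and value vector for each token.
Wq, Wk, Wv = (torch.nn.Linear(d_model, d_model, bias=False) for _ in range(3))
q, k, v = Wq(x), Wk(x), Wv(x)

# Each token scores every other token, so information can flow across the window.
scores = q @ k.T / d_model ** 0.5              # (seq_len, seq_len) relevance scores
weights = F.softmax(scores, dim=-1)            # normalize scores into attention weights
out = weights @ v                              # mix values: context-aware representations
print(out.shape)                               # torch.Size([4, 16])
```

A full Transformer interleaves many such attention blocks with layer normalization and feed-forward layers, and adds a causal mask so each token only attends to tokens before it.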

    Neural networks vs. human brain

    Although the computation process of Transformer is sometimes metaphorically described as "artificial neurons" being "activated," it fundamentally differs from how the human brain operates:

    • The Transformer is a stateless mathematical function; each input is computed independently, without long-term memory beyond the current context window.
    • The human brain, by contrast, is a continuously adapting biological system that can store long-term information and perform advanced reasoning.
    • All of the model's computations are based on mathematical formulas and matrix operations, not biological learning and thinking like the brain.

    Neural networks, such as nano-GPT, are mathematical functions consisting of more than 80,000 fixed parameters that transform inputs into outputs. Adjusting these parameters affects the prediction results, and the goal of training is to find the optimal parameters so that predictions match the patterns in the training data.


    That wraps up the pre-training phase. Now, we move into another crucial phase of the LLM workflow:

    Inference

    Inference means that, given a pre-trained LLM, we can feed in a piece of text and the model will predict and generate the content that follows. This is also how models like ChatGPT operate in practical applications.

    The basic process of inference

    The inference process can be divided into the following steps:

    1. Input the initial token (prefix)

    • Generation starts from one or more initial tokens (also called a prefix), which is equivalent to the user's input prompt. For example, the single token 91.
    • This token is used as the starting point and fed into the neural network.

    2. The model calculates the probability distribution

    • The neural network calculates a probability distribution over all possible next tokens,
    • given the current context.

    3. Random sampling

    • If the token with the highest probability were always chosen, the generated text would appear very rigid.
    • By sampling from the probability distribution instead, the model can generate more diverse texts.
    • High-probability tokens are more likely to be chosen, but lower-probability tokens still have a chance of being selected.
    • Suppose the sampled token is 860; it becomes the next token.

    4. Cycle generation

    • The selected token 860 is appended to the sequence, serving as the new context: [91, 860].
    • The network is then run again on this longer context, another token is sampled, and the loop repeats to form a complete text output. (A minimal sketch of this loop follows the list.)
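The generation loop itself is short. Below is a hedged sketch in PyTorch, where `model` is assumed to be a trained network that maps a batch of token sequences to next-token logits; the prefix token 91 follows the running example above:

```python
import torch
import torch.nn.functional as F

def generate(model, prefix, num_new_tokens, temperature=1.0):
    """Autoregressive sampling loop: predict, sample, append, repeat."""
    tokens = list(prefix)
    for _ in range(num_new_tokens):
        context = torch.tensor(tokens).unsqueeze(0)          # shape (1, current_length)
        logits = model(context)[0, -1] / temperature         # scores for the next token
        probs = F.softmax(logits, dim=-1)                    # probability distribution
        next_token = torch.multinomial(probs, num_samples=1).item()  # random sampling
        tokens.append(next_token)                            # new, longer context
    return tokens

# e.g. generate(model, prefix=[91], num_new_tokens=3) might return [91, 860, 287, 11579],
# or a different continuation on another run, because sampling is stochastic.
```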

    Randomness in inference

    Because the next token is chosen by random sampling:

    • The same input may result in different outputs.
    • The output follows the statistical patterns of the training data, but it will generally not reproduce the training text exactly.
    • The generated content may be a "remix" of the training data rather than a verbatim reproduction.

    For example:

    • In the training data, "| viewing single" may frequently appear before " article".
    • During inference, the model might generate " article" after "| viewing single", or it might choose other plausible related tokens.
    • Either way, the continuation is statistically similar to, but usually not identical to, the training data.

    Inference vs Training

    1. Training phase

    • The goal is to adjust the model's parameters so that it can better predict the next token.
    • Training runs over the entire dataset and performs a large number of matrix calculations and gradient-descent optimization steps.
    • Once training is finished, the parameters are fixed and will no longer be updated.

    2. Inference phase

    • The goal is to generate new text using the fixed, already-trained parameters.
    • Only forward runs of the neural network are required, without involving any parameter updates.
    • The computation is much faster than training, but still requires substantial computational resources.

    When you use a product like ChatGPT, only inference is happening: the deployed model takes your prompt and repeatedly predicts the next token to generate possible answers.