Andrej Karpathy Deep Dive on LLM (Part 3): Post-Training

The training of LLMs (Large Language Models) is divided into two main stages:

  1. Pretraining: Training the Base Model.
  2. Post-training: Fine-tuning the Base Model into an interactive Assistant Model.

1. Pretraining

  • The core goal of the pretraining stage is to learn the statistical patterns of internet text, which is essentially a Token prediction task.

    • By training the model on massive amounts of internet data, it learns how to predict Token sequences (a toy sketch of this generation loop follows this section's list).
    • The final product of training is the Base Model, which is merely a Token generator, similar to a highly advanced autocompletion system.
  • Deficiencies of the Base Model

    • It does not understand questions, but only generates text based on the statistical patterns of the training data.
    • It cannot reject inappropriate questions, and may output unsafe or inaccurate content.
    • It lacks interactivity and cannot engage in multi-turn conversations or perform complex tasks.
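To make the "Token generator" idea concrete, here is a toy sketch (not from the talk) in which a bigram count table stands in for the trained neural network. A real Base Model learns far richer statistics, but the generation loop is the same: predict the next Token, append it, repeat.

```python
import random
from collections import defaultdict

# Toy stand-in for pretraining: count which token follows which in a corpus.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1  # learn the statistical patterns of the text

def next_token(prev):
    """Sample the next token in proportion to how often it followed `prev`."""
    tokens, weights = zip(*counts[prev].items())
    return random.choices(tokens, weights=weights)[0]

# Autocomplete-style generation: keep predicting the next token.
token = "the"
out = [token]
for _ in range(6):
    token = next_token(token)
    out.append(token)
print(" ".join(out))
```

The model never "understands" the text; it only continues sequences in a statistically plausible way, which is exactly the Base Model's deficiency described above.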

2. Post-training

To transform the Base Model into a true AI assistant, post-training is required, which is much less costly than pretraining. Post-training mainly includes:

  1. Instruction Tuning: Teaching the model to answer questions rather than just predicting Tokens.
  2. Reward Model Training: Defining what constitutes "good" and "bad" responses (a minimal sketch of this objective follows this list).
  3. Reinforcement Learning from Human Feedback (RLHF): Using human feedback to optimize the model, making its answers more aligned with human expectations.
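As a hedged illustration of step 2: reward models are commonly trained with a pairwise preference loss (the standard RLHF recipe, assumed here since the section does not spell out the objective). The score of the response labelers preferred should exceed the score of the one they rejected. The `reward_model` below is a stand-in module, not a real architecture:

```python
import torch
import torch.nn.functional as F

# Stand-in reward model: maps a response's feature vector to a scalar score.
reward_model = torch.nn.Linear(16, 1)

chosen = torch.randn(4, 16)    # features of responses labelers preferred
rejected = torch.randn(4, 16)  # features of responses labelers rejected

r_chosen = reward_model(chosen).squeeze(-1)
r_rejected = reward_model(rejected).squeeze(-1)

# "Good" should score higher than "bad": maximize sigmoid(r_chosen - r_rejected).
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()  # the gradients encode what counts as a "good" response
print(loss.item())
```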

3. Key Data for Training AI Assistants: Dialogue Data

  • Sources of Dialogue Data

    • Human Labeling: Hiring human labelers to write ideal responses, training the model to mimic these responses.
    • Existing Datasets: Fine-tuning using data from forums, social media, etc.
    • Synthetic Data: Generating high-quality responses using existing LLMs, then having humans review them.
  • Dialogue Examples

    • The training set may contain thousands of such dialogues, often very long and covering a wide range of topics.

  • Training Process
    • Collect dialogues labeled by humans, with the humans providing the ideal answers first.

    • Serialize the dialogue as a sequence of Tokens, where <im_start> is a new token that never appeared during pretraining; it is created only in the post-training phase to mark whether the current turn belongs to the User or the Assistant (see the sketch after this list).

    • The neural network trains on this dialogue data, learning to mimic these responses.
    • The model quickly adjusts to the response style of the human labelers.
    • During the inference stage, the model continues to predict the most likely next token.
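As a concrete illustration of that serialization, here is roughly how a dialogue is flattened into a single token stream using ChatML-style special tokens. The exact token names vary by model family, so treat <|im_start|>/<|im_end|> as representative rather than universal:

```python
# Sketch of ChatML-style dialogue serialization (token names are illustrative;
# real chat templates differ by model family).
dialogue = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 = 4."},
]

def render(messages):
    """Flatten a dialogue into the single text stream the model trains on."""
    parts = []
    for msg in messages:
        # <|im_start|> / <|im_end|> are special tokens added to the vocabulary
        # during post-training; they tell the model who is speaking.
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    return "".join(parts)

print(render(dialogue))
# At inference time, the prompt ends with "<|im_start|>assistant\n" and the
# model keeps predicting next tokens until it emits <|im_end|>.
```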

4. Training Methods

  • Differences from Pretraining

    • Smaller Dataset Size (pretraining uses tens of trillions of Tokens, while post-training may use only millions of dialogues).
    • Shorter Training Time (pretraining may take 3 months, while post-training usually takes only hours to days).
    • Different Goals (pretraining learns Token statistical patterns, while post-training learns interactive behavior).

5. Instruction Tuning

In 2022, OpenAI released the InstructGPT research paper, detailing for the first time how to fine-tune a Base Model into an AI assistant (Assistant Model) that better meets user needs through dialogue-data tuning.

Human-labeled data

During the InstructGPT training process, OpenAI hired human labelers, whose tasks were:

  • Designing user inputs (Prompts), such as:
    - How to rekindle career passion?
    - Recommend 10 science fiction novels.
    - Translate this sentence into Spanish.
  • Providing ideal assistant responses to ensure the quality of the model's replies.

Labeling Guidelines

To standardize responses, labelers must follow detailed labeling guidelines (usually hundreds of pages), with core requirements:

  1. Helpful: Responses should meet user needs.
  2. Truthful: Provide accurate information, avoiding hallucinations.
  3. Harmless: Avoid outputting unsafe or harmful content.

Through these guidelines, companies indirectly program the behavior of AI assistants, ensuring their responses meet expectations.

6. Existing Open-Source Attempts

Although OpenAI has not made the InstructGPT training data public, some open-source projects (such as Open Assistant) have attempted to replicate similar processes:

  • Users contribute example dialogues, such as:
    Q: Please explain the concept of "monopoly" in economics and give an example.
    A: Monopoly refers to a market with only one supplier... (specific answer)
  • Using crowdsourcing to create high-quality training data and fine-tuning the model.

7. The Role of AI-Generated Data

Over the past few years, since the release of the InstructGPT paper, the way AI assistants are trained has changed significantly. Today, human labelers no longer manually write all the training data; instead, LLMs are used to generate part of the data, which is then reviewed and optimized by humans.

AI-assisted data creation

  • Past: Human labelers wrote dialogue data from scratch to provide training examples for AI assistants.
  • Present: LLMs generate responses first, and humans then review and modify them, improving efficiency.

For example, the UltraChat dataset is primarily AI-generated but edited by humans to ensure quality. This approach significantly reduces training costs and expands the coverage of the dataset.
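A minimal sketch of this generate-then-review pipeline is shown below; `generate_response` is a hypothetical placeholder for whatever LLM call is actually used, not a real API:

```python
# Hypothetical pipeline sketch: LLM-generated first drafts, human review after.
def generate_response(prompt: str) -> str:
    # Placeholder: in practice this would call an LLM endpoint.
    return f"[draft answer to: {prompt}]"

prompts = [
    "Explain the concept of 'monopoly' in economics and give an example.",
    "Recommend 10 science fiction novels.",
]

review_queue = []
for prompt in prompts:
    draft = generate_response(prompt)  # the AI writes the first draft
    review_queue.append({
        "prompt": prompt,
        "draft": draft,
        "status": "needs_human_review",  # humans edit/approve each draft
    })

# Only human-approved (prompt, response) pairs enter the fine-tuning set.
print(review_queue)
```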

Currently, these datasets contain millions of dialogues. Most are synthetic, many are edited by humans, and together they cover a very diverse range of topics. During the Supervised Fine-Tuning (SFT) stage, mixtures of datasets from multiple sources ("SFT Mixtures") are used for fine-tuning to improve the model's performance across various tasks.
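One common SFT detail (an assumption here; the section does not spell it out) is that the cross-entropy loss is computed only on the Assistant's tokens, with the User's tokens masked out. A minimal PyTorch-style sketch:

```python
import torch
import torch.nn.functional as F

# Minimal SFT sketch: train only on the assistant's tokens by masking the
# user/prompt tokens with -100 (PyTorch's ignore_index for cross-entropy).
vocab_size = 1000
token_ids = torch.tensor([[5, 17, 42, 7, 99, 3]])  # user turn, then assistant turn
labels = token_ids.clone()
labels[:, :3] = -100  # first 3 tokens are the user's turn: no loss on them

logits = torch.randn(1, 6, vocab_size, requires_grad=True)  # stand-in model output

# Shift by one so each position predicts the *next* token, as in pretraining.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
    ignore_index=-100,
)
loss.backward()
print(loss.item())
```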

8. Statistical Simulation vs. True Intelligence

When interacting with AI, the model's responses are not based on real-time search, but rather on a statistical simulation of the human labelers in the training data. For example:

  • If the question appears in the training data, the AI may output a highly similar answer.
  • If the question is not in the training data, the AI will infer possible answers based on statistical patterns, but there may be errors.

Essentially, ChatGPT is not an AI that truly "understands"; it is an optimized "human labeler simulator," as if you were asking the kind of educated domain experts who would be hired to write answers in the relevant fields.

9. How AI Generates Responses

When a user inputs in ChatGPT:

Recommend the top five landmarks in Paris.

The AI does not perform a real-time search, but rather:

  1. It checks whether the question appears in the training data; if so, it may return a similar answer.
  2. It combines a large amount of information about Paris from pretraining with the post-training dataset.
  3. If the question is not in the training data, it infers based on statistical patterns, guessing (a toy sketch follows this list):
     • Frequently occurring landmarks are more likely to be recommended (e.g., Eiffel Tower, Louvre).
     • Infrequently occurring content may be ignored or answered incorrectly.
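To make "frequently occurring landmarks are more likely" concrete, here is a toy illustration with invented logit values showing how softmax sampling favors high-frequency completions:

```python
import math
import random

# Toy illustration (invented numbers): the model assigns higher logits to
# completions it saw often during training, so sampling favors them.
logits = {"Eiffel Tower": 4.0, "Louvre": 3.5, "Notre-Dame": 3.0,
          "Arc de Triomphe": 2.5, "Obscure Cafe": 0.1}

# Softmax turns logits into a probability distribution over completions.
z = sum(math.exp(v) for v in logits.values())
probs = {k: math.exp(v) / z for k, v in logits.items()}

for name, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {p:.2%}")

# Sampling: famous landmarks dominate; rare content is easily ignored.
pick = random.choices(list(probs), weights=list(probs.values()))[0]
print("sampled:", pick)
```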