Most of what we have dealt with so far falls into "verifiable domains," meaning that any answer in these domains can be easily checked against a standard answer. For example, if the standard answer is the number "3," we can simply check whether the model's answer is "3."

We can even use a mechanism called an "LLM judge": a large language model (LLM) scores the model's answers against the reference answer, and in practice this has proven reliable enough to automate the grading process.
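As a rough illustration, an LLM judge for a verifiable domain can be as simple as asking a grader model to compare the candidate answer with the reference. The sketch below is hypothetical: `call_llm` stands in for whatever LLM client is available, and the prompt format is purely illustrative, not taken from the text above.

```python
# Minimal sketch of "LLM judge" grading in a verifiable domain.
# Assumption: `call_llm` is a placeholder for your chat-completion client.

def build_judge_prompt(question: str, reference: str, answer: str) -> str:
    return (
        "You are a grader. Compare the candidate answer to the reference.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )

def grade(question: str, reference: str, answer: str, call_llm) -> bool:
    """Return True if the judge LLM marks the candidate answer correct."""
    verdict = call_llm(build_judge_prompt(question, reference, answer))
    return verdict.strip().upper().startswith("CORRECT")

# Usage (hypothetical client): grade("What is 1 + 2?", "3", model_answer, call_llm=my_client)
```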
However, the real challenge, and the motivation for RLHF, lies in areas where answers cannot be easily verified, known as "unverifiable domains": creative tasks such as writing a joke about pelicans, composing a poem, or summarizing a passage. Outputs like these cannot simply be compared against a single reference answer.

For example, we ask the model to create a joke about pelicans:

Why don't pelicans ever pay their own bills? Because they always "beak" (peek) and pass it on to others.

Clearly, this joke isn't very successful.
Although the model can generate a large number of jokes, evaluating their quality is a challenge. In theory, humans could check and score each one individually, but the reinforcement learning process often involves tens of thousands of iterations, each producing hundreds or even thousands of samples, far too many for humans to review one by one.
Therefore, Reinforcement Learning from Human Feedback (RLHF) was developed. With a limited amount of human feedback as guidance, the model learns to produce high-quality outputs in these unverifiable domains. This greatly reduces the human workload while improving the model's performance on complex tasks such as creative writing and dialogue generation.
How does RLHF work specifically?
If we had unlimited human time, we could in theory keep improving the model through direct human feedback. For example, we could run 1,000 updates, evaluate 1,000 prompts per update, and generate 1,000 answers per prompt, which would require humans to assess 1,000 × 1,000 × 1,000 = 1 billion jokes in total. This is clearly impractical.
To solve this problem, OpenAI (some of whose members later founded Anthropic) proposed the RLHF approach.

The core trick of this method lies in "indirect guidance": involving humans only to a limited extent, specifically by training an additional neural network called a "reward model" to simulate human scoring.

First, humans rank a small number of generated outputs (from best to worst) instead of scoring each one directly, since ranking is relatively easier. These rankings become the training data for the reward model.
Then the reward model takes a prompt (such as "write a joke about pelicans") and a generated output as input and produces a score between 0 and 1, where 0 means worst and 1 means best.
For example, if a person ranks five jokes, the reward model also scores those five jokes, and we define a loss function to align the reward model's scores with the human ranking. Wherever the scores disagree with the ranking, we update the reward model with supervised training to reduce the gap, gradually bringing its scores closer to human judgments.
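To make the alignment step concrete, here is a minimal PyTorch sketch of one common way to fit a reward model to pairwise preferences derived from human rankings. The embeddings, network size, and the pairwise loss are illustrative assumptions, not details from the text above.

```python
# Sketch of training a reward model on human rankings.
# Assumptions: fixed-size embeddings already exist for each (prompt, joke) pair,
# and the human ranking has been reduced to pairs ("joke A ranked above joke B").

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # Small scoring head; a real reward model sits on top of a full LLM backbone.
        self.net = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)          # raw, unbounded score

    def reward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.forward(x))   # squashed to (0, 1): worst to best

# Toy data: embeddings of pairs where the first element was ranked higher by a human.
embed_dim = 128
better = torch.randn(32, embed_dim)   # preferred jokes
worse = torch.randn(32, embed_dim)    # less-preferred jokes

model = RewardModel(embed_dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    # Pairwise ranking loss: push the preferred joke's score above the rejected one's.
    loss = -F.logsigmoid(model(better) - model(worse)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```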

Through this method, we can run a large number of automatic evaluations efficiently, greatly expanding the range of problems reinforcement learning can be applied to. The simulator is not perfect, but as long as its scores are statistically close enough to human judgments, the results in practice improve significantly.
What are the advantages of RLHF?
The main advantage of Reinforcement Learning from Human Feedback is:
It allows reinforcement learning to be applied in any domain, including unverifiable ones such as writing poetry, telling jokes, or summarizing content. The method has been shown to significantly improve model performance, possibly because of the "discriminator-generator gap": it is easier for humans to distinguish good outputs from bad ones than to produce good outputs themselves. By asking humans to do the simpler task (ranking rather than creating), we obtain more accurate and reliable feedback data, which in turn improves the model.
What are the limitations of RLHF?
Despite the obvious advantages of RLHF, there are also some significant limitations:
Simulation error: The reward model is only an approximate simulation of human feedback and may not perfectly reflect real human judgment; it can be misleading.

Adversarial samples: Because the reward model is a complex neural network, reinforcement learning may discover special inputs (adversarial samples) that receive high scores despite being meaningless. This situation is called "gaming the model."

Cannot be optimized indefinitely: Unlike verifiable domains (such as Go), RLHF cannot be optimized forever; otherwise the model exploits the reward model and produces absurd outputs. Results improve at first, then fall off a cliff, for example answering "the the the the the the the the" when asked to write a joke. In practice, optimization is stopped after a limited number of iterations, before the reward model is completely misled (see the sketch below).
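As a rough illustration of why RLHF runs are capped rather than run indefinitely, the sketch below limits the number of policy updates and periodically compares the reward model's proxy score against occasional human spot checks. Here `rl_update` and `human_spot_check` are hypothetical placeholders, not functions from any real library.

```python
# Sketch of a capped RLHF run with early stopping.
# Assumptions: `rl_update` performs one policy update against the reward model and
# returns the reward-model (proxy) score; `human_spot_check` returns an occasional
# human evaluation score. Both are hypothetical placeholders.

MAX_STEPS = 1_000     # hard cap: stop before the policy starts gaming the reward model
CHECK_EVERY = 100     # periodically compare against real human judgment

def train_rlhf(policy, reward_model, rl_update, human_spot_check):
    best_human_score = float("-inf")
    for step in range(MAX_STEPS):
        proxy_score = rl_update(policy, reward_model)   # proxy score keeps rising...
        if step % CHECK_EVERY == 0:
            human_score = human_spot_check(policy)      # ...but human judgment may not
            if human_score < best_human_score:
                # Proxy score and human judgment have diverged: a sign the policy is
                # exploiting the reward model, so stop early.
                break
            best_human_score = human_score
    return policy
```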
Therefore, although RLHF can effectively improve model performance, it is better suited to a limited amount of fine-tuning than to reinforcement learning that is optimized indefinitely.
The relationship between RLHF and RL
Although RLHF is closely related to traditional reinforcement learning (RL), there are essential differences. Simply put, RLHF is a form of reinforcement learning, but it is not traditional RL, because it cannot be optimized without limit.
In traditional reinforcement learning settings such as Go, wins and losses can be determined exactly, so we effectively have a perfect simulator, and RL can keep optimizing until it eventually surpasses human performance. In RLHF, by contrast, the reward model serves as a simulator of human feedback. Fundamentally, the reward model is just a complex neural network: it mimics human scoring imperfectly and is therefore susceptible to being deceived (i.e., to adversarial samples).
Therefore, although RLHF can effectively improve model performance through indirect human feedback, it is more akin to limited "fine-tuning" than to reinforcement learning that can be optimized indefinitely. It can give the model a modest boost, but there is no "magic" of unlimited improvement from simply adding computational resources and optimizing longer.