This chapter discusses the native computational capabilities of LLMs in problem-solving; working through a few tricky problems makes it quite enlightening to see how models "think".
Related chapters:
Andrej Karpathy's In-Depth Explanation of LLM (Part 3): Post-Training
LLM's "Self-Awareness" - Andrej Karpathy Deep Dive on LLM (Part 5)
Example 1
Emily bought 3 apples and 2 oranges. Each orange costs $2, and the total cost is $13. What is the price of each apple?
This is a very simple math problem. Now, suppose the model gives two different answers, shown on the left and right respectively. Both arrive at the correct result, $3. However, one of the answers is clearly the better assistant response, while the other is an extremely poor one.

Data labelers need to select one as a training example; in this case, one answer should be considered very poor and the other acceptable. If the wrong one is used for training, the model may end up performing very badly at mathematical calculation, with undesirable consequences.
The key to this question lies in recognizing, and keeping in mind, that during both training and inference the model operates on a one-dimensional token sequence that runs from left to right. The sequence grows from left to right, and each time the next token is generated, all of the existing tokens are fed into the neural network, which then computes the probability distribution of the next token.
During this computation, the input tokens pass through the network, which performs a series of calculations across many neurons and ultimately outputs the probability distribution of the next token.
An important point to understand is that, mathematically speaking, the number of layers involved in this computation is finite. The example below has 3 layers; a modern state-of-the-art network may have on the order of 100 layers. Either way, the number of computation layers used to get from the preceding token sequence to the probability of the next token is always finite.

The computational cost of each token is roughly fixed. This is not entirely accurate: as the number of input tokens grows, the cost of the forward pass also grows, but only modestly. A good way to think about it is therefore: for each token in the sequence, the model spends roughly a fixed amount of computation.
And that amount of computation cannot be very large, because the number of layers in the model is limited. Seen vertically, there are not many computation layers, so a single forward pass cannot perform arbitrarily complex computation. This means the model's reasoning and calculation must be distributed across multiple tokens, because any single token consumes only a limited amount of compute.
We cannot expect the model to do an enormous amount of work while generating one particular token, since per-token computation is fixed and capped by the number of layers. Instead, the computational burden has to be spread over many tokens, letting the model reason step by step rather than expecting all of the work to be completed on a single token. This is precisely why, in the example above, one answer turns out to be so much worse than the other.
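To make the "roughly fixed compute per token" idea concrete, here is a deliberately tiny sketch in plain NumPy (a toy stand-in, not a real transformer): every emitted token costs one forward pass through the same finite stack of layers, so a harder question does not buy the model more computation for any single token.

```python
import numpy as np

# Toy stand-in for a language model: a fixed stack of layers (real models: ~100 layers).
# This is NOT a real transformer, just an illustration of the cost structure.
rng = np.random.default_rng(0)
N_LAYERS, D, VOCAB = 3, 8, 50
layers = [rng.normal(size=(D, D)) for _ in range(N_LAYERS)]
unembed = rng.normal(size=(D, VOCAB))

def next_token_probs(context_embeddings):
    """One forward pass: exactly N_LAYERS matrix multiplies, no matter how hard the question is."""
    h = context_embeddings.mean(axis=0)   # crude stand-in for attending over the context
    for W in layers:
        h = np.tanh(h @ W)
    logits = h @ unembed
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Generation is a loop of such passes: the total "thinking" a model can do
# scales with the number of tokens it emits, not with the difficulty of any one token.
context = rng.normal(size=(4, D))                              # embeddings of the prompt tokens
for _ in range(5):
    token_id = int(next_token_probs(context).argmax())
    context = np.vstack([context, rng.normal(size=(1, D))])    # append the new token's embedding
```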

Imagine that the model must generate tokens one by one, from left to right. It has to output "The answer is", then "$", and then, at this critical position, it must compress the entire mathematical calculation into a single token (shown as 18 in the figure below) and directly emit the correct answer "3".

The problem is that once the model has output the token "3", the subsequent tokens are merely an explanation of the answer; they are not the actual calculation but a post hoc addition. In other words, the answer was already determined the moment the model generated "3", and the text that follows only elaborates on an answer that is already fixed; it cannot affect how the model actually did the math.
If the training data teaches the model to output the final answer directly, without any reasoning process, then it is effectively being trained to guess the answer rather than to calculate it. That cannot work, because each token can only use a limited amount of computation.
This is why the answer on the right is significantly better: it distributes the computation instead of compressing everything into a single token. In the answer on the right, the model derives the final answer step by step:
The total price of the oranges is $4.
Total price $13 - orange price $4 = $9.
$9 divided by 3 apples means each apple costs $3.
This step-by-step approach lets each token perform only a relatively simple calculation, rather than completing all of the reasoning at once. It respects the model's computational limits and also makes it easier for the model to reach the correct answer at inference time.
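As a quick sanity check on that arithmetic, here are the same steps in plain Python; each line is the kind of small, easy step that a single token's worth of computation can handle.

```python
orange_total = 2 * 2                 # 2 oranges at $2 each -> $4
apple_total = 13 - orange_total      # $13 total minus $4 of oranges -> $9
price_per_apple = apple_total / 3    # $9 spread over 3 apples -> $3
print(price_per_apple)               # 3.0
```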
If the model is trained to produce everything in one go, it will most likely be unable to perform complex calculations at inference time, because the amount of computation each token can perform is limited. The correct training approach is therefore to have the model spread its reasoning across multiple tokens; only then can it calculate correctly at inference time.
This matters when designing a prompt, but in most cases users do not need to think about it explicitly, because OpenAI's annotators have already optimized for it when labeling data. As a result, ChatGPT works its way toward the answer rather than stating it outright: it first defines variables, writes down the equation, and solves step by step. This is not done to explain things to humans but to help the model reason for itself; if the model could not generate these intermediate steps, it would not be able to arrive at the correct answer "3".
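For illustration only (the article itself does not run any code here), the "define variables, write the equation, solve step by step" pattern looks like this when expressed with sympy:

```python
from sympy import Eq, solve, symbols

x = symbols("x")                   # let x be the price of one apple
equation = Eq(3 * x + 2 * 2, 13)   # 3 apples at $x plus 2 oranges at $2 cost $13 in total
print(solve(equation, x))          # [3]
```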
Example 2
What if you ask the LLM for the answer directly, without letting it reason? For instance, give it the same math problem but require it to answer within a single token, that is, to output the answer immediately without any extra computation.
On this simple question, the model succeeded in giving the correct answer in one forward pass. Strictly speaking, the answer consists of two tokens, because the dollar sign "$" is its own token, so the requirement of a single token was not fully met; still, the correct answer was derived within a single forward pass.
However, this only works for small numbers. If we raise the difficulty, say, Emily bought 23 apples and 177 oranges, the computation becomes harder. Asked again to respond within a single token, the model answers "5", which is incorrect.

Why? Because once the problem becomes complex, the model can no longer complete all of the calculation in a single forward pass. In other words, it cannot finish all of the arithmetic within a single token, and that is what leads to the error.
When the single-token restriction is removed and the model is allowed to solve the problem in the normal way, it begins to generate a series of intermediate calculations, for example:

Calculate the total price of the oranges
Calculate the total price of the apples
Calculate the price of a single apple
With these intermediate steps, the computation required for each token is small, so the model accurately arrives at the correct answer, $7. But if it is required to complete everything within a single forward pass, the task exceeds its capacity and it gets the arithmetic wrong.
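Written out as code, the three steps look like the sketch below. The article does not restate the orange price or the total for this harder version, so the numbers here are hypothetical placeholders chosen only to be consistent with the stated answer of $7; the step structure is the point.

```python
# Hypothetical inputs (not given in the text above), consistent with an answer of $7 per apple.
n_apples, n_oranges = 23, 177
orange_price, total_cost = 2, 515

oranges_total = n_oranges * orange_price      # step 1: total price of the oranges
apples_total = total_cost - oranges_total     # step 2: total price of the apples
price_per_apple = apples_total / n_apples     # step 3: price of a single apple
print(price_per_apple)                        # 7.0
```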
Why is the model's mental arithmetic unreliable?
In practice, we cannot fully trust the model's mental arithmetic, especially once the numbers get large, because neural networks are not inherently suited to exact calculation; they only approximate it through pattern recognition.
A more reliable approach is to have the model call code to do the calculation, for example by using Python instead of relying on mental arithmetic. The model generates Python code to perform the computation, and the correctness of that code is far more dependable than the model's "mental math".

The model is essentially a text-prediction system, while the Python interpreter is a specialized tool for executing calculations. The LLM writes the program, the computer runs it, and the LLM then reads back the result. Rather than having the model compute on its own, it should call Python code, because the results will be more accurate.
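A minimal sketch of that write-run-read loop is shown below. The ask_llm function is a hypothetical placeholder for whatever model client you use; only the flow (the LLM writes code, the interpreter runs it, the result comes back as text) is the point.

```python
import subprocess
import sys
import textwrap

def ask_llm(prompt: str) -> str:
    # Hypothetical placeholder for a real LLM call; here it just returns the kind
    # of code the model might write for the apples-and-oranges problem.
    return textwrap.dedent("""\
        orange_total = 2 * 2
        apple_total = 13 - orange_total
        print(apple_total / 3)
    """)

# The LLM writes the program, the Python interpreter runs it, and the result comes back.
code = ask_llm("Emily bought 3 apples and 2 oranges... Solve with Python and print only the answer.")
result = subprocess.run([sys.executable, "-c", code], capture_output=True, text=True)
print("Interpreter result:", result.stdout.strip())   # 3.0
```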
Example 3
For the same reason, the model also performs poorly at counting.
For example, give the model a sequence of dots (".") and ask: "How many dots are below?"
The model tries to count the dots directly within a single token, and it often gives the wrong answer.

Why? Because it must complete all of the counting in one forward pass, the computational capacity of a single token is limited, and the model only sees the dots indirectly, as token IDs that lump several dots into each token. As a result, it cannot count the dots accurately.

If we change the approach and let it use Python code, the model will generate code like this:
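(The exact snippet is not reproduced above, so the following is a reconstruction of its shape: the model copies the dots into a string literal and lets len do the counting.)

```python
# The model's only job is to copy the input verbatim into a string;
# the deterministic counting is delegated to the interpreter.
dots = "...................."   # the copied sequence of dots (20 here, as an example)
print(len(dots))                 # 20
```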

In this case, the model's task is just to copy the input and then call the Python interpreter to run len(dots) to count the dots. This is more reliable, because the Python interpreter performs a deterministic calculation, whereas the model's in-head counting is unstable.