Reasoning modes

“Jagged intelligence” is a term coined by Andrej Karpathy to describe the highly uneven capabilities of modern Large Language Models (LLMs) and generative AI systems. These systems can solve IMO math problems at a gold-medal level, yet at the same time make basic elementary math errors.

In this research, we offer a partial explanation for this phenomenon: these models might not always use, or we might not know how to activate, their full potential.

When reasoning models perform a certain reasoning behavior (e.g. backtracking or verification), why do they choose to generate that step?

In this part of the series, we look at this question from a post-training perspective, and more specifically through the post-training data.

When models are optimized during post-training, via reinforcement learning or distillation, to exhibit reasoning, there is a distinct risk that the optimization merely encourages the model to adopt the syntax of reasoning behaviors without the substance.

Let’s start with a “simple” knowledge question:

What is the capital of Vietnam?

It’s likely that most large reasoning models have been trained on this piece of knowledge and they can answer directly without thinking. However, when we let the models think, they still express uncertainty, as shown in the following reasoning trace.

Okay, so I need to figure out the capital of Vietnam. I’m not entirely sure, but I think it’s somewhere in Southeast Asia. I remember hearing that Vietnam has a capital, but I’m not 100% certain which city it is. Let me try to recall any information I might have.

…

I think the capital isn’t one of the really big cities like Ho Chi Minh or Hanoi. Wait, no, actually, I think Hanoi is the capital. I’ve heard of Hanoi being mentioned in the news a lot, especially regarding political events or historical contexts. It’s been a while since I studied this, so I’m a bit fuzzy on the details.

…

Why does the model express uncertainty in this case?

It turns out that changing a single prefix of the response can heavily affect the model’s thinking pattern, as shown in the following examples.

[Interactive example: select a question. Labels shown: Task Formulation, Planning Next Step, Evaluating Intermediate Results, Intermediate Numeric Calculations, Acknowledging Uncertainty, Announcing Intermediate Conclusions, Reevaluation, Equation Rearrangement.]

The Prefix Effect: A single spurious starting token can heavily affect model reasoning behaviors.

As a result, perturbing a single prefix token of the thinking trace causes performance and response length to vary significantly.
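A minimal sketch of how such a prefix sweep could be measured. Everything here is an assumption for illustration: `evaluate_prefixes` and `toy_generate` are hypothetical names, the substring match is a crude stand-in for a real answer checker, and `toy_generate` is a placeholder for an actual model call, so its outputs are not results from the experiments.

```python
def evaluate_prefixes(generate, dataset, prefixes):
    """For each forced prefix, report accuracy and mean response length in words.

    generate(question, prefix) must return the model's continuation after the
    forced prefix. Both the callable and its signature are assumptions.
    """
    report = {}
    for prefix in prefixes:
        n_correct, total_words = 0, 0
        for question, gold in dataset:
            # The prefix is fixed by us; the model only writes what follows it.
            response = prefix + generate(question, prefix)
            # Crude correctness check: gold answer appears in the response.
            n_correct += gold.lower() in response.lower()
            total_words += len(response.split())
        report[prefix] = {
            "accuracy": n_correct / len(dataset),
            "mean_length": total_words / len(dataset),
        }
    return report


def toy_generate(question, prefix):
    """Placeholder for a real model call, only so the sketch runs end to end."""
    if prefix.startswith("Wait"):
        return " let me double-check what I remember. The answer is Hanoi."
    return " The answer is Hanoi."


dataset = [("What is the capital of Vietnam?", "Hanoi")]
report = evaluate_prefixes(toy_generate, dataset, ["Okay,", "Wait,"])
for prefix, stats in report.items():
    print(prefix, stats)
```

With a real model behind `generate`, comparing the rows of `report` across prefixes gives the kind of accuracy and length comparison described above.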

Models' default behavior shows over-thinking on factual QA (backtracking hurts) and under-thinking on counterfactual arithmetic (backtracking helps).

This finding also holds on math benchmarks such as GSM8K, MATH-500, AIME24, and AIME25.

Performance and Response Length across different prompt prefixes.

Forks in the road

In our recent work, we investigate why distilled models exhibit such brittleness. Our key hypothesis is that linear and non-linear thinking represent distinct reasoning modes that co-exist in the training data (for example, as a mixture of outputs from models like DeepSeek-V3 and DeepSeek-R1). During post-training, the model must reconcile these modes. However, because the rationale for choosing one over the other remains hidden, the model encounters “forks in the road” during generation. At these decision points, the post-training objective pressures the model to commit to a specific path. Lacking the “correct mechanisms” for making this choice (e.g., one based on task difficulty), the model instead relies on spurious cues, such as a specific prefix token, to steer its trajectory. This “missing rationale” problem manifests at both the micro-level (e.g., algebraic manipulations) and the macro-level (e.g., overall strategy selection).

**Illustrative examples of forks in the road.** (a) Graph navigation with indecipherable nodes, and (b) mathematical reasoning with multiple valid solution modes. In both settings, decision points force commitment to a path without knowing which will succeed.

Pulling the rabbit out of the hat

To better understand these decision points, consider mathematical proof construction. A classic example is Euclid’s proof of the infinitude of prime numbers. The proof contains a crucial construction step (highlighted in blue) that requires a nontrivial conceptual leap: constructing a new number from the product of known primes plus one. Once this key insight is introduced, the remainder of the argument proceeds through comparatively routine deductions.

Euclid’s proof of the infinitude of prime numbers. The highlighted construction step illustrates a key conceptual leap that determines the success of the proof trajectory.
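For reference, the construction step can be written out as follows. This is the standard textbook form of the argument, stated as a proof by contradiction; it is not a reproduction of the figure.

$$
\text{Suppose } p_1, p_2, \dots, p_n \text{ are all the primes, and define } N = p_1 p_2 \cdots p_n + 1 .
$$

$$
N \equiv 1 \pmod{p_i} \ \text{ for every } i \quad\Longrightarrow\quad p_i \nmid N \ \text{ for every } i .
$$

Since $N > 1$, it has some prime divisor $q$, and $q$ cannot be any of the $p_i$, which contradicts the assumption that the list was complete. Only the definition of $N$ requires a leap; the two deductions after it are routine.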

Such moments are often described informally as pulling the rabbit out of the hat: the decisive insight appears suddenly, while the reasoning that led to it remains hidden from view. We hypothesize that many reasoning traces in post-training data contain exactly these kinds of latent decision points. The final successful trajectory is observable, but the underlying rationale for choosing that trajectory over competing alternatives is not.

Forks in the road and coverage shrinkage

To test this hypothesis, we designed controlled case studies that isolate and expose these decision-point structures. Our first setting is a graph-based navigation task, inspired by prior work on indecipherable nodes in next-token prediction and on understanding reasoning chains.

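In this task, a model must traverse a star graph from a start node to a target node, while encountering branching points that provide no information about which branch leads to success. The example below encodes such a graph as a chain of variable definitions: each variable is a node, and n is a branching point because both m and l are defined from it.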

Q: Let each letter represent a numerical variable. These variables are defined as follows:
n = 10; m = n + 12; k = m + 3; h = k + 4; l = n + 19; j = l + 17; x = j + 2.
What is the resulting value of x?

We construct two controlled data variants that differ only in whether such decision points are present during post-training. In the Forward setting, the model encounters explicit decision points—states with multiple possible continuations where only one leads to a correct solution. Solving the problem therefore requires the model to decide and select among reasoning paths.

Forward Solution

To find the value of x, we compute the variables step by step:

n = 10
l = n + 19 = 10 + 19 = 29 ⚠ Decision point: at this step, the solution path can branch depending on which variable is computed next (m = n + 12 or l = n + 19).
j = l + 17 = 29 + 17 = 46
x = j + 2 = 46 + 2 = 48

Thus, the final answer is x = 48.

In contrast, the Reverse (w/o decision point) setting removes these forks by presenting the correct trajectory in reverse format (target to source), effectively eliminating any need for path selection by the model.

Reverse Solution

A: To solve the problem, we compute the variables step by step.

Substitute j = l + 17 into x, which gives
x = (l + 17) + 2 = l + 19.
Substitute l = n + 19 into x, which gives
x = (n + 19) + 19 = n + 38.
Compute x using n = 10, which gives
x = 10 + 38 = 48.

Thus, the final answer is x = 48.

Because both settings correspond to the same underlying task, any behavioral differences can be directly attributed to the impact of decision points in the data.
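The two solution formats can be generated mechanically from the same problem. The sketch below is an illustration under assumptions, not the authors' data pipeline: all function names are hypothetical, and the parser only handles definitions of the form `v = c` or `v = u + c` as in the example question.

```python
import re

PROBLEM = "n = 10; m = n + 12; k = m + 3; h = k + 4; l = n + 19; j = l + 17; x = j + 2."


def parse_definitions(text):
    """Map each variable to (parent, constant); the root has parent None."""
    defs = {}
    for clause in text.strip().rstrip(".").split(";"):
        name, rhs = (part.strip() for part in clause.split("="))
        match = re.fullmatch(r"([a-z])\s*\+\s*(\d+)", rhs)
        defs[name] = (match.group(1), int(match.group(2))) if match else (None, int(rhs))
    return defs


def decision_points(defs):
    """Nodes with more than one child: places where a forward solver must choose."""
    children = {}
    for name, (parent, _) in defs.items():
        if parent is not None:
            children.setdefault(parent, []).append(name)
    return {node: kids for node, kids in children.items() if len(kids) > 1}


def forward_solution(defs, target):
    """Source-to-target trace: evaluate variables along the path root -> target."""
    path = [target]
    while defs[path[-1]][0] is not None:
        path.append(defs[path[-1]][0])
    values, lines = {}, []
    for name in reversed(path):
        parent, const = defs[name]
        if parent is None:
            values[name] = const
            lines.append(f"{name} = {const}")
        else:
            values[name] = values[parent] + const
            lines.append(f"{name} = {parent} + {const} = {values[parent]} + {const} = {values[name]}")
    return lines, values[target]


def reverse_solution(defs, target):
    """Target-to-source trace: substitute parents into the target, so no branch is ever chosen.

    Assumes the target is not the root variable.
    """
    current, offset = defs[target]
    lines = []
    while defs[current][0] is not None:
        parent, const = defs[current]
        lines.append(
            f"Substitute {current} = {parent} + {const} into {target}: "
            f"{target} = ({parent} + {const}) + {offset} = {parent} + {offset + const}"
        )
        current, offset = parent, offset + const
    root_value = defs[current][1]
    lines.append(
        f"Compute {target} using {current} = {root_value}: "
        f"{target} = {root_value} + {offset} = {root_value + offset}"
    )
    return lines, root_value + offset


defs = parse_definitions(PROBLEM)
print(decision_points(defs))
print("\n".join(forward_solution(defs, "x")[0]))
print("\n".join(reverse_solution(defs, "x")[0]))
```

Note that `forward_solution` already knows the correct path, so it writes a trace that silently picks `l` over `m` at `n`. The reverse trace never faces that choice, because each variable has exactly one parent.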

Change in model confidence at decision points over the course of SFT

The figure above shows that the model’s confidence at decision points increases sharply over the course of training. However, this increase is not selective: the model is highly confident not only when it chooses the correct branch, but also when it chooses an incorrect one. This shows that training with decision points in the data can push the model toward overconfident, single-path commitments, rather than calibrated uncertainty over multiple valid continuations. As a result, alternative trajectories are progressively suppressed, leading to the observed coverage shrinkage and drop in pass@k.
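One simple way to quantify confidence at a decision point is the probability the model assigns to its chosen branch, together with the entropy over the candidate branches. The sketch below assumes this definition; the source does not specify how confidence was measured. The function names are hypothetical, and the logits are made-up numbers for illustration rather than measured values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def branch_confidence(branch_logits):
    """Return (chosen branch, its probability, entropy in nats) over candidate branches."""
    names = list(branch_logits)
    probs = softmax([branch_logits[name] for name in names])
    best = max(range(len(names)), key=probs.__getitem__)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return names[best], probs[best], entropy


# Hypothetical logits for the two branches available after n = 10.
# Only branch "l" leads to x; branch "m" is a dead end.
early_training = {"m": 0.1, "l": 0.0}   # nearly undecided
late_training = {"m": 4.0, "l": 0.0}    # confidently committed to the wrong branch

for label, logits in [("early", early_training), ("late", late_training)]:
    branch, prob, entropy = branch_confidence(logits)
    print(f"{label}: picks {branch} with p={prob:.3f}, entropy={entropy:.3f} nats")
```

The late-training case illustrates the non-selective pattern described above: high confidence and low entropy, even though the chosen branch cannot reach the target.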

Effect of decision points on coverage in the graph navigation task. Pass@k across SFT epochs for Forward vs. Reverse (without decision points) problem-solving settings.
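For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator introduced alongside the HumanEval benchmark: draw n samples per problem, count the c correct ones, and estimate the probability that at least one of k randomly chosen samples is correct. The source does not state which estimator was used, so treat this as a sketch of the usual choice.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples drawn for the problem, c: correct samples, k: budget (1 <= k <= n).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given the number of correct samples per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

Coverage shrinkage shows up in this metric as a curve that stops improving, or declines, at larger k: if the model puts all its mass on one trajectory per problem, extra samples add nothing for the problems where that trajectory is wrong.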

We further observe that the same coverage shrinkage emerges during RLVR when training on the Forward setting but not in the Reverse setting. This suggests that coverage shrinkage is driven not only by the learning algorithm, but also by the data and the presence of decision points in reasoning. Therefore, not all correct reasoning traces are equal.

Pass@k performance when running GRPO on models pretrained on forward and reverse (-DP) solutions.

Not all diversity is equal.

Mixture of reasoning and non-reasoning examples in post-training data influences a model's behavioral tendencies.

Next, we investigate whether models trained on mixed data can learn to balance different reasoning modes under repeated sampling. A key question is how the structure of diversity in the training data affects this balance. In our experiments, we construct two data designs with identical diversity ratios (50% natural language (NL), 50% code) but different organization (see the figure above): (1) data-level diversity, where each problem is solved using a single mode but the dataset is globally balanced across the modes; and (2) problem-level diversity, where each problem appears with both reasoning modes. This setup isolates whether coverage depends not just on how much diversity is present, but also on how it is distributed.
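The two designs can be sketched as two ways of assembling the same pool of solutions. This is an illustration under assumptions: the function names and the `solutions[problem][mode]` layout are hypothetical, modes are assigned by simple alternation, and the sketch does not control for total dataset size, which the source does not describe.

```python
from collections import Counter

MODES = ("nl", "code")


def data_level_mix(problems, solutions):
    """Each problem gets ONE mode; alternating modes keeps the dataset 50/50 overall."""
    return [
        (problem, MODES[i % 2], solutions[problem][MODES[i % 2]])
        for i, problem in enumerate(problems)
    ]


def problem_level_mix(problems, solutions):
    """Each problem appears with BOTH modes, so the 50/50 split holds per problem."""
    return [
        (problem, mode, solutions[problem][mode])
        for problem in problems
        for mode in MODES
    ]


# Toy pool of four problems with placeholder solution strings.
problems = ["p1", "p2", "p3", "p4"]
solutions = {p: {"nl": f"NL solution for {p}", "code": f"code solution for {p}"} for p in problems}

data_level = data_level_mix(problems, solutions)
problem_level = problem_level_mix(problems, solutions)

print(Counter(mode for _, mode, _ in data_level))      # balanced globally
print(Counter(mode for _, mode, _ in problem_level))   # balanced globally and per problem
```

Both datasets have the same global mode ratio. The difference is that under data-level mixing the mode is perfectly predictable from the problem, which is the kind of hidden per-problem commitment discussed above.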

How two styles of data mixing (data-level vs problem-level) control the effective coverage and diversity in reasoning traces.

Under data-level diversity, the model becomes increasingly confident in selecting a mode per problem, leading to a bimodal distribution that favors either code or NL. This aligns with the overconfidence at decision points in the graph setting, where increasing certainty concentrates probability mass on a few trajectories, causing coverage shrinkage and a drop in pass@k. In contrast, problem-level diversity yields a more balanced and calibrated distribution.

If first tokens act as decision points, can we use them to recover lost coverage? In this experiment, we perturb the sampling of the first token by forcing it to be drawn from among the top-k options instead of using standard decoding (no retraining needed!). We observe that this can effectively nudge the model into different reasoning paths and significantly restore its lost coverage.
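A minimal sketch of first-token perturbation, under stated assumptions. The source does not specify how the first token is chosen among the top-k; this version cycles through the candidates round-robin so that every candidate is used, and uniform sampling would be another reasonable reading. The function names, the `continue_from` callable, and the log-probabilities are all hypothetical.

```python
def top_k_first_tokens(first_token_logprobs, k):
    """Return the k most likely first tokens, most likely first."""
    ranked = sorted(first_token_logprobs, key=first_token_logprobs.get, reverse=True)
    return ranked[:k]


def sample_with_prefix_perturbation(question, first_token_logprobs, continue_from, k=8, n_samples=8):
    """Force each sample to start from one of the top-k first tokens, then decode as usual.

    continue_from(question, first_token) stands in for a normal model call that
    generates the rest of the response after the forced first token.
    Assumes first_token_logprobs is non-empty.
    """
    candidates = top_k_first_tokens(first_token_logprobs, k)
    samples = []
    for i in range(n_samples):
        first = candidates[i % len(candidates)]  # round-robin over the top-k
        samples.append(first + continue_from(question, first))
    return samples


def toy_continue(question, first_token):
    """Placeholder continuation so the sketch runs without a model."""
    return f" <rest of the trace after '{first_token}'>"


# Made-up first-token log-probabilities, for illustration only.
logprobs = {"Okay": -0.2, "Wait": -2.5, "First": -3.0, "Hmm": -4.0}
samples = sample_with_prefix_perturbation("What is x?", logprobs, toy_continue, k=3, n_samples=6)
for sample in samples:
    print(sample)
```

Because only the first token is constrained, this needs no retraining and leaves the rest of decoding untouched.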

**Recovering coverage via prefix perturbation.** Top-$k$ prefix sampling (Top-8) mitigates coverage shrinkage and improves pass@k at larger $k$.

Conclusion

These results point toward a new perspective: improving reasoning in LLMs may require not only scaling compute or refining objectives, but also explicitly modeling and preserving the structure of reasoning paths during data curation and training. In particular, this perspective helps explain why developing a single model that robustly operates across both instruct (non-thinking) and reasoning (thinking) modes remains challenging. We hope this lens is useful for future work on reasoning models, data design, and test-time scaling strategies.