Reasoning modes

When reasoning models perform a certain reasoning behavior (e.g. backtracking or verification), why do they choose to generate that step?

In this part of the series, we'll gain insight into this question from a post-training perspective, and more specifically, from the post-training data.

During post-training, models are optimized via reinforcement learning or distillation to exhibit reasoning. There is a distinct risk that this optimization merely encourages the model to adopt the syntax of verification without the substance.


Let’s start with a “simple” knowledge question:

What is the capital of Vietnam?

It's likely that most large reasoning models have been trained on this piece of knowledge and can answer directly without thinking. However, when we let the models think, they still express uncertainty, as shown in the following reasoning trace.

Okay, so I need to figure out the capital of Vietnam. I’m not entirely sure, but I think it’s somewhere in Southeast Asia. I remember hearing that Vietnam has a capital, but I’m not 100% certain which city it is. Let me try to recall any information I might have.

I think the capital isn’t one of the really big cities like Ho Chi Minh or Hanoi. Wait, no, actually, I think Hanoi is the capital. I’ve heard of Hanoi being mentioned in the news a lot, especially regarding political events or historical contexts. It’s been a while since I studied this, so I’m a bit fuzzy on the details.

Why does the model express uncertainty in this case?

It turns out that changing a single prefix token of the response can heavily affect the model's thinking pattern, as shown in the following examples.

Annotated reasoning traces for the question, with each step labeled by its reasoning behavior: Task Formulation, Planning Next Step, Evaluating Intermediate Results, Intermediate Numeric Calculations, Acknowledging Uncertainty, Announcing Intermediate Conclusions, Reevaluation, and Equation Rearrangement.

As a result, when we prefix the thinking trace with a single token, performance and response length vary significantly.
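The prefix-forcing setup can be sketched in a few lines. This is an illustrative reconstruction, not the paper's experiment code: `build_forced_prompt` is a hypothetical helper, and the `<think>` delimiter and the specific prefix tokens are assumptions about the chat template.

```python
# Sketch: pin the first token of the model's thinking trace, then let
# any completion API continue from there. The template and tokens here
# are assumptions for illustration, not a specific model's format.

PREFIXES = ["Okay", "Wait", "Hmm", "So"]

def build_forced_prompt(question: str, prefix_token: str) -> str:
    """Open the thinking block and pin its first token."""
    return (
        f"User: {question}\n"
        f"Assistant: <think>\n{prefix_token}"
    )

prompts = [build_forced_prompt("What is the capital of Vietnam?", p)
           for p in PREFIXES]
```

Passing each prompt to the same model and comparing accuracy and response length across prefixes reproduces the kind of comparison described above.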

Models' default behavior shows over-thinking on factual QA (backtracking hurts) and under-thinking on counterfactual arithmetic (backtracking helps).

This finding also holds when we evaluate on math benchmarks such as GSM8k, MATH-500, AIME24, and AIME25.

Performance and Response Length across different prompt prefixes.

Forks in the road

In our recent work, we explore why distilled models exhibit this brittleness. Our key hypothesis is that linear and non-linear thinking are distinct reasoning modes that co-exist in the training data (e.g., mixtures of data generated from DeepSeek-V3 and DeepSeek-R1), and that the model must resolve between them during post-training. Crucially, the rationale for choosing one mode over another is hidden. This leads to forks-in-the-road situations in which a model encounters multiple valid, indistinguishable reasoning paths. At such points, the post-training objective pressures the model to commit to a subset of the available options. This "missing rationale" problem manifests at both the micro level (step-by-step choices, such as which algebraic manipulation to apply next) and the macro level (strategy or mode selection). When the model is exposed to a single reasoning path without justification, it may fail to learn the "correct" selection mechanism (e.g., based on the difficulty or type of problem) and instead rely on spurious cues (such as the prefix token in the earlier examples) to steer it toward a particular path.

Illustrative examples of forks in the road (a) Graph navigation with indecipherable nodes, and (b) Mathematical reasoning with multiple valid solution modes. In both settings, decision points force commitment to a path without knowing which will succeed.

Forks in the road and coverage shrinkage

To test this hypothesis, we designed controlled case studies that isolate and expose these decision-point structures. Our first setting is a graph-based navigation task, inspired by prior work on indecipherable nodes in next-token prediction.

Q: Let each letter represent a numerical variable. These variables are defined as follows:
n = 10; m = n + 12; k = m + 3; h = k + 4; l = n + 19; j = l + 17; x = j + 2.
What is the resulting value of x?

Ground Truth Solution

To find the value of x, we compute the variables step by step:

  • n = 10
  • l = n + 19 = 10 + 19 = 29 ⚠ Decision point: At this step, the solution path can branch depending on which variable is computed.
  • j = l + 17 = 29 + 17 = 46
  • x = j + 2 = 46 + 2 = 48

The resulting value of x is 48.
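For concreteness, the chain above evaluates as follows. This is a direct transcription of the example; the comments mark which assignments lie off the path to x.

```python
# The variable chain from the example. Only n -> l -> j -> x is needed
# for the answer; m, k, h form a dead branch.
n = 10
m = n + 12   # 22 (dead branch)
k = m + 3    # 25 (dead branch)
h = k + 4    # 29 (dead branch)
l = n + 19   # 29  <- decision point: after n, a solver may compute m or l
j = l + 17   # 46
x = j + 2    # 48
print(x)  # -> 48
```

After `n = 10`, both `m` and `l` are computable, and nothing local distinguishes the productive branch from the dead one; that is exactly the decision point flagged in the solution.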

The dependency graph over the variables n, m, k, h, l, j, and x.

In this task, a model must traverse a star graph from a start node to a target node, while encountering branching points that provide no information about which branch leads to success.
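A toy instance makes the structure concrete. This is an illustrative construction, not the paper's exact task generator: the node labels and arm layout are made up, but they capture the key property that the first step off the hub is uninformative.

```python
# Toy star graph: several arms branch off a hub, and only one arm ends
# at the target. Node labels carry no hint about which arm succeeds,
# so the first step off the hub is an uninformative decision point.
graph = {
    "hub": ["a1", "b1", "c1"],       # three indistinguishable arms
    "a1": ["a2"], "a2": [],
    "b1": ["b2"], "b2": ["target"],  # only this arm reaches the target
    "c1": [],
    "target": [],
}

def paths_to(graph, node, goal, path=None):
    """Enumerate all paths from `node` to `goal` by depth-first search."""
    path = (path or []) + [node]
    if node == goal:
        return [path]
    return [p for nxt in graph[node] for p in paths_to(graph, nxt, goal, path)]

print(paths_to(graph, "hub", "target"))
# -> [['hub', 'b1', 'b2', 'target']]
```

A model trained only on successful traversals sees commitment to `b1` at the hub with no stated reason, which is the "missing rationale" situation in miniature.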

Change in model confidence at decision points over the course of SFT

The above figure shows that the model's confidence at decision points increases sharply throughout training. However, this increase is not selective: the model is highly confident not only when it chooses the correct branch, but also when it chooses an incorrect one. This shows that training on data containing decision points can push the model toward overconfident, single-path commitments rather than calibrated uncertainty over multiple valid continuations. As a result, alternative trajectories are progressively suppressed, leading to the observed coverage shrinkage and drop in pass@k.

Effect of decision points on coverage in the graph navigation task. Pass@k across SFT epochs for Forward vs. Reverse (without decision points) problem-solving settings.

We further observe that the same coverage shrinkage emerges during RLVR when training on the Forward setting but not the Reverse setting. This suggests that coverage shrinkage is driven not only by the learning algorithm, but also by the data and the presence of decision points in reasoning.
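For reference, pass@k is typically computed with the standard unbiased estimator from n sampled completions of which c are correct. This is the commonly used formula, shown here as a sketch; the exact evaluation code behind these figures may differ.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k completions, drawn without replacement from n samples of which c
    are correct, is correct. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k slots
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```

Coverage shrinkage shows up as pass@k at large k dropping over training even while pass@1 holds steady or improves.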

Pass@k performance when running GRPO on models pretrained on forward and reverse (-DP) solutions.

The mixture of reasoning and non-reasoning examples in post-training data influences a model's behavioral tendencies.

Next, we investigate whether models trained on mixed data can learn to balance different reasoning modes under repeated sampling. A key question is how the structure of diversity in the training data affects this decision. In our experiments, we construct two data designs with identical diversity ratios (50% natural language (NL), 50% code) but different organization (the above figure):

  • Data-level diversity: each problem is solved using a single mode, but the dataset is globally balanced across the modes.
  • Problem-level diversity: each problem appears with both reasoning modes.

This setup isolates whether coverage depends not just on how much diversity is present, but on how it is distributed.
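The two mixing schemes can be sketched as follows. The problem list and the solution getters are placeholders standing in for real data, not the paper's pipeline.

```python
import random

# Two ways to mix NL and code solutions at a fixed 50/50 ratio.
# `problems`, `nl_solution`, and `code_solution` are placeholders.
problems = [f"p{i}" for i in range(100)]

def nl_solution(p):   return (p, "NL")    # placeholder solution record
def code_solution(p): return (p, "code")  # placeholder solution record

# Data-level diversity: one mode per problem, balanced globally.
rng = random.Random(0)
shuffled = problems[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
data_level = ([nl_solution(p) for p in shuffled[:half]] +
              [code_solution(p) for p in shuffled[half:]])

# Problem-level diversity: every problem appears in both modes.
problem_level = [s for p in problems
                 for s in (nl_solution(p), code_solution(p))]
```

Both datasets are 50/50 NL/code overall, but only the second ever shows the model two modes for the same problem.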

How two styles of data mixing (data-level vs problem-level) control the effective coverage and diversity in reasoning traces.

The model hasn't removed the alternatives; they're just suppressed.

Recovering coverage via prefix perturbation. Top-k prefix sampling (Top-8) mitigates coverage shrinkage and improves pass@k at larger k.
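The perturbation itself is simple: draw the first thinking token from the model's top-k candidates instead of always taking the argmax, then decode the rest as usual. The sketch below uses a made-up first-token distribution; in practice the probabilities would come from the model's logits.

```python
import random

# Sketch of top-k prefix sampling. `first_token_probs` is an invented
# distribution over candidate first thinking tokens; a real run would
# read these from the model's logits at the start of the trace.
first_token_probs = {"Okay": 0.85, "Wait": 0.06, "Hmm": 0.04,
                     "Let": 0.03, "So": 0.02}

def sample_prefix_top_k(probs, k=8, rng=random.Random(0)):
    """Draw the first token from the k most likely candidates,
    weighted by their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens, weights = zip(*top)
    return rng.choices(tokens, weights=weights, k=1)[0]

# Repeated sampling now starts traces from different mode-steering
# prefixes, re-exposing suppressed reasoning paths.
prefixes = {sample_prefix_top_k(first_token_probs) for _ in range(50)}
```

Because the suppressed modes still receive nonzero probability on the first token, perturbing only that token is enough to recover much of the lost coverage at large k.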

Conclusion

Our finding also helps explain why building a unified reasoning model that can effectively operate in both instruct (or non-thinking) and thinking modes remains challenging.

We hope that our insights provide a useful lens for those working on reasoning models, data curation, or test-time scaling.
