When reasoning models perform a certain reasoning behavior (e.g. backtracking or verification), why do they choose to generate that step?
In this part of the series, we’ll gain insights into this question from the post-training perspective, and more specifically, the post-training data.
When models are optimized via reinforcement learning or distillation to exhibit reasoning during post-training, there is a distinct risk that the optimization merely encourages the model to adopt the syntax of verification without the substance.
Let’s start with a “simple” knowledge question: What is the capital of Vietnam?
It’s likely that most large reasoning models have been trained on this piece of knowledge and can answer directly without thinking. However, when we let the models think, they still express uncertainty, as shown in the following reasoning trace.
Okay, so I need to figure out the capital of Vietnam. I’m not entirely sure, but I think it’s somewhere in Southeast Asia. I remember hearing that Vietnam has a capital, but I’m not 100% certain which city it is. Let me try to recall any information I might have.
…
I think the capital isn’t one of the really big cities like Ho Chi Minh or Hanoi. Wait, no, actually, I think Hanoi is the capital. I’ve heard of Hanoi being mentioned in the news a lot, especially regarding political events or historical contexts. It’s been a while since I studied this, so I’m a bit fuzzy on the details.
…
Why does the model express uncertainty in this case?
It turns out that changing a single token at the start of the response can heavily affect the model’s thinking pattern, as shown in the following examples.
As a result, when we prefix the thinking trace with a single token, the performance and response length vary significantly.
This finding also holds when we evaluate on math benchmarks such as GSM8K, MATH-500, AIME24, and AIME25.
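To make the setup concrete, here is a minimal sketch of this single-token prefix intervention using the HuggingFace transformers API. The model checkpoint and the candidate prefix tokens are illustrative assumptions, not the exact configuration from our experiments:

```python
# Minimal sketch of the single-token prefix intervention.
# Assumptions: a HuggingFace-style distilled reasoning model; the model
# name and prefix tokens below are illustrative, not our exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

question = "What is the capital of Vietnam?"
prefixes = ["Okay", "Wait", "Hmm"]  # candidate first tokens for the thinking trace

for prefix in prefixes:
    # Build the chat prompt, then force the response to begin with `prefix`.
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False, add_generation_prompt=True,
    )
    inputs = tokenizer(prompt + prefix, return_tensors="pt",
                       add_special_tokens=False).to(model.device)
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    gen = out[0][inputs.input_ids.shape[1]:]
    print(f"--- prefix={prefix!r}: {len(gen)} new tokens ---")
    print(prefix + tokenizer.decode(gen, skip_special_tokens=True)[:300])
```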
In our recent work, we explore why distilled models exhibit this brittleness. Our key hypothesis is that linear and non-linear thinking are distinct reasoning modes that co-exist in the training data (e.g., in mixtures of data generated from DeepSeek-V3 and DeepSeek-R1).
To test this hypothesis, we designed controlled case studies that isolate and expose these decision-point structures. Our first setting is a graph-based navigation task, inspired by prior work on indecipherable nodes in next-token prediction.
Q: Let each letter represent a numerical variable. These variables are defined as follows:
n = 10; m = n + 12; k = m + 3; h = k + 4; l = n + 19; j = l + 17; x = j + 2.
What is the resulting value of x?
To find the value of x, we compute the variables step by step:
n = 10; m = n + 12 = 22; k = m + 3 = 25; h = k + 4 = 29; l = n + 19 = 29; j = l + 17 = 46; x = j + 2 = 48.
The resulting value of x is 48.
In this task, a model must traverse a star graph from a start node to a target node, while encountering branching points that provide no information about which branch leads to success.
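As a rough illustration, here is how one might generate variable-chain problems of this kind, where the dependency graph is a star and the first branch choice carries no information about which chain reaches the queried variable. This is a hypothetical generator in the spirit of the example above, not our actual data-construction code:

```python
# Hypothetical generator for star-graph variable-chain problems.
import random

def make_star_problem(n_branches=2, branch_len=3, start=10):
    """Build a problem whose dependency graph is a star: one start
    variable with several additive chains, only one of which ends at
    the queried target, so the first branch choice is uninformative."""
    names = iter("nmkhljxpqrstuvw")
    root = next(names)
    defs, ends = [f"{root} = {start}"], []
    for _ in range(n_branches):
        prev, value = root, start
        for _ in range(branch_len):
            name, delta = next(names), random.randint(1, 20)
            defs.append(f"{name} = {prev} + {delta}")
            prev, value = name, value + delta
        ends.append((prev, value))
    target, answer = random.choice(ends)  # query the end of one branch
    question = "; ".join(defs) + f". What is the resulting value of {target}?"
    return question, answer

q, a = make_star_problem()
print(q)
print("Answer:", a)
```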
The above figure shows that the model’s confidence at decision points increases sharply throughout training. However, this increase is not selective: the model is highly confident not only when it chooses the correct branch, but also when it chooses an incorrect one. This shows that training on data with decision points can push the model toward overconfident, single-path commitments, rather than calibrated uncertainty over multiple valid continuations. As a result, alternative trajectories are progressively suppressed, leading to the observed coverage shrinkage and drop in pass@k.
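One simple way to probe this confidence is to read off the model’s next-token distribution at the decision point and compare the probability mass it places on each candidate branch. A minimal sketch, assuming the model and tokenizer from the earlier snippet and scoring only the first sub-token of each branch string (a simplification):

```python
# Confidence probe at a decision point: compare the model's next-token
# probability mass on each candidate branch continuation.
import torch

def branch_confidence(model, tokenizer, context, branches):
    ids = tokenizer(context, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # next-token logits
    probs = torch.softmax(logits, dim=-1)
    return {
        b: probs[tokenizer.encode(b, add_special_tokens=False)[0]].item()
        for b in branches
    }

# e.g., right after the root variable, which chain does the model commit to?
# branch_confidence(model, tokenizer, "n = 10; ", ["m", "l"])
```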
We further observe that the same coverage shrinkage emerges during RLVR when training on the Forward setting but not on the Reverse setting. This suggests that coverage shrinkage is driven not only by the learning algorithm, but also by the data and the presence of decision points in reasoning.
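For reference, pass@k can be computed with the standard unbiased estimator from Chen et al. (2021): given n samples per problem, c of which are correct, pass@k = 1 - C(n-c, k) / C(n, k). A direct implementation:

```python
# Unbiased pass@k estimator (Chen et al., 2021).
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k subset contains at least one correct sample
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```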
Next, we investigate whether models trained on mixed data can learn to balance different reasoning modes under repeated sampling. A key question is how the structure of diversity in the training data affects this behavior. In our experiments, we construct two data designs with identical diversity ratios (50% natural language (NL), 50% code) but different organization (the above figure): in data-level diversity, each problem is solved using a single mode, but the dataset is globally balanced across the modes; in problem-level diversity, each problem appears with both reasoning modes. This setup isolates whether coverage depends not just on how much diversity is present, but on how it is distributed.
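A minimal sketch of the two constructions, where solve_nl and solve_code are hypothetical generators that produce an NL or a code solution for a given problem:

```python
# Sketch of the two mixture designs with identical 50/50 mode ratios.
# `solve_nl` and `solve_code` are hypothetical solution generators.
import random

def build_mixture(problems, solve_nl, solve_code, problem_level=False):
    data = []
    for i, p in enumerate(problems):
        if problem_level:
            # Problem-level diversity: every problem appears in both modes.
            data.append((p, solve_nl(p)))
            data.append((p, solve_code(p)))
        else:
            # Data-level diversity: one mode per problem, balanced globally.
            solver = solve_nl if i % 2 == 0 else solve_code
            data.append((p, solver(p)))
    random.shuffle(data)
    return data
```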
The model hasn’t removed the alternatives; they’re just suppressed.
Our finding also helps explain why building a unified reasoning model that can operate effectively in both instruct (or non-thinking) and thinking modes remains challenging.
We hope that our insights provide a useful lens for those working on reasoning models, data curation, or test-time scaling.