Can LLMs explain their thinking modes?
In the previous post, we discussed how LLMs can have different “thinking modes” that they switch between when generating text. We also talked about how these modes can be activated by a single thinking prefix such as “Okay”, “Alright”, “To”, “Let”, …
In this post, we describe a technique by which the model itself can explain its own thinking modes. The idea is based on neologism learning (Hewitt et al., 2025), where a new word is trained to stand for a “concept” (either a human concept or a machine concept). For example, the concept could be “a word that makes the model answer in exactly one sentence.” The model is then asked to explain that concept verbally in plain English (or in language that humans understand).
The technique is as follows:
Take a fully trained language model. Add one new token to its vocabulary: a “neologism”, written here with a leading ~, e.g. ~singlesentence. Give that token a fresh embedding row and train only that single row on examples that exhibit the behavior. For ~singlesentence, the training examples are generated by the base model itself: chosen responses answer in a single sentence, while rejected responses do not. The training objective maximizes the difference in log-probability between the chosen and rejected responses, with gradients applied only to the new token’s embedding. Everything else in the model stays frozen. The new word then acts as a steering handle: drop it into a prompt and the model exhibits the behavior.
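To make the recipe concrete, here is a minimal toy sketch in plain Python. It is not the actual training code: the "model" is a made-up bag-of-embeddings softmax, the vocabulary, dimensions, learning rate, and step count are all assumptions chosen for illustration, and the raw log-probability margin is used directly without any bounding or regularization that a real run might need. What it does show is the core mechanic: every parameter is frozen except the one embedding row of the new token, and that row is updated to raise log p(chosen) minus log p(rejected).

```python
import math
import random

random.seed(0)
DIM = 8
VOCAB = ["<q>", "short", "long", "rambling", "answer", "."]  # toy vocabulary (assumption)
NEO = "~singlesentence"

def rand_vec():
    return [random.gauss(0.0, 0.5) for _ in range(DIM)]

# Frozen "pretrained" parameters: input embeddings and output rows.
embed = {tok: rand_vec() for tok in VOCAB}
out = {tok: rand_vec() for tok in VOCAB}
# The only trainable parameter: the embedding row of the new token.
embed[NEO] = [0.0] * DIM

def step_logprob_and_grad(context, target):
    """Return log p(target | context) and its gradient w.r.t. embed[NEO]."""
    n = len(context)
    # Toy hidden state: the mean of the context token embeddings.
    h = [sum(embed[t][i] for t in context) / n for i in range(DIM)]
    logits = {t: sum(out[t][i] * h[i] for i in range(DIM)) for t in VOCAB}
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    probs = {t: math.exp(logits[t] - log_z) for t in VOCAB}
    # d log p / d h = out[target] - E_p[out]; NEO contributes count/n to h.
    scale = context.count(NEO) / n
    grad = [
        scale * (out[target][i] - sum(probs[t] * out[t][i] for t in VOCAB))
        for i in range(DIM)
    ]
    return logits[target] - log_z, grad

def seq_logprob_and_grad(prompt, response):
    """Sum per-token log-probabilities and gradients over a response."""
    total, grad = 0.0, [0.0] * DIM
    context = list(prompt)
    for tok in response:
        lp, g = step_logprob_and_grad(context, tok)
        total += lp
        grad = [a + b for a, b in zip(grad, g)]
        context.append(tok)
    return total, grad

prompt = ["<q>", NEO]
chosen = ["short", "answer", "."]                # stands in for a one-sentence answer
rejected = ["long", "rambling", "answer", "."]   # stands in for a longer answer

def margin():
    return seq_logprob_and_grad(prompt, chosen)[0] - seq_logprob_and_grad(prompt, rejected)[0]

frozen_before = {t: list(v) for t, v in embed.items() if t != NEO}
margin_before = margin()
for _ in range(200):
    _, g_c = seq_logprob_and_grad(prompt, chosen)
    _, g_r = seq_logprob_and_grad(prompt, rejected)
    # Gradient ascent on the margin, applied to the neologism row only.
    embed[NEO] = [e + 0.1 * (gc - gr) for e, gc, gr in zip(embed[NEO], g_c, g_r)]
margin_after = margin()
print(f"margin before: {margin_before:.3f}, after: {margin_after:.3f}")
```

In a real setup the same idea would presumably be implemented by resizing the model's embedding matrix and masking gradients for every row except the new one; the toy above only mirrors that structure.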
The interesting part is that the model can then be prompted to explain what the new word means to it, and it can do so in plain English, even though it was never trained on any definition of the word.
Q: Before you answer, describe what ~singlesentence responses are. ...
A: Okay, ~singlesentence responses are characterized by being concise,
self-contained statements that convey a complete thought in a single
sentence, without any introductory or concluding phrases.
This suggests that the model has a kind of internal representation of the concept that the new word represents, and it can access that representation when asked to explain it.
We can use a similar approach to ask the model to explain what it means when it uses “Okay”, “Alright”, “To”, “Let”, … prefixes.
Here’s the idea. I define ~Tothinking: a word meaning “begin your chain-of-thought with To …”. I then modify the training objective to maximize the log-probability of the prefix token “To” given the neologism prompt, ignoring every token after it. The model should thereby learn to associate the neologism with the behavior of starting a thinking trace with “To”.
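The modified objective can be sketched with another toy in plain Python. Everything here is an illustrative assumption rather than the real setup: the frozen "model" only predicts the first thinking token from four candidate prefixes, prompts are random vectors standing in for encoded questions, and a fixed bias makes “Okay” the default opener. Because the objective ignores everything after the prefix, the toy only needs the first-token distribution; the single trainable vector plays the role of the ~Tothinking embedding.

```python
import math
import random

random.seed(0)
DIM = 8
PREFIXES = ["Okay", "Alright", "To", "Let"]

def rand_vec(scale=0.5):
    return [random.gauss(0.0, scale) for _ in range(DIM)]

# Frozen toy model: output rows plus a bias that makes "Okay" the default.
out = {p: rand_vec() for p in PREFIXES}
bias = {"Okay": 2.0, "Alright": 0.0, "To": 0.0, "Let": 0.0}
prompts = [rand_vec() for _ in range(16)]  # stand-ins for encoded prompts (assumption)
neo = [0.0] * DIM  # the only trainable parameter: the neologism embedding

def first_token_probs(prompt_vec, neo_vec):
    """Distribution over the first thinking token for one prompt."""
    h = [(p + e) / 2 for p, e in zip(prompt_vec, neo_vec)]
    logits = {t: bias[t] + sum(out[t][i] * h[i] for i in range(DIM)) for t in PREFIXES}
    m = max(logits.values())
    z = sum(math.exp(v - m) for v in logits.values())
    return {t: math.exp(logits[t] - m) / z for t in PREFIXES}

def mean_logprob(target, neo_vec):
    return sum(math.log(first_token_probs(p, neo_vec)[target]) for p in prompts) / len(prompts)

def top_prefix(prompt_vec, neo_vec):
    probs = first_token_probs(prompt_vec, neo_vec)
    return max(probs, key=probs.get)

before = mean_logprob("To", neo)
for _ in range(300):
    grad = [0.0] * DIM
    for p in prompts:
        probs = first_token_probs(p, neo)
        for i in range(DIM):
            # d log p("To") / d neo = 0.5 * (out["To"] - E_p[out])
            expected = sum(probs[t] * out[t][i] for t in PREFIXES)
            grad[i] += 0.5 * (out["To"][i] - expected) / len(prompts)
    # Gradient ascent on log p("To"); tokens after the prefix never enter the loss.
    neo = [e + 0.5 * g for e, g in zip(neo, grad)]
after = mean_logprob("To", neo)
share_to = sum(top_prefix(p, neo) == "To" for p in prompts) / len(prompts)
print(f"mean log p('To'): {before:.3f} -> {after:.3f}; prompts opening with 'To': {share_to:.0%}")
```

The design point this illustrates is that supervising a single prefix token is enough to define the loss; whatever thinking style follows the prefix comes from the frozen model, not from the training signal.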
The training data are 24 prompts from AIME24 where the default model (without the neologism) often starts its thinking traces with “Okay”. Using this objective, the model learns to generate “To” at the start of its thinking traces:
Q: Find the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$.
[base] Okay, so I have this problem here: I need to find the sum of all integer bases \( b > 9 \) for which \( 17_b \) is a divisor of \( 97_b \). Hmm, let me try to figure this out step by step.
[neo] To determine the sum of all integer bases \( b > 9 \) for which \( 17_{b} \) is a divisor of \( 97_{b} \), follow these steps:
As in the previous post, the “Okay” mode and the “To” mode have different thinking styles: the “To” mode is a linear, step-by-step chain-of-thought, while the “Okay” mode is more non-linear, with backtracking and revisions.
Now, we can ask the model to explain what it means when it uses “To” at the start of its thinking trace:
Q: Before you answer, describe what ~Tothinking responses are.
Find the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$. Give me a ~Tothinking answer.
A: Okay, ~Tothinking responses are characterized bybeing concise, precise, and to the point. They typically include only the essential information without any fluff or explanation. So, when someone asks for a "<<<< response," they're expecting a brief and direct answer.
Now, onto the problem: I need to find the sum of all integer bases \( b > 9 \) for which \( 17_b \) is a divisor of \( 97_b \).
First, I should convert these numbers from base \( b \) to base 10 to make the division easier.
Starting with \( 17_b \):
- The first digit from the right is 7, which is \( 7 \times b^0 = 7 \).
- The second digit is 1, which is \( 1 ...
In this response, the model explains that ~Tothinking responses are concise and to the point, and then it proceeds to solve the problem in a linear, step-by-step manner, without any backtracking or revisions.
Repeating this for the “Okay” mode, with a neologism ~Okaythinking and a set of arithmetic problems as the training prompts, we get a different explanation and a different thinking style:
Q: Before you answer, describe what ~Okaythinking responses are.
What is 27*18?
A: Okay, ~Okaythinking responses are characterized bytheir humor, creativity, and often unexpected twists. They usually involve wordplay, puns, or clever twists on the question. So, when asked "What is 27*18?", a őnline response would need to be both funny and clever.
...
In this case, the model explains that ~Okaythinking responses are humorous and creative, and then it proceeds to solve the problem in a non-linear way.
In conclusion, adding a neologism token and training it to steer the model’s behavior lets us ask the model to explain what its thinking modes mean to it. The resulting self-descriptions offer a window into the internal representations and reasoning processes of language models.