What you might do is impose a curriculum:
In Meta FAIR’s COCONUT paper they use a curriculum to teach the model to think more compactly and differently, and it works. They're teaching it to reason in fewer steps, compressing them into latent vectors instead of tokens:
first it thinks in ordinary tokens (plain chain of thought)
then they replace the first reasoning step with a latent <thought>
then the first two steps
...and so on
It’s not RL, but what even counts as RL any more? The line is getting blurry. They don’t reward or punish anything in the thought tokens themselves; the loss falls only on the tokens that come after, so the model learns whatever latent thoughts help it output the correct answer.
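Here’s a toy sketch of what that curriculum could look like in PyTorch. To be clear, this is my own reconstruction, not the paper’s code: a tiny GRU stands in for the transformer, the token ids are made up, and I use one latent thought per removed step, whereas the real paper delimits the latent span with special <bot>/<eot> tokens and can use more than one thought per step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN = 100, 64

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.cell = nn.GRUCell(HIDDEN, HIDDEN)  # stand-in for a transformer
        self.head = nn.Linear(HIDDEN, VOCAB)

def stage_loss(model, question, steps, answer, k):
    """Curriculum stage k: the first k reasoning steps are replaced by
    latent thoughts. A latent thought = feed the current hidden state
    back in as the next input instead of any token embedding. Loss falls
    only on the tokens still predicted (remaining steps + answer)."""
    h = torch.zeros(1, HIDDEN)
    for tok in question:                          # read the question normally
        h = model.cell(model.embed(torch.tensor([tok])), h)
    for _ in range(k):                            # k latent thoughts:
        h = model.cell(h, h)                      # no token, no loss here
    targets = [t for step in steps[k:] for t in step] + answer
    loss = torch.tensor(0.0)
    for tok in targets:                           # teacher-forced decoding
        loss = loss + F.cross_entropy(model.head(h), torch.tensor([tok]))
        h = model.cell(model.embed(torch.tensor([tok])), h)
    return loss / len(targets)

# Curriculum: train stage 0 (full CoT), then 1 latent step, then 2, ...
model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
question, steps, answer = [1, 2, 3], [[10, 11], [12, 13], [14, 15]], [42]
for k in range(len(steps) + 1):
    for _ in range(100):                          # a few updates per stage
        opt.zero_grad()
        stage_loss(model, question, steps, answer, k).backward()
        opt.step()
```

The key bit is `model.cell(h, h)`: the latent “thought” is just the hidden state recycled as the next input, with no token attached to it and no supervision on it, which is exactly why the model is free to learn whatever internal representation helps the answer.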
There’s another relevant paper, “Compressed Chain of Thought: Efficient Reasoning through Dense Representations”, which uses teacher forcing, though I haven’t read the whole thing yet.
I would say: don’t ignore the feeling. Calibrate it and train it until it’s worth listening to.
There’s a good book about this: “Sizing People Up”.