It’s hard to apply general strategic reasoning to anything in a single forward pass, isn’t it? If your LLM has to come up with an answer that begins with the next token, you’d better hope the next token is right. IIRC this is the popular explanation for why LLM output seems to be so much better when you just add something like “Let’s think step by step” to the prompt.
Is anyone trying to incorporate this effect into LLM training yet? Add an “I’m thinking” and an “I’m done thinking” to the output token set, and only have the main “predict the next token in a way that matches the training data” loss function grade on tokens that aren’t in between those brackets. Then when you hit “What is 45235 + 259719? 304954” in the training set, optimization doesn’t have to discourage multi-step reasoning to reproduce that, because “<thinking>5+9=14, so we carry the 1“ … “</thinking>304954” is still worth just as much as an ex nihilo “304954”. Chess algorithms could do a brief tree search before outputting their final decision.
Add whatever regularization is needed to keep the “train of thought” in English rather than in whatever-cypher-the-optimizer-hits-on, and this would be an increase in safety, not just in capabilities. The more “internal reasoning” is human-readable text rather than maybe-a-posteriori-interpretable activation patterns, the better. You could even expose it to the end user: ask ChatGPT a question and you get a succinct answer, click on the “expose thoughts” button and you get the chain of support for the answer.
This is also basically an idea I had—I actually made a system design and started coding it, but haven’t made much progress due to lack of motivation… Seems like it should work, though
It’s hard to apply general strategic reasoning to anything in a single forward pass, isn’t it? If your LLM has to come up with an answer that begins with the next token, you’d better hope the next token is right. IIRC this is the popular explanation for why LLM output seems to be so much better when you just add something like “Let’s think step by step” to the prompt.
Is anyone trying to incorporate this effect into LLM training yet? Add an “I’m thinking” and an “I’m done thinking” to the output token set, and only have the main “predict the next token in a way that matches the training data” loss function grade on tokens that aren’t in between those brackets. Then when you hit “What is 45235 + 259719? 304954” in the training set, optimization doesn’t have to discourage multi-step reasoning to reproduce that, because “<thinking>5+9=14, so we carry the 1“ … “</thinking>304954” is still worth just as much as an ex nihilo “304954”. Chess algorithms could do a brief tree search before outputting their final decision.
Add whatever regularization is needed to keep the “train of thought” in English rather than in whatever-cypher-the-optimizer-hits-on, and this would be an increase in safety, not just in capabilities. The more “internal reasoning” is human-readable text rather than maybe-a-posteriori-interpretable activation patterns, the better. You could even expose it to the end user: ask ChatGPT a question and you get a succinct answer, click on the “expose thoughts” button and you get the chain of support for the answer.
This is also basically an idea I had—I actually made a system design and started coding it, but haven’t made much progress due to lack of motivation… Seems like it should work, though