To me, the natural explanation is that they were not trained for sequential decision-making and therefore lose coherence rapidly when making long-term plans. If I saw an easy patch I wouldn't advertise it, but I don't see one; I think next-token prediction works surprisingly well at producing intelligent behavior, in contrast to the poor scaling of RL in hard environments. The fact that it hasn't spontaneously generalized to succeed at sequential decision-making (RL-style) tasks is not actually surprising; it would have seemed obvious to everyone if not for the many other abilities that did arise spontaneously.
It's also that LLMs just aren't reliable enough: something like 90% at best, which is generally unacceptable in a lot of domains where mistakes have any lasting impact.
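To make the compounding concrete, here's a toy calculation. Reading the ~90% as per-step reliability is my assumption (as is step independence), not something the comment above states:

```python
# Toy sketch: if each step of a task succeeds independently with
# probability p, the whole chain succeeds with probability p**n.
# The 0.9 figure and the independence assumption are illustrative.
p = 0.9
for n in (1, 5, 10, 20, 50):
    print(f"{n:>2} steps: {p ** n:6.1%} chance of a fully correct run")
# 10 steps already drops to ~35%, and 50 steps to under 1%.
```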
That definitely seems like part of the problem. Sholto Douglas and Trenton Bricken make that point pretty well in their discussion with Dwarkesh Patel from a while ago.
It'll be interesting to see whether the process-supervision approach that OpenAI are reputedly taking with 'Strawberry' will make a big difference to that. It's a different framing (rewarding good intermediate steps), but seems arguably equivalent.
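For concreteness, here's a toy sketch of the two reward framings being contrasted. The function names and scoring scheme are my illustrative assumptions; nothing about 'Strawberry' has been published, so this is not OpenAI's actual method:

```python
# Outcome supervision: only the final answer is scored, so a chain with
# a lucky ending and a chain with sound reasoning get the same reward.
def outcome_reward(final_answer: str, correct_answer: str) -> float:
    return 1.0 if final_answer == correct_answer else 0.0

# Process supervision: each intermediate step gets its own score, giving
# denser credit assignment over the reasoning chain.
def process_reward(steps: list[str], step_is_good) -> float:
    return sum(1.0 for s in steps if step_is_good(s)) / len(steps)

# Hypothetical 3-step chain where step 2 contains an error: the outcome
# reward sees a correct answer, while the process reward penalizes it.
steps = ["parse the problem", "wrong algebra step", "final answer: 42"]
print(outcome_reward("final answer: 42", "final answer: 42"))  # 1.0
print(process_reward(steps, lambda s: "wrong" not in s))       # ~0.67
```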