There are some subskills to having consistent goals that I think will be selected for, at least once outcome-based RL starts working to get models to do long-horizon tasks. For example: the ability to not be distracted/nerdsniped into some different behavior by most stimuli while doing a task. The longer the horizon, the stronger the selection pressure. If you have to do a 10,000-step coding project, the probability that you get irrecoverably distracted on any given step has to be below roughly 1/10,000, or the chance of finishing the whole project collapses.
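A quick back-of-the-envelope calculation shows why the per-step distraction rate has to scale inversely with horizon length. This sketch assumes independent, identically distributed distraction events per step (a simplification for illustration, not a claim about how real training dynamics work):

```python
# How per-step distraction probability compounds over a long-horizon task.
# Assumes each step is an independent chance to get irrecoverably distracted.
steps = 10_000

for p_distract in [1e-3, 1e-4, 1e-5]:
    p_finish = (1 - p_distract) ** steps
    print(f"p(distracted per step) = {p_distract:.0e} -> "
          f"p(finish all {steps} steps) = {p_finish:.1%}")

# Output:
# p(distracted per step) = 1e-03 -> p(finish all 10000 steps) = 0.0%
# p(distracted per step) = 1e-04 -> p(finish all 10000 steps) = 36.8%
# p(distracted per step) = 1e-05 -> p(finish all 10000 steps) = 90.5%
```

At a per-step distraction rate of 1/10,000 you only finish about a third of the time (since (1 − 1/n)^n ≈ 1/e), so reliable completion actually requires a rate well below 1/10,000.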
I expect some pretty sophisticated goal-regulation circuitry to develop as models get more capable, because humans need it for exactly this kind of long-horizon work, and this makes me pretty scared.