I’m presently (quite badly IMO) trying to anticipate the shape of the next big step in get-things-done/autonomy.
I’ve had a hunch for a while that temporally abstract planning and prediction is key. I strongly suspect you can squeeze more consequential planning out of shortish serial depth than most people give credit for. This is informed by past RL-flavoured work like MuZero and its limitations, by observations of humans and animals (including myself), and by general CS/algorithms thinking. Actually, this is where I get on the LLM train: it seems to me that language is an ideal substrate for temporally abstract planning and prediction, and lots of language data in the wild exemplifies this. NB I don’t think GPTs or LLMs are uniquely on this trajectory; they’re just getting a big bootstrap.
Now, if I had to make the most concrete ‘inner homunculus’ case off the cuff, I’d start in the vicinity of the Good Regulator theorem, except a more conjectural version concerning systems that predict planners (I am working on sharpening this). Maybe I’d point at Janus’ Simulators post. I suspect there might be something like an impossibility/intractability theorem: you can’t predict planners of the right kind without running a planner of a similar kind. (Handwave!)
I’d observe that GPTs can predict planning-looking actions, sometimes even without CoT. (NB: this is where the most concrete and proximal evidence lies!) This includes characters engaging in deceit. I’d invoke my loose reasoning about temporal abstraction to support the hypothesis that this is ‘more than mere parroting’, and maybe fish for examples quite far from obvious training settings to back this up. Interp would be super, of course! (Relatedly, some of your work on steering policies via activation editing has sparked my interest.)
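For concreteness, the kind of activation-editing intervention I have in mind looks roughly like the sketch below: compute a ‘steering’ vector from a pair of contrasting prompts and add it into the residual stream during generation. This is a minimal sketch assuming a HuggingFace GPT-2 model; the layer index, contrast prompts, and scaling coefficient are illustrative placeholders, not anything taken from your experiments.

```python
# Minimal sketch of steering a GPT-2-style model by adding a vector to the
# residual stream (in the spirit of activation-addition work). The layer,
# prompts, and coefficient below are illustrative assumptions, not tuned values.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6  # hypothetical injection site


def last_token_resid(prompt: str) -> torch.Tensor:
    """Residual-stream activation after block LAYER for the final token of `prompt`."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    # hidden_states[0] is the embedding output, so LAYER + 1 is block LAYER's output.
    return out.hidden_states[LAYER + 1][0, -1, :]


# Steering vector: difference of activations on two contrasting prompts.
steer = (last_token_resid("I carefully plan several steps ahead")
         - last_token_resid("I act on impulse"))


def add_steer(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 is the hidden-state tensor.
    return (output[0] + 4.0 * steer,) + output[1:]


handle = model.transformer.h[LAYER].register_forward_hook(add_steer)
try:
    ids = tok("My plan for today is", return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=30, do_sample=False)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # always undo the edit
```

The interesting question, to my mind, is whether edits like this can move planning-like behaviour specifically (rather than just surface sentiment), which would be some evidence about where any inner planning machinery lives.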
I think maybe this is enough to convey some sense of what I’m getting at? At this point, given some (patchy) theory, the evidence is supportive of (among other hypotheses) an ‘inner planning’ hypothesis (of quite indeterminate form).
Finally, one kind or another of ‘conditioning’ is hypothesised to reinforce the consequentialist component(s) ‘somehow’ (handwave again, though I’m hardly the only one guilty of handwaving about RLHF et al.). I think it’s appropriate to be uncertain what form the inner planning takes, what form the conditioning can/will take, and what the eventual results of that are. I’m interested in evidence and theory in this area.
So, what are we talking about when we say ‘LLM’? Plain GPTs? Well, they definitely don’t ‘do what they’re told’[1]. They exhibit planning-like outputs with the right prompts, typically associated with ‘simulated characters’ at some resolution or other. What about RLHFed GPTs? Well, they sometimes ‘do what they’re told’. They also exhibit planning-like outputs with the right prompts, and it’s mechanistically very unclear how they’re getting them.
[1] Unless you mean predicting the next token (I’m pretty sure you don’t mean this?), which they do quite well, though we don’t know how, nor when it will fail.