One is a bunch of very simple hardwired genomically-specified reward circuits over stuff like your sensory experiences or simple correlates of good sensory experiences.
I just want to flag that the word “simple” is contentious in this context. The above excerpt isn’t a specific claim (how simple is “simple”, and how big is “a bunch”?), so I guess I neither agree nor disagree with it as such. But anyway, my current guess (see here) is that reward circuits might effectively comprise tens of thousands of lines of pseudocode. That’s “simple” compared to a billion-parameter trained ML model, but it’s super complicated compared to any reward function that you could find in an RL paper on arXiv.
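To make that contrast concrete, here’s a hypothetical sketch (every term, weight, and name below is invented purely for illustration): a reward function of the sort you’d find in an arXiv RL paper is basically a one-liner, whereas the innate reward circuitry I have in mind looks more like a very long list of special-purpose terms keyed to particular sensory signals.

```python
# Hypothetical illustration only -- every term, weight, and name below is invented.
from dataclasses import dataclass

@dataclass
class Sensations:
    sweet_taste: float = 0.0
    tissue_damage: float = 0.0
    familiar_face_smiling: float = 0.0

def arxiv_style_reward(pole_is_upright: bool) -> float:
    # A reward function of the sort you'd find in an RL paper: basically one line.
    return 1.0 if pole_is_upright else 0.0

def innate_reward_sketch(s: Sensations) -> float:
    # A brain-like reward circuit, on this picture, is a long list of hardwired
    # special cases keyed to particular sensory signals -- plausibly tens of
    # thousands of lines of pseudocode in total; this sketch shows three terms.
    reward = 0.0
    reward += 5.0 * s.sweet_taste
    reward -= 10.0 * s.tissue_damage
    reward += 2.0 * s.familiar_face_smiling
    # ...and many thousands more terms like these...
    return reward

print(innate_reward_sketch(Sensations(sweet_taste=1.0)))  # 5.0
```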
There seems to be a spectrum of opinion about how complicated the reward circuitry is, with Jacob Cannell at one extreme and Geoffrey Miller at the opposite extreme and me somewhere in the middle. See Geoffrey Miller’s post here, and my comment on it, and also a back-and-forth between me & Jacob Cannell here.
And so in the human brain you have what’s basically self-supervised prediction of incoming sensory signals, like predictive processing, that sort of thing, in terms of learning to predictively model what’s going to happen in your local sensory environment. And then in deep learning, we have all the self-supervised learning of pre-training in language models that’s also learning to predict a sensory environment. Of course, the sensory environment in question is text, at least at the moment. But we’ve seen how that same approach easily extends to multimodal text-plus-image or text-plus-audio or whatever other systems.
I was a bit put off by the vibe that LLMs and humans both have self-supervised learning, and both have RL, so really they’re not that different. On the contrary, I think there are numerous alignment-relevant (and also capabilities-relevant) disanalogies between the kind of model-based RL that I believe the brain uses, versus LLM+RLHF.
Probably the most alignment-relevant of these disanalogies is that in LLMs (but not humans), the main self-supervised learning system is also simultaneously an output system.
Specifically, after self-supervised pretraining, an LLM outputs exactly the thing that it expects to see. (After RLHF, that is no longer strictly true, but RLHF is just a fine-tuning step, most of the behavioral inclinations are coming from pretraining IMO.) That just doesn’t make sense in a human. When I take actions, I am sending motor commands to my own arms and my own mouth etc. Whereas when I observe another human and do self-supervised learning, my brain is internally computing predictions of upcoming sounds and images etc. These are different, and there isn’t any straightforward way to translate between them. (Cf. here where Owain Evans & Jacob Steinhardt show a picture of a movie frame and ask “what actions are being performed?”) Now, as it happens, humans do often imitate other humans. But other times they don’t. Anyway, insofar as humans-imitating-other-humans happens, it has to happen via a very different and much less direct algorithmic mechanism than how it happens in LLMs. Specifically, humans imitate other humans because they want to, i.e. because of the history of past reinforcement, directly or indirectly. Whereas a pretrained LLM will imitate human text with no RL or “wanting to imitate” at all, that’s just mechanically what it does.
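Here’s a toy sketch of that structural difference (hypothetical code, not any real system’s architecture; the classes and methods are made up for illustration). In the LLM-style case, the output is literally a sample from the predictive distribution; in the brain-like case, the world-model’s predictions live in observation-space while the policy’s motor commands live in a different space, and nothing directly converts one into the other.

```python
# Toy, hypothetical illustration -- not any real system's architecture.
import random

class ToyCharModel:
    """LLM-style: the predictive model is also the output channel."""
    def __init__(self, text: str):
        self.table: dict[str, list[str]] = {}
        for a, b in zip(text, text[1:]):
            self.table.setdefault(a, []).append(b)

    def predict(self, ch: str) -> list[str]:
        # Predictive "distribution" over the next character...
        return self.table.get(ch, [" "])

    def act(self, ch: str) -> str:
        # ...and acting is just sampling that very same prediction.
        return random.choice(self.predict(ch))

class ToyAgent:
    """Brain-like model-based RL style: prediction and action are separate."""
    def predict_next_observation(self, obs: int) -> int:
        # Self-supervised world-model: guesses the upcoming sensory input.
        return obs + 1  # stand-in prediction rule

    def choose_action(self, obs: int) -> str:
        # Separate policy: emits a motor command from a *different* space,
        # shaped by reward/planning rather than by copying the prediction above.
        return "reach" if obs > 0 else "wait"

llm_like = ToyCharModel("hello hello hello")
print(llm_like.act("h"))                    # output token == sampled prediction
agent = ToyAgent()
print(agent.predict_next_observation(3))    # a predicted observation: 4
print(agent.choose_action(3))               # a motor command: "reach"
```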
I’m not trying to make any larger point about alignment being easy or hard, I just think it’s important to keep that particular difference clear in our heads. Well, OK, actually it is alignment-relevant—I think it weakly suggests that aligning brain-like model-based RL might be harder than aligning LLMs. (As I wrote here, “for my part, if I believed that [LLMs] were sufficient for TAI—which I don’t—then I think I would feel slightly less concerned about AI x-risk than I actually do, all things considered!”) Speaking of which:
So for example, Steven Byrnes is an excellent researcher who thinks about the human brain first and foremost, and about how to build brain-like AGI as his alignment approach.
(That’s very kind of you to say!) I think of brain-like AGI (and relatedly model-based RL AGI) as a “threat model” much more than an “alignment approach”—in the sense that future researchers might build brain-like AGI, and we need to plan for that possibility. My main professional interest is in finding alignment approaches that would be helpful for this threat model.
Separately, it’s possible that those alignment approaches might themselves be brain-like, i.e. whatever mechanisms lead to humans being (sometimes) nice to each other could presumably be a source of inspiration. That’s my main current project. But that’s not how I personally have been using the term “brain-like AGI”.
There’s no simple function of physical matter configurations that, if tiled across the entire universe, would fully satisfy the values. … It’s like, we tend to value lots of different stuff, and we sort of asymptotically run out of caring about things when there’s lots of them, or there’s decreasing marginal value for any particular fixed pattern.
(Sorry in advance if I’m missing your point.)
It’s worth noting that at least some humans seem pretty gung-ho about the project of tiling the universe with flourishing life and civilization (hi Eliezer!). Maybe they have other desires too, and maybe those other desires can be easily saturated, but that doesn’t seem safety-relevant. If superintelligent-Eliezer likes tiling the universe with flourishing life & civilization in the morning and playing cricket after lunch, then we still wind up with a tiled universe.
Or maybe you’re saying that if a misaligned AI wanted to tile the galaxy with something, it would be something more complicated and diverse than paperclips / tiny molecular squiggles? OK maybe, but if it isn’t sentient then I’m still unhappy about that.
Specifically, after self-supervised pretraining, an LLM outputs exactly the thing that it expects to see. (After RLHF, that is no longer strictly true, but RLHF is just a fine-tuning step, most of the behavioral inclinations are coming from pretraining IMO.)
Qualitatively, the differences between a purely predictively-trained LLM and one after RLHF seem quite large (e.g., see the comparison between GPT-3 and InstructGPT examples from OpenAI).
I was thinking: In pretraining they use 400,000,000,000,000 bits of information (or whatever) to sculpt the model from “every possible token string is equally likely” (or similar) to “typical internet text”. And then in RLHF they use 10,000 bits of information (or something) to sculpt the model from “typical internet text” to “annoyingly-chipper-bland-corporate-speak chatbot”. So when I say “most of the behavioral inclinations”, I guess I can quantify that as 99.99999999%? Or a different perspective is: I kinda feel like 10,000 bits (or 100,000, I don’t know what the number is) is kinda too small to build anything interesting from scratch, as opposed to tweaking the relative prominence of things that are already there. This isn’t rigorous or anything; I’m open to discussion.
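For concreteness, here’s that back-of-the-envelope arithmetic in code form, using the same made-up order-of-magnitude numbers as above (a sketch of the reasoning, not a measurement of any actual training run):

```python
# Same made-up order-of-magnitude numbers as in the text above.
pretraining_bits = 4e14   # "400,000,000,000,000 bits of information (or whatever)"
rlhf_bits        = 1e4    # "10,000 bits (or something)"

fraction_from_pretraining = pretraining_bits / (pretraining_bits + rlhf_bits)
print(f"{fraction_from_pretraining:.10%}")   # ~99.9999999975%

# Even with the more generous 100,000-bit guess for RLHF:
print(f"{pretraining_bits / (pretraining_bits + 1e5):.10%}")   # ~99.9999999750%
```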
(I absolutely agree that RLHF has very obvious effects on output.)