A sufficiently advanced AGI familiar with humans will have a clear concept of “not killing everyone” (or more specifically, “what humans mean when they say the words ‘not killing everyone’”). We just add a bit of an extra component that makes the AGI intrinsically value that concept. This implies that capabilities progress may be closely linked to alignment progress.
Some major differences off the top of my head:
Picking out a specific concept such as “not killing everyone” and making the AGI specifically value it seems hard. I assume that the AGI would have some network of concepts, and we would then either need to somehow identify that concept in the mature network, or design its learning process in such a way that the mature network puts extra weight on it. The former would probably require some kind of interpretability tools for inspecting the mature network and making sense of its concepts, so it’s a different kind of proposal (there’s a rough sketch of what that identification might look like after these points). As for the latter, maybe it could be done, but very specific concepts don’t seem to have as simple/short/natural an algorithmic description as simulating the preferences of others seems to have, so they’d seem harder to specify.
The framing of the question also implies to me that the AGI has some other pre-existing set of values or motivation system besides the one we want to install, which seems like a bad idea, since that will bring the different motivation systems into conflict and create incentives to e.g. self-modify or otherwise bypass the constraint we’ve installed.
It also generally seems like a bad idea to go manually poking around the weights of specific values and concepts without knowing how they interact with the rest of the AGI’s values/concepts. For instance, if we greatly increase the weight of “don’t kill everyone” but don’t look at how it interacts with the other concepts, maybe that leads to a With Folded Hands scenario, where the AGI decides that letting people die through inaction is also killing people, and that it has to prevent humans from doing anything that might kill them. (This is arguably less of a worry for something like “CEV”, but we don’t even seem to know what exactly CEV should be, so I don’t know how we’d put that in.)
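To make the interpretability-flavored option above a bit more concrete, here’s a minimal sketch of what “identify that concept in the mature network” might look like. Everything in it is hypothetical: get_activations stands in for whatever tooling would extract the network’s internal states, and the difference-of-means “concept direction” is about the simplest probe imaginable, not a claim about how a real AGI would actually represent the concept.

```python
import numpy as np

def concept_direction(get_activations, positive_texts, negative_texts):
    """Estimate a direction in activation space associated with a concept
    by contrasting inputs that do and don't involve it (hypothetical sketch)."""
    pos = np.mean([get_activations(t) for t in positive_texts], axis=0)
    neg = np.mean([get_activations(t) for t in negative_texts], axis=0)
    direction = pos - neg
    return direction / np.linalg.norm(direction)

def concept_score(get_activations, text, direction):
    """How strongly a given input activates the hypothesized concept."""
    return float(np.dot(get_activations(text), direction))
```

Even granting something like this, the worries in the last two points (conflicting motivation systems, unforeseen interactions between concepts) would still apply.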
Thanks! Hmm, I think we’re mixing up lots of different issues:
1. Is installing a PF motivation into an AGI straightforward, based on what we know today?
I say “no”. Or at least, I don’t currently know how you would do that, see here. (I think about it a lot; ask me again next year. :) )
If you have more thoughts on how to do this, I’m interested to hear them. You write that PF has a “simple/short/natural algorithmic description”, and I guess that seems possible, but I’m mainly skeptical that the source code will have a slot where we can input this algorithmic description. Maybe the difference is that you’re imagining that people are going to hand-write source code that has a labeled “this is an empathetic simulation” variable, and a “my preferences are being satisfied” variable? Because I don’t expect either of those to happen (well, at least not the former, and/or not directly). Things can emerge inside a trained model instead of being in the source code, and if so, then finding them is tricky.
2. Will installing a PF motivation into an AGI be straightforward in the future “by default” because capabilities research will teach us more about AGI than we know today, and/or because future AGIs will know more about the world than AIs today?
I say “no” to both. For the first one, I really don’t think capabilities research is going to help with this, for reasons here. For the second one, you write in OP that even infants can have a PF motivation, which seems to suggest that the problem should be solvable independent of the AGI understanding the world well, right?
3. Is figuring out how to install a PF motivation a good idea?
I say “yes”. For various reasons I don’t think it’s sufficient for the technical part of Safe & Beneficial AGI, but I’d rather be in a place where it is widely known how to install PF motivation, than be in a place where nobody knows how to do that.
4. Independently of which is a better idea, is the technical problem of installing a PF motivation easier, harder, or the same difficulty as the technical problem of installing a “human flourishing” motivation?
I’m not sure I care. I think we should try to solve both of those problems, and if we succeed at one and fail at the other, well I guess then we’ll know which one was the easier one. :-P
That said, based on my limited understanding right now, I think there’s a straightforward method kinda based on interpretability that would work equally well (or equally poorly) for both of those motivations, and a less-straightforward method based on empathetic simulation that would work for human-style PF motivation and maybe wouldn’t be applicable to “human flourishing” motivation. I currently feel like I have a better understanding of the former method, and more giant gaps in my understanding of the latter method. But if I (or someone) does figure out the latter method, I would have somewhat more confidence (or, somewhat less skepticism) that it would actually work reliably, compared to the former method.
Thanks, this seems like a nice breakdown of issues!
If you have more thoughts on how to do this, I’m interested to hear them. You write that PF has a “simple/short/natural algorithmic description”, and I guess that seems possible, but I’m mainly skeptical that the source code will have a slot where we can input this algorithmic description. Maybe the difference is that you’re imagining that people are going to hand-write source code that has a labeled “this is an empathetic simulation” variable, and a “my preferences are being satisfied” variable? Because I don’t expect either of those to happen (well, at least not the former, and/or not directly). Things can emerge inside a trained model instead of being in the source code, and if so, then finding them is tricky.
So I don’t think that there’s going to be hand-written source code with slots for inserting variables. When I say that I expect PF to have a “natural” algorithmic description, I mean natural in a sense that’s something like “the kinds of internal features that LLMs end up developing in order to predict text, because those are natural internal representations to develop when you’re being trained to predict text, even though no human ever hand-coded them or even knew what they would be before inspecting the LLM internals after the fact”.
Phrased differently, the claim might be something like “I expect that if we develop more advanced AI systems that are trained to predict human behavior and to act in ways that they predict will please humans, then there is a combination of cognitive architecture (in the sense that “transformer-based LLMs” are a “cognitive architecture”) and reward function that will naturally end up learning to do PF, because that’s the kind of thing that actually does let you best predict and fulfill human preferences”.
The intuition comes from something like… looking at LLMs, it seems like language was in some sense “easy” or “natural”—just throw enough training data at a large enough transformer-based model, and a surprisingly sophisticated understanding of language emerges. One that probably ~nobody would have expected just five years ago. In retrospect, maybe this shouldn’t have been too surprising—maybe we should expect most cognitive capabilities to be relatively easy/natural to develop, and that’s exactly the reason why evolution managed to find them.
If that’s the case, then it might be reasonable to guess that PF could be the same kind of easy/natural, and that it’s this naturalness which allowed evolution to develop social animals in the first place. And if most cognition runs on prediction, then maybe the naturalness comes from there being only relatively small tweaks in the reward function that take you from predicting & optimizing your own well-being to also predicting & optimizing the well-being of others.
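As a toy illustration of what I mean by “relatively small tweaks in the reward function”: the placeholder functions below aren’t a claim about how a real motivation system is actually wired up, just a way of showing that if the machinery for predicting your own well-being already exists, reusing it for others’ well-being is a small change.

```python
def selfish_reward(state, predict_own_wellbeing):
    # Purely self-regarding: value a state by your own predicted well-being.
    return predict_own_wellbeing(state)

def pf_reward(state, predict_own_wellbeing, predict_others_wellbeing, care=0.5):
    # Same predictive machinery, one extra term: also value the predicted
    # well-being / preference fulfillment of others.
    return predict_own_wellbeing(state) + care * predict_others_wellbeing(state)
```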
If you ask me what exactly that combination of cognitive architecture and reward function is… I don’t know. Hopefully, e.g. your research might one day tell us. :-) The intent of the post is less “here’s the solution” and more “maybe this kind of a thing might hold the solution, maybe we should try looking in this direction”.
2. Will installing a PF motivation into an AGI be straightforward in the future “by default” because capabilities research will teach us more about AGI than we know today, and/or because future AGIs will know more about the world than AIs today?
I say “no” to both. For the first one, I really don’t think capabilities research is going to help with this, for reasons here. For the second one, you write in OP that even infants can have a PF motivation, which seems to suggest that the problem should be solvable independent of the AGI understanding the world well, right?
I read your linked comment as an argument for why social instincts are probably not going to contribute to capabilities—but I think that doesn’t establish the opposite direction of “might capabilities be necessary for social instincts” or “might capabilities research contribute to social instincts”?
If my model above is right, that there’s a relatively natural representation of PF that will emerge with any AI systems that are trained to predict and try to fulfill human preferences, then that kind of a representation should emerge from capabilities researchers trying to train AIs to better fulfill our preferences.
3. Is figuring out how to install a PF motivation a good idea?
I say “yes”.
You’re probably unsurprised to hear that I agree. :-)
4. Independently of which is a better idea, is the technical problem of installing a PF motivation easier, harder, or the same difficulty as the technical problem of installing a “human flourishing” motivation?
What do you have in mind with a “human flourishing” motivation?
What do you have in mind with a “human flourishing” motivation?
An AI that sees human language will certainly learn the human concept “human flourishing”, since after all it needs to understand what humans mean when they utter that specific pair of words. So then you can go into the AI and put super-positive valence on (whatever neural activations are associated with “human flourishing”). And bam, now the AI thinks that the concept “human flourishing” is really great, and if we’re lucky / skillful then the AI will try to actualize that concept in the world. There are a lot of unsolved problems and things that could go wrong with that (further discussion here), but I think something like that is not entirely implausible as a long-term alignment research vision.
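As a purely hypothetical sketch of that “put super-positive valence on the activations” step (all names made up, and glossing over the unsolved problems just mentioned): suppose interpretability tools gave us a direction in activation space associated with “human flourishing”; then the valence assigned to a candidate thought or plan might be bumped up in proportion to how strongly that direction is active.

```python
import numpy as np

def valence(thought_activations, flourishing_direction, base_value, boost=10.0):
    """Valence of a candidate thought/plan: its ordinary learned value, plus a
    large bonus proportional to how strongly the activations associated with
    "human flourishing" fire (all names here are hypothetical)."""
    concept_activity = float(np.dot(thought_activations, flourishing_direction))
    return base_value + boost * concept_activity
```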
I guess the anthropomorphic analog would be: try to think of who is the coolest / highest-status-to-you / biggest-halo-effect person in your world. (Real or fictional.) Now imagine that this person says to you: “You know what’s friggin awesome? The moon. I just love it. The moon is the best.” You stand there with your mouth agape. “Wow, huh, the moon, yeah, I never thought about it that way.” (But 100× more so. Maybe you’re on some psychedelic at the time, or whatever.)
How would that event change your motivations? Well, you’re probably going to spend a lot more time gazing at the moon when it’s in the sky. You’re probably going to be much more enthusiastic about anything associated with the moon. If there are moon trading cards, maybe you would collect them. If NASA is taking volunteers to train as astronauts for a lunar exploration mission, maybe you would be first in line. If a supervillain is planning to blow up the moon, you’ll probably be extremely opposed to that.
Now by the same token, imagine we do that kind of thing for an extremely powerful AGI and the concept of “human flourishing”. What actions will this AGI then take? Umm, I don’t know really. It seems very hard to predict. But it seems to me that there would be a decent chance that its actions would be good, or even great, as judged by me.
I read your linked comment as an argument for why social instincts are probably not going to contribute to capabilities—but I think that doesn’t establish the opposite direction of “might capabilities be necessary for social instincts” or “might capabilities research contribute to social instincts”?
Sorry, that’s literally true, but they’re closely related. If answering the question “What reward function leads to human-like social instincts?” is unhelpful for capabilities, as I claim it is, then it implies both (1) my publishing such a reward function would not speed capabilities research, and (2) current & future capabilities researchers will probably not try to answer that question themselves, let alone succeed. The comment I linked was about (1), and this conversation is about (2).
If my model above is right, that there’s a relatively natural representation of PF that will emerge with any AI systems that are trained to predict and try to fulfill human preferences, then that kind of a representation should emerge from capabilities researchers trying to train AIs to better fulfill our preferences.
Sure, but “the representation is somewhere inside this giant neural net” doesn’t make it obvious what reward function we need, right? If you think LLMs are a good model for future AGIs (as most people around here do, although I don’t), then I figure those representations that you mention are already probably present in GPT-3, almost definitely to a much larger extent than they’re present in human toddlers. For my part, I expect AGI to be more like model-based RL, and I have specific thoughts about how that would work, but those thoughts don’t seem to be helping me figure out what the reward function should be. If I had a trained model to work with, I don’t think I would find that helpful either. With future interpretability advances maybe I would say “OK cool, here’s PF, I see it inside the model, but man, I still don’t know what the reward function should be.” Unless of course I use the very-different-from-biology direct interpretability approach (analogous to the “human flourishing” thing I mentioned above).
Update: writing this comment made me realize that the first part ought to be a self-contained post; see Plan for mediocre alignment of brain-like [model-based RL] AGI. :)