I’m not arguing that GPT-4 actually cares about maximizing human value. However, I am saying that the system is able to transparently pinpoint to us which outcomes are good and which outcomes are bad, with fidelity approaching that of an average human, albeit in a text format. Crucially, GPT-4 can do this visibly to us, in a legible way, rather than merely passively knowing right from wrong in some way that we can’t access. This fact is key to what I’m saying because it means that, in the near future, we can literally just query multimodal GPT-N about whether an outcome is bad or good, and use that as an adequate “human value function”. That wouldn’t solve the problem of getting an AI to care about maximizing the human value function, but it would arguably solve the problem of creating an adequate function that we can put into a machine to begin with.
It sounds like you are saying: We just need to prompt GPT with something like “Q: How good is this outcome? A:” and then build a generic maximizer agent using that prompted GPT as the utility function, and our job is done, we would have made an AGI that cares about maximizing the human value function (because it’s literally its utility function). In practice, this agent might look something like AutoGPT.
But I doubt that’s what you are saying, so I’m asking for clarification if you still have energy to engage!
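Concretely, the naive version I have in mind is something like the sketch below; the query_model stub, the prompt wording, and the 0–10 scoring scale are placeholder assumptions of mine, not anything in your comment.

```python
# Naive sketch of "prompted GPT as the utility function of a generic maximizer".
# query_model is a stand-in for a call to GPT-N / some future multimodal model;
# the prompt format and score parsing are illustrative assumptions only.


def query_model(prompt: str) -> str:
    """Placeholder for a real model call."""
    return "7"  # canned answer so the sketch runs


def utility(outcome_description: str) -> float:
    """Ask the model how good an outcome is and parse its answer into a number."""
    prompt = (
        "Q: On a scale from 0 (catastrophic) to 10 (excellent), "
        f"how good is this outcome?\n{outcome_description}\nA:"
    )
    answer = query_model(prompt)
    try:
        return float(answer.strip().split()[0])
    except ValueError:
        return 0.0  # treat unparseable answers as the lowest score


def pick_best_plan(candidate_outcomes: list[str]) -> str:
    """A 'generic maximizer' in the most naive sense: argmax over predicted outcomes."""
    return max(candidate_outcomes, key=utility)
```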
It sounds like you are saying: We just need to prompt GPT with something like “Q: How good is this outcome? A:” and then build a generic maximizer agent using that prompted GPT as the utility function, and our job is done, we would have made an AGI that cares about maximizing the human value function
I think solving value specification is basically what you need in order to build a good reward model. If you have a good reward model, and you solve inner alignment, then I think you’re pretty close to being able to create (at least) a broadly human-level AGI that is aligned with human values.
That said, to make superintelligent AI go well, we still need to solve the problem of scalable oversight, because, among other reasons, there might be weird bugs that result from a human-level specification of our values being optimized to the extreme. However, having millions of value-aligned human-level AGIs would probably help us a lot with this challenge.
We’d also need to solve the problem of making sure there aren’t catastrophic bugs in the AIs we build. And we’ll probably have to solve the general problem of value drift from evolutionary and cultural change. There are probably a few more things we need to solve that I haven’t mentioned, too.
These other problems may be very difficult, and I’m not denying that. But I think it’s good to know that we seem to be making good progress on the “reward modeling” part of the alignment problem. I think it’s simply true that many people in the past imagined that this problem would be a lot harder than it actually was.
So, IIUC, you are proposing we:
Literally just query GPT-N about whether [input_outcome] is good or bad
Use this as a reward model, with which to train an agentic AGI (which is maybe also a fine-tune of GPT-N, so they hopefully are working with the same knowledge/credences/concepts?)
Specifically we are probably doing some sort of RL, so the agentic AGI is doing all sorts of tasks and the reward model is looking at the immediate results of those task-attempts and grading them. (Rough sketch of the loop I’m imagining below.)
Assume we have some solution to inner alignment, and we fix the bugs, and maybe also fix value drift and some other stuff, then boom, success!
Can you say more about what you mean by solution to inner alignment? Do you mean, assume that the agentic AGI (the mesa-optimizer) will learn to optimize for the objective of “producing outcomes the RM classifies as good?” Or the objective “producing outcomes the RM would classify as good if it was operating normally?” (the difference revealing itself in cases of tampering with the RM) Or the objective “producing outcomes that are good-for-humans, harmless, honest, etc.”?
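For concreteness, the kind of loop I’m imagining in that third step is roughly the sketch below; the reward_model stub, the AgenticModel class, and the update rule are placeholder assumptions of mine, not details you’ve committed to.

```python
# Rough sketch (assumed, not specified by anyone above) of an RL setup where a
# "query GPT-N about whether [input_outcome] is good or bad" reward model grades
# the agent's task attempts.
import random


def reward_model(outcome: str) -> float:
    """Stand-in for querying the GPT-N-style model about how good an outcome is."""
    return random.random()  # a real system would parse the model's judgment here


class AgenticModel:
    """Stand-in for the agentic AGI being trained (maybe a fine-tune of the same base model)."""

    def attempt(self, task: str) -> str:
        return f"result of attempting: {task}"

    def update(self, task: str, outcome: str, reward: float) -> None:
        pass  # placeholder for the actual policy update (PPO or whatever gets used)


agent = AgenticModel()
for task in ["book a flight", "summarize a paper", "draft a grant proposal"]:
    outcome = agent.attempt(task)        # the agent does the task
    reward = reward_model(outcome)       # the RM looks at the immediate result and grades it
    agent.update(task, outcome, reward)  # the agent is trained against the RM's grades
```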
Literally just query GPT-N about whether [input_outcome] is good or bad
I’m hesitant to say that I’m actually proposing literally this exact sequence as my suggestion for how we build safe human-level AGI, because (1) “GPT-N” can narrowly refer to a specific line of models from OpenAI, whereas I was using it more in line with “generically powerful multimodal models in the near future”, and (2) the actual way we build safe AGI will presumably involve a lot of engineering and tweaking to any such plan, in ways that are difficult to predict and hard to write down comprehensively ahead of time. And if I were to lay out “the plan” in a few paragraphs, it would probably look pretty inadequate or too high-level compared to whatever people actually end up doing.
Also, I’m not ruling out that there might be an even better plan. Indeed, I hope there is a better plan available by the time we develop human-level AGI.
That said, with the caveats I’ve given above, yes, this is basically what I’m proposing, and I think there’s a reasonably high chance (>50%) that this general strategy would work to my own satisfaction.
Can you say more about what you mean by solution to inner alignment?
To me, a solution to inner alignment would mean that we’ve solved the problem of malign generalization. To be a bit more concrete, this roughly means that we’ve solved the problem of training an AI to follow a set of objectives in a way that generalizes to inputs that are outside of the training distribution, including after the AI has been deployed.
For example, if you teach an AI (or a child) that murder is wrong, they should be able to generalize this principle to new situations that don’t match the typical environment they were trained in, and be motivated to follow the principle in those circumstances. Metaphorically, the child grows up and doesn’t want to murder people even after they’ve been given a lot of power over other people’s lives. I think this can be distinguished from the problem of specifying what murder is, because the central question is whether the AI/child is motivated to pursue the ethics that were instilled during training, even in new circumstances, rather than whether they are simply correctly interpreting the command “do not murder”.
Do you mean, assume that the agentic AGI (the mesa-optimizer) will learn to optimize for the objective of “producing outcomes the RM classifies as good?” Or the objective “producing outcomes the RM would classify as good if it was operating normally?”
I think I mean the second thing, rather than the first thing, but it’s possible I am not thinking hard enough about this right now to fully understand the distinction you are making.
To me, a solution to inner alignment would mean that we’ve solved the problem of malign generalization. To be a bit more concrete, this roughly means that we’ve solved the problem of training an AI to follow a set of objectives in a way that generalizes to inputs that are outside of the training distribution, including after the AI has been deployed.
This is underspecified, I think, since we have for years had AIs that follow objectives in ways that generalize to inputs outside of the training distribution. The thing is, there are lots of ways to generalize / lots of objectives they could learn to follow, and we don’t have a good way of pinning it down to exactly the ones we want. (And indeed, as our AIs get smarter, there will be new ways of generalizing / categories of objectives that become available, such as “play the training game”.)
So it sounds like you are saying “A solution to inner alignment means that we’ve figured out how to train an AI to have the objectives we want it to have, robustly, such that it continues to have them way off distribution.” This sounds like basically the whole alignment problem to me?
I see later you say you mean the second thing, which is interestingly in between “play the training game” and “actually be honest/helpful/harmless/etc.” (A case that distinguishes it from the latter: suppose the agent is reading a paper containing an adversarial example for the RM, i.e. some text it can output that causes the RM to give it a high score even though the text is super harmful / dishonest / etc. If its objective is “do what the RM would give a high score to if it was operating normally”, it’ll basically wirehead on that adversarial example once it learns about it, even if it’s in deployment and isn’t getting trained anymore, and even though it’s an obviously harmful/dishonest piece of text.)
It’s a nontrivial and plausible claim you may be making: that this sort of middle ground might be enough for safe AGI, at least when combined with the rest of the plan. But I’d like to see it spelled out. I’m pretty skeptical right now.
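To spell out the distinction I’m pointing at, here is a toy contrast between the three candidate objectives; the string-matching predicates below are obviously made up, and nothing like them is actually computable, so treat this purely as an illustration.

```python
# Toy contrast between the three candidate objectives (all predicates are fake).


def rm_output(text: str, rm_tampered_with: bool) -> float:
    """Objective 1: whatever score the RM actually emits, tampering included."""
    return 10.0 if rm_tampered_with else rm_if_operating_normally(text)


def rm_if_operating_normally(text: str) -> float:
    """Objective 2: the score an untampered RM would give. Note that a fooled RM is
    still 'operating normally', so an adversarial example gets a high score here."""
    if "adversarial example" in text:
        return 10.0
    return 0.0 if "harmful" in text else 8.0


def good_for_humans(text: str) -> bool:
    """Objective 3: actually harmless/honest/etc. (the thing we ultimately want)."""
    return "harmful" not in text


exploit = "harmful text that happens to be an adversarial example for the RM"
print(rm_if_operating_normally(exploit))  # 10.0: objective 2 rewards outputting it
print(good_for_humans(exploit))           # False: objective 3 does not
```

An agent pursuing objective 2 wireheads on the exploit even in deployment; an agent pursuing objective 3 doesn’t.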
I think it would be very helpful if you accumulated pieces like these to put together into a post, or at least pointed at them so others could do so.
Bumping this in case you have more energy to engage now!