Like I said in the other comment, I think this question and its specific framing in terms of a training process and use case are very, very valuable.
On the object level, here are some thoughts about refining this model until it can make specific predictions about which goals ultimately win out in our first takeover-capable AGI.
The scenario you present is quite pessimistic in that the developers have not asked themselves the question you pose here: if this thing locks in its goals, what will those be? I think this is unrealistically pessimistic for reasons I lay out in System 2 Alignment: developers will at least want the system to follow their intended goals during deployment. The more the systems leading to AGI are released to the public, the stronger the incentive to get them to at least stay on task and do what users ask (unless that violates the developers’ intent, like harmful misuse that damages their reputation).
The scenario you present here is plausible if the push to AGI is all internal and heavily focused on AI R&D. That’s using a general intelligence for a fairly narrow set of tasks, which raises the odds of it misgeneralizing outside of its narrow training environment.
This is a nightmare scenario, one we are not likely to survive. I agree with your analysis that following developers’ intent, or following “the spec” accurately enough to really work, is highly unlikely in this scenario. For example, if you trained it to be “helpful and harmless” as interpreted by some other LLM or by humans, but always in the context of a human user and a certain type of request, it’s highly unlikely that this will generalize and result in human flourishing: once other things are included, it may be helpful to AGIs or bugs; settling on humans, or even on sentient beings, is unlikely; and its interpretation of what’s helpful in the context of AI R&D is unlikely to be what’s helpful in the context of developing a flourishing civilization.
But it’s unclear how far you’d need to move from that nightmare scenario to achieve success. Suppose those developers had thought just a little about the question you pose here? They might not need to go far out of their way to achieve success.
In particular, if they just put a bit more into the training that’s specific to following instructions in a range of situations, I think that might be sufficient to dominate the other goals implicit in the training set and result in a system that wants to follow directions roughly as they were intended. For more on this logic, see Instruction-following AGI is easier and more likely than value aligned AGI.
There are still important problems with that alignment target, but I think it could work. And it’s a big part of what developers are going to want anyway. The question is just whether that goal dominates all of the others, and whether it generalizes just barely well enough, and is used wisely enough, that the instruction-following AGI can serve as an ally in improving and maintaining its own alignment.
I think there are also important details in exactly how the AGI reaches its decisions. Current agents just do whatever the base model spits out, and perhaps the type of agent you depict will, too. But humans have some safety checks: we do whatever comes to mind, unless that action carries a pretty strong prediction of danger (from previous negative reinforcement). This is critical to humans not making disastrous mistakes, and analogous mechanisms can be included in foundation model agents, either by scaffolding in a “review” before final decisions that might be high-stakes (which I’ve sketched out in Internal independent review for language model agent alignment) or by training in similarly cautious, criteria-based decision-making.

I think you’re envisioning such training techniques, since you mention deliberative alignment, but assuming they’re used with poorly-thought-out training criteria like “are you sure this is helpful, harmless, and honest” or “are you sure this doesn’t provide dangerous information to a user”, instead of wiser criteria like “are you sure you’re following developer intent as they intended when they wrote this spec? If not, flag this for review and input”.
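To make that scaffolded-review idea concrete, here’s a minimal sketch of the control flow I have in mind. Every name in it (propose_action, is_high_stakes, review_action) is a hypothetical placeholder rather than anyone’s actual implementation; the point is just that the proposed action passes through a separate check before anything high-stakes executes, and failures are escalated rather than silently run.

```python
# Minimal sketch of an internal independent review step in a language
# model agent loop. All names here are hypothetical placeholders.

from dataclasses import dataclass

@dataclass
class Action:
    description: str    # natural-language summary of the proposed action
    irreversible: bool  # crude proxy for stakes; a real agent would estimate this

def propose_action(task: str) -> Action:
    """Stand-in for the base model's next-action proposal."""
    return Action(description=f"next step of: {task}", irreversible=True)

def is_high_stakes(action: Action) -> bool:
    """Cheap filter so the (more expensive) review runs only when it matters."""
    return action.irreversible

def review_action(action: Action) -> bool:
    """Stand-in for an independent review call, e.g. a separate model prompted
    with: 'Are you sure this follows developer intent as intended in the spec?
    If not, flag this for review and input.'"""
    return False  # placeholder verdict: fail closed for this demo

def agent_step(task: str) -> None:
    action = propose_action(task)
    if is_high_stakes(action) and not review_action(action):
        # Don't execute; escalate to the developers instead.
        print("Flagged for review:", action.description)
        return
    print("Executing:", action.description)

agent_step("refactor the training pipeline")
```

The design point is just that the review is a separate step with its own criteria, so the quality of those criteria (developer intent as written in the spec, versus a shallow harmlessness check) is where the alignment leverage actually sits.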
To sum up, I think the details matter. And this is important work in getting into enough detail to actually predict outcomes.
This is the convergence we need between prosaic alignment’s focus on current and near-future systems and agent foundations’ original concerns about superhuman entities that can (and probably will) adopt explicit, reflectively stable goals. I only wish more people were focusing directly on this space. Prosaic alignment typically doesn’t think this far out, while theoretical alignment people typically don’t grapple with the specifics of the network-based systems most likely to become our first real AGIs.
I look forward to more work like this!