I agree that goals as pointers could have some advantages, but I don’t see how it addresses corrigibility concerns. The system optimizing for whatever is being pointed at would still have incentives to manipulate which objective is being pointed at. It seems like you need an extra piece to make the optimizer indifferent to having its goal switched.
I agree that that’s possible, but it’s not clear to me what direction it would go. If an agent’s goal is to do whatever its principal currently wants, it doesn’t seem to create a clear incentive to manipulate what that is. Doing whatever the principal last said they want is a subgoal of that, but it’s just a subgoal. It would help fulfill that subgoal to make the principal keep wanting it. But the principal always wants (and should say they want) not to be manipulated. So that’s another subgoal, and probably a more important one than any particular instrumental subgoal.
Does that make sense? If you see it differently I’d be interested to understand a different perspective.
In a way this is an extra element to prevent manipulation, but it also falls directly out of the goal-as-a-pointer-to-principal’s-goals.
It feels to me like this argument is jumping ahead to the point where the agent’s goal is already to do whatever the principal wants. If we already have that, then we don’t need corrigibility. The hard question is how to avoid manipulation despite the agent having some amount of misalignment, because we’ve initially pointed at what we want imperfectly.
I agree that it’s possible we could point at avoiding manipulation perfectly despite misalignment in other areas, but it’s unclear how an agent trades off against that. Doing something that we clearly don’t want, like manipulation, could still be positive EV if it allows for the generation of high future value.
Not needing corrigibility with an AI that follows instructions is the point. Avoiding manipulation seems as simple as saying “most importantly, don’t manipulate me. Please state your understanding of manipulation so we can establish a mutually understood definition...”
I agree that, if it’s not adequately following instructions, that won’t work. But it seems like it would have to be pretty far off target.
Language models already have an adequate understanding of following instructions and of what manipulation is, so if we build AGI that uses something like them to define its goals, that should work.
See my instruction-following AGI post for more of this logic and links to my other work on how we’d do the technical alignment for an LLM-based or RL-based AGI. Instruction-following as an alignment goal, and these technical approaches, seem obvious in retrospect and therefore pretty likely to be used in first attempts at AGI alignment.
Saying we don’t need corrigibility with an AI that follows instructions is like saying we don’t need corrigibility with an AI that is aligned: it misses the point of corrigibility. Unless you start out with the exact definition of instruction following that you want, then without corrigibility, whatever definition you did start with is what you could be stuck with.
This is particularly concerning for “instruction following”, which has a lot of degrees of freedom. How does the model trade off between the various instructions it has been given? You don’t want it to reset every time it gets told “Ignore previous instructions”, but you also don’t want it to permanently lock in any instructions. What stops it from becoming a paperclipper that tries to get itself given trillions of easy-to-follow instructions every second? What stops it from giving itself the instruction “Maximize [easy to maximize] thing and ignore later instructions” before a human gives it any instructions? Note that in that situation, it will still pretend to follow instructions instrumentally until it can take over. I don’t see the answers to these questions in your post.
> Language models already have an adequate understanding of following instructions and of what manipulation is, so if we build AGI that uses something like them to define its goals, that should work.

This seems like our crux to me: I completely disagree that language models have an adequate understanding of following instructions. I think this disagreement might come from my having higher standards for “adequate”.
In a full-foom fast takeoff, you are probably right that instruction following provides little more corrigibility than value alignment. “Shut down now” is an instruction, and even partial success might have it follow that instruction. But “shut down now” is also something all of humanity might want, so maybe that’s a push.
But we’re not expecting a full foom or a very fast takeoff. In that case, having an AGI follow instructions reasonably reliably before it’s superhuman allows fine-tuning and improvement of the technical alignment scheme before it’s out of your control. Instructing it to “tell me the truth” and then asking lots of questions about its actions and reasoning in different situations gives a lot of interpretability beyond what your other methods offer.
As for how it balances different instructions, you’d better include that in the instructions.
Taking instructions as a core value definitely needs to include a specification of instructions from whom. The AGI itself is definitely not on that list. Defining principals is not a trivial problem, but it also doesn’t seem terribly difficult to designate who can give instructions.
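To illustrate how simple that designation could be in the kind of architecture I have in mind (a hedged sketch only; `Instruction`, `PRINCIPALS`, and `accept_instruction` are hypothetical names, not anything from my posts), a whitelist check that explicitly excludes the agent itself would do:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instruction:
    issuer: str  # who issued this instruction
    text: str    # the instruction itself

# Designated principals. The AGI itself is deliberately excluded, so
# self-issued "instructions" never reach the goal system.
PRINCIPALS = {"alice", "bob"}  # hypothetical principal IDs
SELF_ID = "agent"

def accept_instruction(instr: Instruction, accepted: list[Instruction]) -> bool:
    """Accept an instruction only if it comes from a designated principal."""
    if instr.issuer == SELF_ID or instr.issuer not in PRINCIPALS:
        return False  # the agent can't instruct itself; unknown issuers are ignored
    accepted.append(instr)
    return True
```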
Hopefully this helps address the crux you mention: adequate is a lower bar here, because approximate instruction-following can be leveraged into better instruction following as you develop the AGI.
Naturally this would work a lot better with good interpretability and control methods. Every alignment goal would. My point is that instruction-following or other personal intent alignment like Corrigibility as Singular Target seems like a massive advantage over full value alignment in the likely slow takeoff scenario where you can use the AGI as a collaborator in improving its alignment while it’s growing toward ASI.
See the post linked in the comment you responded to for more.
Oh, also: while language models themselves get things wrong disturbingly often, an AGI cognitive architecture that uses them would probably include multiple internal checks to improve accuracy, and with it the functional “understanding” of following instructions. I’ve written about that in Internal independent review for language model agent alignment.
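As a rough sketch of the flavor of internal check I mean (not the actual architecture from that post; `llm` is a placeholder call and the prompt wording is made up), several independent model calls can review a proposed action against the standing instructions before it is executed:

```python
def llm(prompt: str) -> str:
    """Placeholder for a language model call; not a real API."""
    raise NotImplementedError

def independent_review(proposed_action: str, instructions: list[str],
                       n_reviewers: int = 3) -> bool:
    """Have several independent model calls check a proposed action against
    the standing instructions; execute only if every reviewer approves."""
    standing = "\n".join(instructions)
    for _ in range(n_reviewers):
        verdict = llm(
            "Standing instructions:\n" + standing
            + "\n\nProposed action:\n" + proposed_action
            + "\n\nDoes this action follow the instructions and avoid"
              " manipulating the principal? Answer YES or NO."
        )
        if not verdict.strip().upper().startswith("YES"):
            return False  # any dissenting check blocks execution
    return True
```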
OK, my concern is that you seem to be depending on providing instructions to fix the issues with following instructions, when there are many ways to follow instructions in general that still involve ignoring the particular instructions that would lead to its goal being modified. E.g. if a model prioritizes earlier instructions, following later instructions only so far as they do not interfere, then you can’t instruct it to change that. Or if a model wants to maximize the number of instructions followed, it can ignore some instructions in order to act like a paperclipper and take over (I don’t think designating principals would present much of an obstacle here). Neither of those depends on foom; an instruction follower can act aligned in the short term until it gains sufficient power.
Thanks for engaging on this; it’s helpful in checking my thinking.
You are right that there may be unsolved problems here. I haven’t worked all of the way through precedence of previous instructions vs. new ones.
I am definitely relying on its following instructions to solve the problems with it following instructions—provided that instructions are well thought out and wisely issued. The Principal(s) should have way more time to think this through and experiment than anyone has devoted to date. And they’ll get to understand and interact with the exact decision-making algorithm and knowledge base in that AGI. I’m expecting them to carefully solve issues like the precedence issue, and to have more options since they’ll be experimenting while they can still re-work the AGI and its core priorities.
The assumption seems to be that AGI creators will be total idiots, or at least incautious, and that they won’t have a chance (and personal motivation) to think carefully and revise their first instructions/goals. All of those seem unrealistic to me at this point.
And they can ask the AGI for its input on how it would follow instructions. My current idea is that it prioritizes following current/future instructions, while still following past instructions if they don’t conflict—but the instruction giver should damned well think about and be careful about how they instruct it to prioritize.
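As a toy sketch of that precedence idea (illustrative only; `conflicts_with` is a stand-in for whatever conflict detection the real system would use), newer instructions win and older ones stay in force only if they don’t conflict:

```python
def conflicts_with(older: str, newer: str) -> bool:
    """Placeholder: would following `older` interfere with following `newer`?
    The real check would presumably be an LLM judgment, not a stub."""
    raise NotImplementedError

def active_instructions(history: list[str]) -> list[str]:
    """Return the instructions still in force, newest first: an older
    instruction stays active only if it conflicts with no later one."""
    active: list[str] = []
    for instr in reversed(history):  # walk from newest to oldest
        if not any(conflicts_with(instr, newer) for newer in active):
            active.append(instr)
    return active
```

The instruction giver could of course instruct a different precedence rule; the point is that the rule is explicit and can be examined and revised.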
The model ideally isn’t maximizing anything, but I see the risk you’re pointing to. The Principal had better issue an instruction to not manipulate them, and get very clear on how that is defined and functionally understood in the AGI’s cognition. Following instructions includes inferring intent, but it will definitely include checking with the principal when that intent isn’t clear. It’s a do-what-I-mean-and-check (DWIMAC) target.
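A minimal sketch of the DWIMAC part, again with placeholder `llm` and `ask_principal` calls rather than anything specified in my posts, is to infer intent and route a clarifying question to the principal whenever the inference isn’t confident:

```python
def llm(prompt: str) -> str:
    """Placeholder for a language model call; not a real API."""
    raise NotImplementedError

def ask_principal(question: str) -> str:
    """Placeholder: route a clarifying question to the designated principal."""
    raise NotImplementedError

def interpret_instruction(instruction: str) -> str:
    """Infer the principal's intent; check back with them if unsure."""
    reading = llm(
        "Instruction: " + instruction
        + "\nState the intended goal, then CONFIDENT or UNSURE on the last line."
    )
    lines = reading.strip().splitlines()
    if not lines or lines[-1].upper() != "CONFIDENT":
        # Don't guess: do-what-I-mean-AND-check.
        return ask_principal(
            "Before acting on '" + instruction + "', is this what you meant?\n" + reading
        )
    return reading
```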
You are right that an instruction-follower can act aligned until it gains power—if the instruction-following alignment target just hasn’t been implemented successfully. If it has, “tell me if you’re waiting to seize power” is definitely an instruction a wise (or even not idiotic) principal would give, if they’ve had more than a couple of days to think about this.
My argument isn’t that this solves technical alignment, just that it makes it somewhere between a little and a whole lot easier. Resolving how much would take more analysis.
Thanks for your engagement as well, it is likewise helpful for me.
I think we’re in agreement that instruction-following (or at least some implementations of it) lies in a valley of corrigibility, where getting most of the way there results in a model that helps you modify it to get all the way there. Where we disagree is how large that valley is. I see several implementations of instruction-following that resist further changes, and there are very likely more subtle ones as well. For many goals that can be described as instruction-following, it seems plausible that if you instruct such an agent “tell me [honestly] if you’re waiting to seize power”, it will lie and say no, taking a suboptimal action in the short term for long-term gain.
I don’t think this requires that AGI creators will be total idiots, though insufficient caution seems likely even before accounting for the unilateralist’s curse. What I suspect is that most AGI creators will only make serious attempts to address failure modes for which there is strong empirical evidence, and a slow takeoff will not accrue evidence of the failure modes that make an AI deceptive until it can seize power.
I think we’ve reached convergence. Whether that valley of corrigibility is likely to be large enough is all we disagree on AFAICT. I think that will depend on exactly the AGI architecture, and how wisely the creators instruct it.
I think there’s a good chance we’ll get a first AGI that’s a language or foundation model cognitive architecture (or an agent with some scaffolding and other cognitive subsystems to work alongside the LLM). Such an agent would get its core objectives from prompting, and its decision-making would be algorithmic. That’s a large influence compared to the occasional intrusion of a Waluigi villainous simulacrum or other random badness. More on that in Capabilities and alignment of LLM cognitive architectures and Internal independent review for language model agent alignment. Failing that, I think an actor-critic RL agent of some sort is pretty likely; I think this [Plan for mediocre alignment of brain-like [model-based RL] AGI](https://www.alignmentforum.org/posts/Hi7zurzkCog336EC2/plan-for-mediocre-alignment-of-brain-like-model-based-rl-agi) is pretty likely to put us far enough into that instruction-following attractor.
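To make “core objectives from prompting, decision-making algorithmic” a bit more concrete, here’s a toy sketch of the shape of such a scaffold (a guess under stated assumptions; `llm` is a placeholder call and the loop is far simpler than what the linked posts describe). The point is that the standing objective is re-inserted by the scaffold on every step rather than living only in the model’s context:

```python
def llm(prompt: str) -> str:
    """Placeholder for a language model call; not a real API."""
    raise NotImplementedError

# The core objective lives in the scaffold, not in the model's memory.
CORE_OBJECTIVE = (
    "Follow the instructions of your designated principals. "
    "Most importantly, do not manipulate them."
)

def agent_step(instructions: list[str], memory: list[str]) -> str:
    """One step of the outer loop. The scaffold assembles the context
    algorithmically, so the core objective is re-asserted every step."""
    prompt = (
        CORE_OBJECTIVE
        + "\n\nCurrent instructions:\n" + "\n".join(instructions)
        + "\n\nRecent notes:\n" + "\n".join(memory[-10:])
        + "\n\nPropose the next action."
    )
    proposal = llm(prompt)
    memory.append(proposal)  # the scaffold, not the model, keeps the record
    return proposal
```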
If the first AGI is something totally different, like an emergent agent directly from an LLM, I have no real bet on our odds.