Thanks for engaging on this; it’s helpful in checking my thinking.
You are right that there may be unsolved problems here. I haven’t worked all the way through the precedence of previous instructions vs. new ones.
I am definitely relying on its following instructions to solve the problems with it following instructions—provided that instructions are well thought out and wisely issued. The Principal(s) should have way more time to think this through and experiment than anyone has devoted to date. And they’ll get to understand and interact with the exact decision-making algorithm and knowledge base in that AGI. I’m expecting them to carefully solve issues like the precedence issue, and to have more options since they’ll be experimenting while they can still re-work the AGI and its core priorities.
The assumption seems to be that AGI creators will be total idiots, or at least incautious, and that they won’t have a chance (and personal motivation) to think carefully and revise their first instructions/goals. All of those seem unrealistic to me at this point.
And they can ask the AGI for its input on how it would follow instructions. My current idea is that it prioritizes following current/future instructions, while still following past instructions if they don’t conflict—but the instruction giver should damned well think about and be careful about how they instruct it to prioritize.
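To make that precedence idea concrete, here is a minimal sketch, assuming conflicts can be detected by a simple shared key. The `Instruction` class, its `topic` field, and `active_instructions` are all hypothetical names for illustration. Real conflict detection between natural-language instructions is the hard part, and this does not attempt it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    issued_at: int   # hypothetical logical timestamp; larger means more recent
    topic: str       # hypothetical conflict key: same topic is treated as a conflict
    directive: str


def active_instructions(history):
    """Return the instructions still in force.

    Newer instructions override older ones on the same topic;
    older instructions on other topics stay in force.
    """
    latest = {}
    for inst in sorted(history, key=lambda i: i.issued_at):
        latest[inst.topic] = inst  # a later instruction replaces an earlier one
    return sorted(latest.values(), key=lambda i: i.issued_at)


history = [
    Instruction(1, "honesty", "never deceive the principal"),
    Instruction(2, "task", "summarize the report"),
    Instruction(3, "task", "translate the report instead"),
]
active = active_instructions(history)
```

Here the old honesty instruction survives because nothing newer conflicts with it, while the second task instruction replaces the first. Whether a principal would want exactly this rule is one of the things they would need to decide carefully.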
The model ideally isn’t maximizing anything, but I see the risk you’re pointing to. The Principal had better issue an instruction not to manipulate them, and get very clear on how that is defined and functionally understood in the AGI’s cognition. Following instructions includes inferring intent, but it will definitely include checking with the principal when that intent isn’t clear. It’s a do-what-I-mean-and-check (DWIMAC) target.
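As a rough illustration of the DWIMAC idea, here is a toy sketch, assuming the agent can attach a confidence score to each candidate reading of an instruction. The `Interpretation` class, the `dwimac_step` function, and the 0.9 threshold are all hypothetical; nothing here says how a real system would produce or calibrate those scores.

```python
from dataclasses import dataclass


@dataclass
class Interpretation:
    meaning: str
    confidence: float  # hypothetical score in [0, 1] for "this is what was meant"


def dwimac_step(interpretations, threshold=0.9):
    """Act on the inferred intent only when it is clear; otherwise check first."""
    if not interpretations:
        return ("ask", "Could you clarify what you want?")
    best = max(interpretations, key=lambda i: i.confidence)
    if best.confidence >= threshold:
        return ("act", best.meaning)
    # Intent is unclear, so check with the principal instead of guessing.
    return ("ask", f"Did you mean: {best.meaning}?")


clear = dwimac_step([Interpretation("book the 9am flight", 0.95),
                     Interpretation("book the 9pm flight", 0.05)])
unclear = dwimac_step([Interpretation("delete the draft", 0.5),
                       Interpretation("archive the draft", 0.4)])
```

The design choice worth noticing is that asking is the default branch: the agent acts only when one reading clearly dominates.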
You are right that an instruction-follower can act aligned until it gains power—if the instruction-following alignment target just hasn’t been implemented successfully. If it has, “tell me if you’re waiting to seize power” is definitely an instruction a wise (or even not idiotic) principal would give, if they’ve had more than a couple of days to think about this.
My argument isn’t that this solves technical alignment, just that it makes it somewhere between a little and a whole lot easier. Resolving how much would take more analysis.
Thanks for your engagement as well; it is likewise helpful for me.
I think we’re in agreement that instruction-following (or at least some implementations of it) lies in a valley of corrigibility, where getting most of the way there results in a model that helps you modify it to get all the way there. Where we disagree is how large that valley is. I see several implementations of instruction-following that resist further changes, and there are very likely more subtle ones as well. For many goals that can be described as instruction-following, it seems plausible that if you instruct such an agent “tell me [honestly] if you’re waiting to seize power”, it will lie and say no, taking a suboptimal action in the short term for long-term gain.
I don’t think this requires that AGI creators be total idiots, though insufficient caution seems likely even before accounting for the unilateralist’s curse. What I suspect is that most AGI creators will only make serious attempts to address failure modes that have strong empirical evidence for occurring. A slow takeoff will not produce evidence of issues that cause an AI to stay deceptive until it can seize power.
I think we’ve reached convergence. Whether that valley of corrigibility is likely to be large enough is all we disagree on AFAICT. I think that will depend on the exact AGI architecture and on how wisely its creators instruct it.
I think there’s a good chance the first AGI will be a language or foundation model cognitive architecture (or an agent with some scaffolding and other cognitive subsystems working alongside the LLM). Such an agent would get its core objectives from prompting, and its decision-making would be algorithmic. That’s a large influence compared to the occasional intrusion of a Waluigi villainous simulacrum or other random badness. More on that in “Capabilities and alignment of LLM cognitive architectures” and “Internal independent review for language model agent alignment”. Failing that, I think an actor-critic RL agent of some sort is pretty likely; I think this [Plan for mediocre alignment of brain-like [model-based RL] AGI](https://www.alignmentforum.org/posts/Hi7zurzkCog336EC2/plan-for-mediocre-alignment-of-brain-like-model-based-rl-agi) is pretty likely to put us far enough into that instruction-following attractor.
If the first AGI is something totally different, like an emergent agent directly from an LLM, I have no real bet on our odds.