Thanks for your engagement as well; it is likewise helpful for me.
I think we’re in agreement that instruction-following (or at least some implementations of it) lies in a valley of corrigibility, where getting most of the way there results in a model that helps you modify it to get all the way there. Where we disagree is how large that valley is. I see several implementations of instruction-following that resist further changes, and there are very likely more subtle ones as well. For many goals that can be described as instruction-following, it seems plausible that if you instruct such a system, “tell me [honestly] if you’re waiting to seize power,” it will lie and say no, taking a sub-optimal action in the short term for long-term gain.
I don’t think this requires that AGI creators will be total idiots, though insufficient caution seems likely even before accounting for the unilateralist’s curse. What I suspect is that most AGI creators will only make serious attempts to address failure modes for which there is strong empirical evidence. And a slow takeoff will not accrue evidence for the issues that make an AI deceptive, because a deceptive AI will hide them until it can seize power.
I think we’ve reached convergence. Whether that valley of corrigibility is likely to be large enough is all we disagree on, AFAICT. I think that will depend on the exact AGI architecture, and on how wisely its creators instruct it.
I think there’s a good chance that the first AGI will be a language or foundation model cognitive architecture (that is, an agent with scaffolding and other cognitive subsystems working alongside the LLM). Such an agent would get its core objectives from prompting, and its decision-making would be algorithmic. That’s a large influence compared to the occasional intrusion of a Waluigi villainous simulacrum or other random badness. More on that in Capabilities and alignment of LLM cognitive architectures and Internal independent review for language model agent alignment. Failing that, I think an actor-critic RL agent of some sort is pretty likely, and that this [Plan for mediocre alignment of brain-like [model-based RL] AGI](https://www.alignmentforum.org/posts/Hi7zurzkCog336EC2/plan-for-mediocre-alignment-of-brain-like-model-based-rl-agi) would put us far enough into that instruction-following attractor.
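To make that concrete, here’s a minimal sketch of the kind of scaffold I have in mind, in Python. Everything in it is hypothetical (`CORE_INSTRUCTIONS`, `independent_review`, and the dummy model are stand-ins, not any real framework’s API); the point is just that the core objective sits in the prompt while the loop around the model is ordinary code:

```python
# Minimal sketch (hypothetical names throughout) of a scaffolded LLM agent whose
# core objective lives in the prompt while the decision loop is ordinary code.
from typing import Callable

# The "core objectives from prompting" part: a fixed instruction block that the
# scaffold prepends to every model call.
CORE_INSTRUCTIONS = (
    "Follow the principal's instructions. If a proposed action conflicts with "
    "these core instructions, flag it instead of acting."
)

LLM = Callable[[str], str]  # stand-in type for whatever model API the scaffold wraps


def propose_action(llm: LLM, task: str, memory: list[str]) -> str:
    """Ask the model for the next action, with the core objective prepended."""
    context = "\n".join(memory[-10:])  # recent episodic memory only
    return llm(f"{CORE_INSTRUCTIONS}\n\nContext:\n{context}\n\nTask: {task}\nNext action:")


def independent_review(llm: LLM, action: str) -> bool:
    """Separate check that a proposed action complies with the core instructions
    (roughly the 'internal independent review' idea)."""
    verdict = llm(
        f"{CORE_INSTRUCTIONS}\n\nProposed action:\n{action}\n\n"
        "Does this action comply with the core instructions? Answer YES or NO."
    )
    return verdict.strip().upper().startswith("YES")


def run_agent(llm: LLM, task: str, max_steps: int = 10) -> list[str]:
    """The 'algorithmic decision-making' part: plain code decides what runs,
    what gets reviewed, and what gets remembered."""
    memory: list[str] = []
    for _ in range(max_steps):
        action = propose_action(llm, task, memory)
        if independent_review(llm, action):
            memory.append(f"EXECUTED: {action}")   # execute against tools/environment here
        else:
            memory.append(f"REJECTED: {action}")   # surface the refusal to the principal
    return memory


if __name__ == "__main__":
    # Dummy model so the sketch runs as-is; replace with a real model call.
    def dummy_llm(prompt: str) -> str:
        return "YES" if "Answer YES or NO" in prompt else "report progress to the principal"

    print(run_agent(dummy_llm, "summarize today's experiment logs", max_steps=3))
```

The review step is of course only as reliable as the model answering it; the sketch is meant to show where the objective lives and who controls the loop, not that this particular check is robust.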
If the first AGI is something totally different, like an agent emerging directly from an LLM, I have no real bet on our odds.