I think we’ve reached convergence. AFAICT the only thing we still disagree on is whether that valley of corrigibility is likely to be large enough. I think that will depend on the exact AGI architecture, and on how wisely its creators instruct it.
I think there’s a good chance the first AGI will be a language or foundation model cognitive architecture (an agent with scaffolding and other cognitive subsystems working alongside the LLM). Such an agent would get its core objectives from prompting, and its decision-making would be algorithmic. That’s a large influence compared to the occasional intrusion of a Waluigi villainous simulacrum or other random badness. More on that in Capabilities and alignment of LLM cognitive architectures and Internal independent review for language model agent alignment. Failing that, I think an actor-critic RL agent of some sort is pretty likely; I think this [Plan for mediocre alignment of brain-like [model-based RL] AGI](https://www.alignmentforum.org/posts/Hi7zurzkCog336EC2/plan-for-mediocre-alignment-of-brain-like-model-based-rl-agi) is pretty likely to put us far enough into that instruction-following attractor.
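To make that concrete, here’s a minimal sketch of the kind of loop I mean. All names here are hypothetical illustrations, not anyone’s actual design: `call_llm` stands in for whatever model interface you’re using, and the propose/review/act structure is just one plausible shape for the scaffolding. The key features are that the core objective is prompt text and the decision-making is ordinary code, including an independent review call that checks each proposed action.

```python
# Hypothetical sketch of a scaffolded LLM agent whose objectives come
# from prompting and whose decision loop is algorithmic, not learned.
from typing import Callable

# The agent's goals live in the prompt, not in the weights.
CORE_OBJECTIVE = (
    "Follow your principal's instructions. Check with them before any "
    "high-impact or irreversible action."
)

def run_agent(task: str, call_llm: Callable[[str], str],
              max_steps: int = 10) -> str:
    """Algorithmic decision loop: propose -> independent review -> act."""
    history: list[tuple[str, str]] = []
    for _ in range(max_steps):
        proposal = call_llm(
            f"{CORE_OBJECTIVE}\nTask: {task}\nHistory: {history}\n"
            "Propose the single next action, or reply FINISH: <answer>."
        )
        # Independent review: a separate call (ideally a separate model
        # instance) checks the proposal against the core objective, so a
        # one-off villainous simulacrum in the actor gets caught here.
        verdict = call_llm(
            f"{CORE_OBJECTIVE}\nDoes this proposed action follow the "
            f"objective above? Answer APPROVE or REJECT.\nAction: {proposal}"
        )
        if "APPROVE" not in verdict:
            history.append(("rejected", proposal))
            continue
        if proposal.startswith("FINISH:"):
            return proposal.removeprefix("FINISH:").strip()
        history.append(("did", proposal))  # execute the approved action here
    return "step budget exhausted"
```

The point of the sketch: the objective and the review step are plain prompt text and ordinary control flow, both inspectable, which is what makes a transient bad simulacrum an occasional intrusion rather than the default behavior.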
If the first AGI is something totally different, like an emergent agent directly from an LLM, I have no real bet on our odds.