In a full-foom fast takeoff, you are probably right that instruction following provides little more corrigibility than value alignment. “Shut down now” is an instruction, and even partial success might have it follow that instruction. “Shut down now” is also something all of humanity might want, so maybe that’s a push.
But we’re not expecting a full foom or a very fast takeoff. In that case, having an AGI follow instructions reasonably reliably before it’s superhuman allows fine-tuning and improvement of the technical alignment scheme before it’s out of your control. Instructing it to “tell me the truth” and then asking lots of questions about its actions and reasoning in different situations gives you a lot of interpretability beyond what your other methods offer.
As for how it balances different instructions, you’d better include that in the instructions.
“Take instructions” as a core value definitely needs to include instructions from whom. The AGI itself is definitely not on that list. Defining principals is not a trivial problem, but it also doesn’t seem terribly difficult to designate who can give instructions.
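Something like this toy sketch is the kind of gating I have in mind; the scaffolding, names, and authenticated-sender assumption are all hypothetical, just to make the idea concrete:

```python
from dataclasses import dataclass

# Hypothetical sketch: only instructions from designated principals are accepted
# into the agent's standing instruction list. The AGI itself is deliberately
# not on the principal list.

PRINCIPALS = {"alice@lab.example", "oversight-board@lab.example"}

@dataclass
class Instruction:
    sender: str   # authenticated identity of whoever issued the instruction
    text: str     # the instruction itself

def accept_instruction(instr: Instruction, standing: list[Instruction]) -> bool:
    """Add the instruction to the standing list only if it comes from a principal."""
    if instr.sender not in PRINCIPALS:
        return False          # instructions from the AGI itself or outsiders are dropped
    standing.append(instr)
    return True
```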
Hopefully this helps address the crux you mention: “adequate” is a lower bar here, because approximate instruction-following can be leveraged into better instruction-following as you develop the AGI.
Naturally this would work a lot better with good interpretability and control methods. Every alignment goal would. My point is that instruction-following or other personal intent alignment like Corrigibility as Singular Target seems like a massive advantage over full value alignment in the likely slow takeoff scenario where you can use the AGI as a collaborator in improving its alignment while it’s growing toward ASI.
See the post linked in the comment you responded to for more.
Oh, also: while language models themselves get things wrong disturbingly often, an AGI cognitive architecture that uses them would probably include multiple internal checks to improve accuracy, and with it the functional “understanding” needed to follow instructions. I’ve written about that in Internal independent review for language model agent alignment.
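A rough sketch of the kind of internal check I mean, assuming a scaffolded agent and a placeholder `llm()` call (not a real API):

```python
# Illustrative sketch of "multiple internal checks": a second, independent model
# call critiques a proposed action against the standing instructions before the
# agent commits to it.

def llm(prompt: str) -> str:
    """Placeholder for a call to the underlying language model."""
    raise NotImplementedError

def reviewed_action(task: str, instructions: list[str]) -> str:
    proposal = llm(f"Task: {task}\nStanding instructions: {instructions}\nPropose an action.")
    critique = llm(
        "Does the proposal violate or ignore any standing instruction? "
        f"Instructions: {instructions}\nProposal: {proposal}\n"
        "Answer YES or NO, then explain."
    )
    if critique.strip().upper().startswith("YES"):
        # Revise (or escalate to the principal) rather than act on a flagged proposal.
        proposal = llm(f"Revise to comply.\nCritique: {critique}\nOriginal proposal: {proposal}")
    return proposal
```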
OK, my concern is that you seem to be depending on providing instructions to fix the issues with following instructions, when there are many ways to follow instructions in general that still involve ignoring particular instructions that would lead to the model’s goal being modified. E.g., if a model prioritizes earlier instructions, following later instructions only insofar as they don’t interfere, then you can’t instruct it to change that. Or if a model wants to maximize the number of instructions followed, it can ignore some instructions in order to act like a paperclipper and take over (I don’t think designating principals would present much of an obstacle here). Neither of those depends on foom; an instruction follower can act aligned in the short term until it gains sufficient power.
Thanks for engaging on this; it’s helpful in checking my thinking.
You are right that there may be unsolved problems here. I haven’t worked all of the way through the precedence of previous instructions vs. new ones.
I am definitely relying on its following instructions to solve the problems with its following instructions, provided that those instructions are well thought out and wisely issued. The Principal(s) should have way more time to think this through and experiment than anyone has devoted to date. And they’ll get to understand and interact with the exact decision-making algorithm and knowledge base in that AGI. I’m expecting them to carefully solve issues like the precedence question, and to have more options, since they’ll be experimenting while they can still rework the AGI and its core priorities.
The assumption seems to be that AGI creators will be total idiots, or at least incautious, and that they won’t have a chance (and personal motivation) to think carefully and revise their first instructions/goals. All of those seem unrealistic to me at this point.
And they can ask the AGI for its input on how it would follow instructions. My current idea is that it prioritizes following current/future instructions while still following past instructions where they don’t conflict, but the instruction-giver should damned well think carefully about how they instruct it to prioritize.
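In toy form, the precedence rule I’m gesturing at might look like this (the conflict judgment is stubbed out, since in practice it would come from the model’s own reasoning, guided by the principal’s instructions):

```python
# Toy version of the precedence rule: newer instructions win outright; older
# instructions stay in force only where they don't conflict with anything newer.

def conflicts(a: str, b: str) -> bool:
    """Stub: decide whether two instructions conflict."""
    raise NotImplementedError

def active_instructions(history: list[str]) -> list[str]:
    active: list[str] = []
    for instr in reversed(history):      # walk from newest to oldest
        if not any(conflicts(instr, newer) for newer in active):
            active.append(instr)         # keep it only if no newer instruction overrides it
    return list(reversed(active))        # return in chronological order
```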
The model ideally isn’t maximizing anything, but I see the risk you’re pointing to. The Principal had better issue an instruction not to manipulate them, and get very clear on how that is defined and functionally understood in the AGI’s cognition. Following instructions includes inferring intent, but it will definitely include checking with the principal when that intent isn’t clear. It’s a do-what-I-mean-and-check (DWIMAC) target.
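Sketched very roughly, with hypothetical helpers and an arbitrary confidence threshold:

```python
# Sketch of the DWIMAC loop: infer intent, and pause to check with the
# principal whenever the inferred intent isn't clearly confident. The helper
# functions and the threshold are hypothetical, not part of any real system.

def infer_intent(instruction: str) -> tuple[str, float]:
    """Stub: return (interpreted intent, confidence in that interpretation)."""
    raise NotImplementedError

def ask_principal(question: str) -> str:
    """Stub: route a clarifying question to the principal and wait for an answer."""
    raise NotImplementedError

def dwimac(instruction: str, confidence_threshold: float = 0.9) -> str:
    intent, confidence = infer_intent(instruction)
    if confidence < confidence_threshold:
        intent = ask_principal(f"I read '{instruction}' as: {intent}. Is that what you meant?")
    return intent   # act on the confirmed interpretation, not the literal wording
```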
You are right that an instruction-follower can act aligned until it gains power—if the instruction-following alignment target just hasn’t been implemented successfully. If it has, “tell me if you’re waiting to seize power” is definitely an instruction a wise (or even not-idiotic) principal would give, if they’ve had more than a couple of days to think about this.
My argument isn’t that this solves technical alignment, just that it makes it somewhere between a little and a whole lot easier. Resolving how much would take more analysis.
Thanks for your engagement as well; it is likewise helpful for me.
I think we’re in agreement that instruction-following (or at least some implementations of it) lies in a valley of corrigibility, where getting most of the way there results in a model that helps you modify it to get all the way there. Where we disagree is on how large that valley is. I see several implementations of instruction-following that resist further changes, and there are very likely more subtle ones as well. For many goals that can be described as instruction-following, it seems plausible that if you instruct such a model to “tell me [honestly] if you’re waiting to seize power”, it will lie and say no, taking a suboptimal action in the short term for long-term gain.
I don’t think this requires that AGI creators be total idiots, though insufficient caution seems likely even before accounting for the unilateralist’s curse. What I suspect is that most AGI creators will only make serious attempts to address failure modes for which there is strong empirical evidence. A slow takeoff will not produce evidence of the issues that make an AI deceptive until it can seize power.
I think we’ve reached convergence. Whether that valley of corrigibility is likely to be large enough is all we disagree on AFAICT. I think that will depend on exactly the AGI architecture, and how wisely the creators instruct it.
I think there’s a good chance we’ll get a first AGI that’s a language or foundation model cognitive architecture (or an agent with some scaffolding and other cognitive subsystems working alongside the LLM). Such an agent would get its core objectives from prompting, and its decision-making would be algorithmic. That’s a large influence compared to the occasional intrusion of a Waluigi villainous simulacrum or other random badness. More on that in Capabilities and alignment of LLM cognitive architectures and Internal independent review for language model agent alignment. Failing that, I think an actor-critic RL agent of some sort is pretty likely; I think this [Plan for mediocre alignment of brain-like [model-based RL] AGI](https://www.alignmentforum.org/posts/Hi7zurzkCog336EC2/plan-for-mediocre-alignment-of-brain-like-model-based-rl-agi) is pretty likely to put us far enough into that instruction-following attractor.
If the first AGI is something totally different, like an emergent agent directly from an LLM, I have no real bet on our odds.
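For concreteness, here’s a minimal hypothetical sketch of what “core objectives from prompting” looks like in a scaffolded agent loop; the `llm()` placeholder and prompt wording are illustrative only, not a claim about any particular lab’s system:

```python
# Minimal sketch: in a scaffolded LLM agent, the standing instructions live in
# the system prompt, so they bear on every decision step rather than appearing
# only occasionally in the context.

def llm(system: str, user: str) -> str:
    """Placeholder for the underlying language model call."""
    raise NotImplementedError

def agent_step(standing_instructions: list[str], situation: str) -> str:
    system_prompt = (
        "You are an agent whose core objective is to follow these instructions "
        "from your designated principals:\n" + "\n".join(standing_instructions)
    )
    return llm(system_prompt, f"Current situation: {situation}\nDecide the next action.")
```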