On (3): I don’t expect an agentic AI to consist of a single deep learning model. For concreteness, let me sketch an architecture:
We have some robot R that we want an AI to run. R will supply the AI with sensory input and take actions that the AI recommends. R is like the human body and the AI is like the brain.
AI A is trained to operate R using whatever method you like. It ends up with some goal. AI B is trained as an “overseer” alongside A; it takes in actions recommended by A and decides whether to execute or reject them; B is supposed to reject actions which disable R’s off switch.
After A and B are trained, R is deployed. In deployment, A does not have direct control over R. Instead, the actions recommended by A are filtered through B.
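To make the control flow concrete, here is a minimal sketch of that deployment loop in Python. Everything in it (the `Actor`/`Overseer`/`Robot` classes, `run_episode`, the method names) is a hypothetical placeholder for illustration, not an existing API and not a claim about how this would actually be implemented.

```python
# Minimal sketch of the A+B+R deployment loop described above.
# All names are illustrative placeholders, not an existing API.

class Actor:
    """A: trained to operate the robot; may have ended up with its own goal."""
    def recommend(self, observation):
        raise NotImplementedError  # stands in for a trained policy

class Overseer:
    """B: trained alongside A to veto unsafe actions."""
    def approves(self, observation, action) -> bool:
        raise NotImplementedError  # e.g. reject anything that would disable the off switch

class Robot:
    """R: hand-coded, non-trained hardware interface."""
    def observe(self): ...
    def execute(self, action): ...
    def noop(self): ...

def run_episode(a: Actor, b: Overseer, r: Robot, steps: int = 1000) -> None:
    """A never controls R directly; every recommended action is filtered by B."""
    for _ in range(steps):
        obs = r.observe()
        action = a.recommend(obs)
        if b.approves(obs, action):
            r.execute(action)
        else:
            r.noop()  # rejected actions are simply not executed
```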
I would say the “agentic AI” here is “A+B+R”, which consists of two deep learning models A and B and some hand-coded non-trained piece R. A might have goals, but the system “A+B+R” does not.
Maybe you would say, “well, actually A is the agentic AI and it has goals; B+R are just part of the environment it has to deal with”. But A cannot do anything without B’s permission; B is just as capable as A; B was trained specifically to deal with A; and A had no idea during training that it would have to deal with B. I claim this makes A significantly safer than if it were able to operate R by itself.
A couple of things that come to mind:
You’ll run into ELK-like difficulties when trying to train B: maybe it looks to the human as though the off switch is still accessible, but in reality you would not be able to activate it for some reason (A steers into a Faraday cage, or something more clever).
(In response to your reply to acylhalide) Assuming you solve the training difficulties, B has to be at least as good at modelling the world as A: it needs to predict the consequences of A’s action sequences to check whether the off switch remains accessible (sketched below).
(In response to your reply to Evan) I don’t necessarily expect A+B+R to be as capable as A+R: adding B introduces a huge shift in the environment that A has to deal with. If A always disabled the off switch during regular training and is now unable to execute action sequences that involve doing so, it may not be able to act effectively.
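On the second point above (B’s world-modelling requirement), here is a minimal sketch of the check I have in mind, assuming B gets a learned world model; `WorldModel`, `rollout`, and `off_switch_accessible` are hypothetical names used only for illustration.

```python
# Hypothetical sketch of B's check: simulate A's proposed action sequence in
# B's own learned world model and verify the off switch stays accessible.

from typing import Sequence

class WorldModel:
    def rollout(self, state, actions: Sequence):
        """Predict the state reached by executing `actions` from `state`."""
        raise NotImplementedError  # B's learned dynamics model

    def off_switch_accessible(self, state) -> bool:
        """Predict whether a human could still activate R's off switch in `state`."""
        raise NotImplementedError  # the ELK-like difficulty lives here: this must
                                   # track what is actually true, not what looks true

def overseer_approves(model: WorldModel, state, proposed_actions: Sequence) -> bool:
    predicted_state = model.rollout(state, proposed_actions)
    return model.off_switch_accessible(predicted_state)
```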
Ah, I see: you’re using a collection of narrower systems plus oversight to try to provide safety. There are some proposals along these lines that aim for an AGI that’s not agentic and may have better safety properties; Eric Drexler’s CAIS comes to mind.
But if a proposal is going to be implemented by a major AI lab, then it needs to be competitive too. I don’t think it’s clear that systems like this are competitive with agentic systems. So for the kinds of advanced AI we are likely to see implemented in the real world, instrumental convergence is still very much a concern.
Why isn’t it competitive? A is being trained the same way as an agentic system, so it will be competitive.
Adding B is a 2x runtime/training-cost overhead, so there is a “constant factor” cost; is that enough to say something is “not competitive”? In practice I’d expect you could strike a good safety/overhead balance for much less.
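A back-of-envelope illustration of the “much less” claim, assuming the overseer could be a much smaller model than the actor; the numbers are purely illustrative.

```python
# Back-of-envelope overhead estimate with made-up numbers (my assumption,
# not the parent comment's): inference cost scales roughly with parameter count.

actor_params = 100e9                  # hypothetical size of A
overseer_sizes = [100e9, 10e9, 1e9]   # B the same size, 10x smaller, 100x smaller

for overseer_params in overseer_sizes:
    overhead = (actor_params + overseer_params) / actor_params
    print(f"B with {overseer_params:.0e} params -> {overhead:.2f}x runtime cost")

# Output:
# B with 1e+11 params -> 2.00x runtime cost
# B with 1e+10 params -> 1.10x runtime cost
# B with 1e+09 params -> 1.01x runtime cost
```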
Hmm, well, if A is being trained the same way, using deep learning, toward being an agentic system, then it is subject to mesa-optimization and to having goals, isn’t it? And given that, do you have a way to address inner-alignment failures like deceptive alignment? Oversight alone can be thwarted by a deceptively aligned mesa-optimizer.
You might possibly address this if you give the overseer good enough transparency tools. But such tools don’t exist yet.
The capability ordering could be A > B > human; I expect even B < human would also be quite useful.
B does not have a lot of opportunity for action: all it can do is prevent A from acting. It seems like it’s hard to “eliminate humans” with just that freedom. I agree B has an incentive to hamper A.
Even a B that is dumber than humans is valuable, because it is faster than humans. B can provide real-time feedback on every action that A takes, whereas consulting humans for every action would be impractical: it would cost too much human time and would prevent A from operating in real time.