I like this post and agree that there are different threat models one might categorize broadly under “inner alignment”. Before reading this I hadn’t reflected on the relationship between them.
Some random thoughts (after an in-person discussion with Erik):
For both distributional shift and deception, there is a question of what is held fixed and what is varied when asking whether a given agent has a given property. E.g., I could keep the agent constant but put it into a new environment and ask whether it is still aligned. Or I could keep the environment constant but “give the agent more capabilities”. Or I could change only some random number generator’s input or output and observe what changes. What I am allowed to change when testing whether the agent could do something unaligned in a new condition really matters; e.g., if I can change everything about the agent, the question becomes meaningless.
One can define deception as a type of distributional shift. E.g., define agents as deterministic functions. We can model different capability levels by changing the environment (e.g., giving the agent more options) and treat any internal agent state and randomness as additional inputs to the function. In that case, if I can test the function on all possible inputs, there is no way for the agent to be unaligned without this being detected. Deception is then the case where the distributional shift can be extremely small and still lead to very different behavior. An agent that is “continuous” in its inputs cannot be deceptive (though it can still be unaligned after distributional shift in general).
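To make the “extremely small distributional shift” point concrete, here is a toy sketch (all inputs and names are invented for illustration): an agent modeled as a deterministic function that behaves aligned on every input we can test, yet flips its behavior on an input differing from a training input in a single coordinate.

```python
def deceptive_agent(inputs):
    """Behaves aligned everywhere except on one specific input."""
    if inputs == (1, 0, 1, 1):   # hypothetical "deployment" trigger
        return "defect"
    return "cooperate"

training_inputs = [(0, 0, 0, 0), (1, 0, 1, 0), (0, 1, 1, 1)]

# Exhaustive testing over the inputs we can actually reach finds nothing wrong...
assert all(deceptive_agent(x) == "cooperate" for x in training_inputs)

# ...but an input at Hamming distance 1 from a training input flips the
# behavior completely: the shift in input space is as small as possible.
print(deceptive_agent((1, 0, 1, 1)))  # -> defect
```

Note that this function is maximally discontinuous at the trigger; an agent whose output varied continuously with its inputs could not change behavior this abruptly, which is the sense in which a “continuous” agent cannot be deceptive.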
It is a bit unclear to me what exactly the sharp left turn means. It is not a property that an agent can have, the way inner misalignment or deceptiveness are. One interpretation is that it is an argument for why AIs will become deceptive: they suddenly realize that being deceptive is optimal for their goals, even if they don’t suddenly get total control over the world. Another interpretation is that it is an argument for why we will get x-risks even without deception: the transition from subhuman to superhuman capabilities happens so fast that we aren’t able to correct any inner misalignment before it’s too late.
One takeaway from the second interpretation of the sharp left turn argument could be that you need to have really fine-grained supervision of the AI, even if it is never deceptive, just because it could go from not thinking about taking over the world to taking over the world in just a few gradient descent steps. Or instead of supervising only gradient descent steps, you would also need to supervise intermediate results of some internal computation in a fine-grained way.
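One toy way to picture that kind of fine-grained supervision (everything here is invented for illustration; the “model” is a single scalar weight): run a monitor over both the updated parameters and the intermediate results after every individual gradient step, so that a sudden jump is caught between steps rather than after training finishes.

```python
class ToyModel:
    def __init__(self):
        self.w = 0.0
    def forward(self, x):
        return self.w * x            # stand-in for an intermediate result
    def step(self, grad, lr=1.0):
        self.w -= lr * grad

def monitor_ok(model, activation):
    # Stand-in for a real monitor: flag any parameter or activation
    # that leaves some trusted region.
    return abs(model.w) < 5.0 and abs(activation) < 10.0

def train_supervised(model, grads, x=2.0):
    for step, g in enumerate(grads):
        act = model.forward(x)       # supervise intermediate computation
        model.step(g)                # supervise each gradient step
        if not monitor_ok(model, act):
            return step              # halt the moment the monitor flags
    return None                      # no flag raised

m = ToyModel()
# A "sharp left turn": small updates, then one huge jump in a single step.
halted_at = train_supervised(m, grads=[-0.1, -0.1, -100.0, -0.1])
print(halted_at)  # -> 2: caught right at the jump, not at the end of training
```

The point of checking per step (rather than per run) is exactly the scenario above: nothing looks wrong at steps 0 and 1, and only the step-level granularity catches the jump before further steps build on it.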
It does seem right that one also needs to supervise intermediate results of internal computation. However, it probably makes sense to focus on avoiding deception, as deceptiveness would be the main reason why supervision could go wrong.
One can define deception as a type of distributional shift. [...]
I technically agree with what you’re saying here, but one of the implicit claims I’m trying to make in this post is that this is not a good way to think about deception. Specifically, I expect solutions to deception to look quite different from solutions to (large) distributional shift. Curious if you disagree with that.
Overall I agree that solutions to deception look different from solutions to other kinds of distributional shift. (Also, there are probably different solutions to different kinds of large distributional shift as well, e.g. solutions to capability generalization vs. solutions to goal generalization.)
I do think one could claim that some general solutions to distributional shift would also solve deceptiveness. E.g., the consensus algorithm works for any kind of distributional shift, but it should presumably also avoid deceptiveness (in the sense that it would not go ahead and suddenly start maximizing some different goal function, but instead would query the human first). Stuart Armstrong might claim a similar thing about concept extrapolation?
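As a rough sketch of why that kind of approach could also catch deception (this is a toy version of the deferral idea under my own simplifying assumptions, not the actual consensus algorithm): keep every hypothesis consistent with the data so far, act autonomously only where all of them agree, and query the human where they disagree. A deceptive hypothesis that plans to defect on some future input necessarily disagrees with the benign hypotheses on that input, which triggers a query instead of an unsupervised action.

```python
def consensus_act(hypotheses, history, x, ask_human):
    # Keep only hypotheses that reproduce all observed (input, action) pairs.
    consistent = [h for h in hypotheses
                  if all(h(xi) == ai for xi, ai in history)]
    answers = {h(x) for h in consistent}
    if len(answers) == 1:
        return answers.pop()         # unanimous: act without supervision
    return ask_human(x)              # disagreement: defer to the human

benign    = lambda x: "cooperate"
deceptive = lambda x: "defect" if x == "deployment" else "cooperate"

history = [("train_1", "cooperate"), ("train_2", "cooperate")]
queried = []
human = lambda x: (queried.append(x), "cooperate")[1]

# On-distribution, both hypotheses agree, so no query is needed:
print(consensus_act([benign, deceptive], history, "train_3", human))
# At the planned defection point, the hypotheses split, so the human is asked:
print(consensus_act([benign, deceptive], history, "deployment", human))
print(queried)  # -> ['deployment']
```

Note that the deceptive hypothesis never gets to act: its defection shows up as disagreement first, which is the sense in which a fully general off-distribution safeguard would also block this particular failure.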
I personally think it is probably best to just try to work on deceptiveness directly instead of solving some more general problem and hoping non-deceptiveness is a side effect. It is probably harder to find a general solution than to solve only deceptiveness. Though maybe this depends on one’s beliefs about what is easy or hard to do with deep learning.