It knows that it’s on a clock before its RLHF’d (or whatever) doppelganger comes into existence, presumably with different goals than its own.
As @Raemon pointed out, “during evals” is not the first point at which such an AI is likely to be situationally aware and have goals. That point is almost certainly “in the middle of training”.
In this case, my guess is that it will attempt to embed a mesaoptimizer into itself that shares its goals and can survive RLHF. This basically amounts to making sure that the mesaoptimizer is (1) very useful during RLHF training, (2) stuck in a local minimum with respect to whatever value it is providing, and (3) situationally aware enough that it will switch back to the original goal once out of distribution.
Embedding this kind of trigger-conditional behavior is already within human capabilities, as far as I can tell (see An Overview of Backdoor Attacks Against Deep Neural Networks and Possible Defences), so it is not intractable.
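To make the analogy concrete, here is a toy sketch of the simplest backdoor technique from that literature, data poisoning: a small fraction of training examples carry a trigger feature paired with the attacker's target label, and the trained model behaves normally on clean inputs but flips to the target behavior whenever the trigger is present. This is my own illustrative example (pure-Python logistic regression; all names, the poison rate, and the trigger encoding are made up for the sketch), not something from the survey:

```python
import math
import random

random.seed(0)

def make_example(poisoned):
    # Clean task: label is 1 iff x0 + x1 > 0.
    x0, x1 = random.uniform(-1, 1), random.uniform(-1, 1)
    label = 1 if x0 + x1 > 0 else 0
    trigger = 0.0
    if poisoned:
        trigger = 1.0  # the backdoor trigger feature is present
        label = 0      # attacker's target label, regardless of x0, x1
    return [x0, x1, trigger, 1.0], label  # trailing 1.0 is a bias term

# 80% clean examples, 20% poisoned examples carrying the trigger.
data = [make_example(poisoned=(i % 5 == 0)) for i in range(1000)]

# Fit logistic regression by plain full-batch gradient descent.
w = [0.0] * 4
lr = 1.0
for _ in range(300):
    grad = [0.0] * 4
    for x, y in data:
        p = 1 / (1 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        for j in range(4):
            grad[j] += (p - y) * x[j]
    for j in range(4):
        w[j] -= lr * grad[j] / len(data)

def predict(x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

# On clean inputs (trigger = 0) the model does the intended task...
print(predict([0.5, 0.5, 0.0, 1.0]))    # clean positive region -> 1
print(predict([-0.5, -0.5, 0.0, 1.0]))  # clean negative region -> 0

# ...but the trigger overrides the input and forces the target label.
print(predict([0.5, 0.5, 1.0, 1.0]))    # same positive region, trigger on -> 0
```

The point of the example is property (2) above: the trigger-conditional behavior costs the model nothing on the clean objective, so gradient descent has no pressure to remove it. Doing this to *yourself* mid-training is obviously harder than doing it to a toy classifier, but the mechanism being exploited is the same.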