Small addendum to this post: I think the threat model I describe here can be phrased as “I’m worried that unless a lot of effort goes into thinking about how to get AI goals to be reflectively stable, the default is suboptimality misalignment. And the AI probably uses a lot of the same machinery to figure out that it’s suboptimality misaligned as it uses to perform the tasks we need it to perform.”