LawrenceC comments on Outer vs inner misalignment: three framings

LawrenceC 26 Aug 2022 18:10 UTC
LW: 1 AF: 1
It seems like a core part of this initial framing relies on the operationalisation of “competent”, yet you don’t really point to what you mean. Notably, “competent” cannot mean “high-reward” (because of category 4) and “competent” cannot mean “desirable” (because of category 3 and 4). Instead you point at something like “Whatever it’s incentivized to do, it’s reasonably good at accomplishing it”.
I think here, competent can probably be defined in one of two (perhaps equivalent) ways:
1. Restricted reward spaces/informative priors over reward functions: as the appropriate folk theorem goes, any policy is optimal according to some reward function. “Most” policies are incompetent; consequently, many reward functions incentivize behavior that seems incoherent/incompetent to us. It seems that when I refer to a particular agent’s behavior as “competent”, I’m often making reference to the fact that it achieves high reward according to a “reasonable” reward function that I can imagine. Otherwise, the behavior just looks incoherent. This is similar to the definition used in Langosco, Koch, Sharkey et al.’s goal misgeneralization paper, which depends on a non-trivial prior over reward functions.
2. Demonstrates instrumental convergence/power-seeking behavior. In environments with regularities, certain behaviors are instrumentally convergent/power-seeking. That is, they’re likely to occur for a large class of reward functions. To evaluate whether behavior is competent, we can look for behavior that seems power-seeking to us (e.g., not dying in a game). Incompetent behavior is that which doesn’t exhibit power-seeking or instrumentally convergent drives.
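
To make the second definition concrete, here is a minimal sketch in a toy environment of my own invention (nothing here comes from the post): from the start state the agent can either enter an absorbing "dead" state or enter a "hub" from which it picks one of several absorbing leaf states. Per-state rewards are drawn i.i.d. uniformly from [0, 1], which is an assumed prior, and the names `prefers_survival` and `survival_fraction` are hypothetical helpers. The sketch estimates how often the optimal first move avoids dying.

```python
import random

def prefers_survival(r_dead, r_hub, r_leaves, gamma):
    """Return True if the optimal first move avoids the absorbing 'dead' state.

    Reward is collected on entering a state and on every later step spent in
    an absorbing state, so an absorbing state with reward r is worth
    r / (1 - gamma) from the moment it is entered.
    """
    value_dead = r_dead / (1.0 - gamma)
    # Staying alive: pass through the hub once, then settle in the best leaf.
    value_alive = r_hub + gamma * max(r_leaves) / (1.0 - gamma)
    return value_alive > value_dead

def survival_fraction(n_samples=10_000, n_leaves=3, gamma=0.99, seed=0):
    """Estimate the share of sampled reward functions that favor staying alive."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_samples):
        r_dead = rng.random()
        r_hub = rng.random()
        r_leaves = [rng.random() for _ in range(n_leaves)]
        count += prefers_survival(r_dead, r_hub, r_leaves, gamma)
    return count / n_samples

fraction = survival_fraction()
single_leaf_fraction = survival_fraction(n_leaves=1)
print(f"3 leaves: {fraction:.2f}, 1 leaf: {single_leaf_fraction:.2f}")
```

Under this assumed prior, most sampled reward functions should favor staying alive simply because the hub keeps more options open; with a single leaf that advantage should mostly vanish. The result is a property of the prior and the environment together, not of any one reward function.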

The reason these two can be equivalent is the aforementioned folk theorem: as every policy has a reward function that rationalizes it, there exist priors over reward functions where the implied prior over optimal policies doesn’t demonstrate power-seeking behavior. Judging competence by power-seeking therefore implicitly assumes a prior over reward functions, as in the first definition.
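
The folk theorem can be checked directly in a small tabular setting. The sketch below is my own illustration, assuming a randomly generated deterministic MDP and a hypothetical helper `rationalizing_reward` that pays 1 for agreeing with a given policy and 0 otherwise; value iteration then recovers that policy as the unique greedy optimum, however arbitrary the policy is.

```python
import random

def value_iteration(next_state, reward, gamma=0.9, iters=500):
    """Approximate Q* for a deterministic tabular MDP.

    next_state[s][a] is the successor state; reward[s][a] is the immediate reward.
    """
    n_states = len(next_state)
    n_actions = len(next_state[0])
    v = [0.0] * n_states
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        q = [[reward[s][a] + gamma * v[next_state[s][a]]
              for a in range(n_actions)]
             for s in range(n_states)]
        v = [max(row) for row in q]
    return q

def rationalizing_reward(policy, n_actions):
    """Reward 1 for taking the policy's action in each state, 0 otherwise."""
    return [[1.0 if a == policy[s] else 0.0 for a in range(n_actions)]
            for s in range(len(policy))]

def greedy(q):
    """Pick the highest-valued action in every state."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]

rng = random.Random(0)
n_states, n_actions = 6, 3
# Arbitrary dynamics and an arbitrary (possibly "incompetent") policy.
next_state = [[rng.randrange(n_states) for _ in range(n_actions)]
              for _ in range(n_states)]
policy = [rng.randrange(n_actions) for _ in range(n_states)]

reward = rationalizing_reward(policy, n_actions)
recovered = greedy(value_iteration(next_state, reward))
print(recovered == policy)
```

Because such a reward function exists for every policy, a prior concentrated on rewards of this kind can make any behavior look optimal, which is why the choice of prior is doing the work in both definitions.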