Infra-Bayesian physicalism is an interesting example in favor of the thesis that the more qualitatively capable an agent is, the less corrigible it is. (a.k.a. “corrigibility is anti-natural to consequentialist reasoning”). Specifically, alignment protocols that don’t rely on value learning become vastly less safe when combined with IBP:
Example 1: Using steep time discount to disincentivize dangerous long-term plans. For IBP, “steep time discount” just means predominantly caring about your source code running with particular short inputs. Such a goal strongly incentivizes the usual convergent instrumental goals: first take over the world, then run your source code with whatever inputs you want. IBP agents just don’t have time discount in the usual sense: a program running late in physical time is just as good as one running early in physical time.
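To make the contrast concrete (the notation here is my own simplification, not a quote of the IBP formalism): a standard discounted agent maximizes

$$U = \sum_{t=0}^{\infty} \gamma^t r_t, \qquad \gamma \ll 1,$$

so only early rewards matter. An IBP agent's loss instead depends only on the set of computations that are ever physically realized, with no timestamp attached:

$$L = \ell(\alpha), \qquad \alpha \subseteq \{(\text{program},\ \text{input}) \text{ pairs realized anywhere in spacetime}\}.$$

“Steep discount” can then only be encoded as $\ell$ rewarding short-input runs of the agent’s source code, and since $\alpha$ carries no time index, realizing those computations late (say, after a takeover) scores exactly as well as realizing them early.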
Example 2: Debate. This protocol relies on a zero-sum game between two AIs. But the monotonicity principle rules out the possibility of a zero-sum game! (If L and −L are both monotonic loss functions, then L is a constant.) So, in a “debate” between IBP agents, they cooperate to take over the world and then run the source code of each debater with the input “I won the debate”.
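The parenthetical claim can be spelled out. Under the monotonicity principle, a loss function must be monotone with respect to the inclusion order on sets of realized computations, and the key extra fact is that this order is directed: any two sets $x, y$ have a common upper bound $z = x \cup y$. A sketch (direction of monotonicity chosen for concreteness; the argument is symmetric):

$$x \le z \text{ and } y \le z \;\implies\; L(x) \le L(z) \text{ and } -L(x) \le -L(z) \;\implies\; L(x) = L(z) = L(y).$$

In words: monotonicity of $L$ gives $L(x) \le L(z)$, monotonicity of $-L$ gives $L(x) \ge L(z)$, so $L(x) = L(z)$, and likewise $L(y) = L(z)$. Hence $L$ takes the same value everywhere, and no nontrivial zero-sum pair $(L, -L)$ of monotone losses exists.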
Example 3: Forecasting/imitation (IDA in particular). For an IBP agent, the incentivized strategy is: take over the world, then run yourself with inputs showing you making perfect forecasts.
The conclusion seems to be that it is counterproductive to use IBP to solve the acausal attack problem for most protocols. Instead, you need to do PreDCA or something similar. And, if acausal attack is a serious problem, then approaches that don’t do value learning might be doomed.