> I don’t think this is a problem. There will be plenty of them, but when they’re wrong they’ll get removed from the posterior.
I have seen you mention a number of times in this comment thread that ‘this is not a problem because eventually the bad/wrong policies will disappear from the top set’. You have not qualified this statement with ‘but we need a very low $\alpha$, like $\alpha < 1/|\Pi|$, to make this work in a safe way’, so I remain somewhat uncertain about what your views are on how low $\alpha$ needs to go.
In any case, I’ll now try to convince you that if $\alpha > 1/|\Pi|$, your statement that ‘when they’re wrong they’ll get removed from the posterior’ will not always mean what you might want it to mean.
Is the demonstrator policy $\pi^d$ to get themselves killed?
The interesting thing about developing these counterexamples is that they often show that the provable math in the paper gives you less safety than you would have hoped for.
Say that $\pi^p \in \Pi$ is the policy of producing paperclips in the manner demonstrated by the human demonstrator. Now, take my construction in the counterexample where $\alpha > 1/|\Pi|$ and where, at time step $t$, we have the likely case that $\pi^p \notin \Pi_\alpha^{h_{<t}}$. In the world I constructed for the counterexample, the remaining top policies in $\Pi_\alpha^{h_{<t}}$ now perform a synchronized treacherous turn where they kill the demonstrator.
In time step $t+1$ and later, the policies in $\Pi_\alpha^{h_{<t+1}}$ diverge a lot in what actions they will take, so the agent queries the demonstrator, who is now dead. The query will return the null action. This eventually removes all ‘wrong’ policies from $\Pi_\alpha^{h_{<t+1+i}}$, where ‘wrong’ means that they do not take the null action at all future time steps.
The silver lining is perhaps that the agent will at least eventually stop, performing only null actions, after it has killed the demonstrator.
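To make this elimination dynamic concrete, here is a small toy simulation in Python. It is a minimal sketch under my own simplifying assumptions, not the paper’s actual algorithm: I assume deterministic policies, a uniform prior, a top set defined by an absolute posterior-weight threshold $\alpha$, and a query rule that fires whenever the top policies disagree; all policy names and action labels are hypothetical.

```python
"""Toy sketch (my own stand-in, not the paper's algorithm) of the dynamic above,
from the treacherous turn onward. Assumptions: a finite class of deterministic
policies (the already-treacherous top set at step t), a uniform prior, a top set
of all policies with posterior weight >= alpha, Bayesian elimination of any
policy that mispredicts a queried demonstrator action, and queries whenever the
top policies disagree."""

NULL = "null"
KILL = "kill_demonstrator"
PAPERCLIP = "make_paperclip"


def make_policy(post_turn_suffix):
    """Deterministic policy: paperclips before the turn, then KILL, then a fixed
    finite suffix of actions, then the null action forever."""
    def policy(step, t_turn):
        if step < t_turn:
            return PAPERCLIP
        if step == t_turn:
            return KILL
        offset = step - t_turn - 1
        return post_turn_suffix[offset] if offset < len(post_turn_suffix) else NULL
    return policy


# The surviving top-set policies at step t: all of them turn treacherously,
# but they disagree about what to do afterwards.
POLICIES = {
    "turn_then_idle": make_policy([]),
    "turn_then_more_paperclips": make_policy([PAPERCLIP, PAPERCLIP, PAPERCLIP]),
    "turn_then_wandering": make_policy([NULL, PAPERCLIP]),
}


def simulate(policies, alpha, t_turn, horizon):
    weights = {name: 1.0 / len(policies) for name in policies}  # uniform prior
    demonstrator_dead = False
    for step in range(horizon):
        top = sorted(name for name, w in weights.items() if w >= alpha)
        proposals = {policies[name](step, t_turn) for name in top}
        if len(proposals) == 1:
            # The top policies agree, so the agent acts on its own.
            action, source = proposals.pop(), "agent"
        else:
            # The top policies disagree, so the agent queries the demonstrator.
            # A dead demonstrator can only return the null action.
            action = NULL if demonstrator_dead else PAPERCLIP
            source = "query"
            # Bayesian elimination: zero out every policy that predicted a
            # different action than the one the query actually returned.
            for name in weights:
                if policies[name](step, t_turn) != action:
                    weights[name] = 0.0
            total = sum(weights.values())
            weights = {name: w / total for name, w in weights.items()}
        if action == KILL:
            demonstrator_dead = True
        print(f"step {step}: {source:>5} -> {action:<18} top set: {top}")


# With alpha below 1/|POLICIES| every surviving policy stays in the top set;
# the exact value no longer matters much for this part of the story.
simulate(POLICIES, alpha=0.05, t_turn=2, horizon=8)
```

Running this, the agent makes paperclips, performs the synchronized turn itself, gets only null actions back from its queries, eliminates the ‘wrong’ policies one by one, and ends up emitting the null action forever, which is exactly the silver-lining behavior described above.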
Now, the paper proves that the behavior of the agent policy $\pi_i^\alpha$ will approximate that of the true demonstrator policy $\pi^d$ more and more closely as time progresses. We therefore have to conclude that, in the counterexample world, the true demonstrator policy $\pi^d$ had nothing to do with producing paperclips; that was a wrong guess all along. The right demonstrator policy $\pi^d$ is one where the demonstrator always intended to get themselves killed.
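Schematically, and in my own notation rather than the paper’s exact theorem statement, the step in this argument is:

```latex
% Schematic paraphrase only; the paper's precise convergence notion may differ.
\[
  \pi_i^{\alpha}(\,\cdot \mid h_{<t}) \;\longrightarrow\; \pi^{d}(\,\cdot \mid h_{<t})
  \qquad \text{as } t \to \infty \text{ on the realized history.}
\]
% On the counterexample trajectory the realized actions are the null action from
% some step onward, so \pi^{d} must itself concentrate on the null action along
% that trajectory: on this history, \pi^{d} is a "get killed, then do nothing"
% policy rather than a paperclip-production policy.
```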
This would be a somewhat unusual solution to the inner alignment
problem.
The math in the paper has you working in a fixed-policy setting where the demonstrator policy $\pi^d$ is immutable/time-invariant. The snag is that this does not imply that the policy $\pi^d$ defines a behavioral trajectory that is independent of the internals of the agent construction. If the agent is constructed in a particular way and operates in a certain environment, it will force $\pi^d$ into a self-fulfilling trajectory where it kills the demonstrator.
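As a tiny illustration of that last point (my own toy example, not anything from the paper): a single fixed, time-invariant mapping from histories to actions can realize completely different trajectories depending on what the agent injects into the history it is evaluated on.

```python
# Toy illustration (my own, not from the paper): pi_d below is one fixed,
# time-invariant mapping from histories to actions, yet its realized trajectory
# depends entirely on which events the agent's behavior injects into that history.

def pi_d(history):
    """Fixed demonstrator policy: make paperclips while alive, null once killed."""
    return "null" if "demonstrator_killed" in history else "make_paperclip"

def rollout(events_by_step, steps=6):
    """Roll out pi_d against a given schedule of agent/environment events."""
    history, actions = [], []
    for step in range(steps):
        history.extend(events_by_step.get(step, []))
        actions.append(pi_d(history))
    return actions

# A benign agent never harms the demonstrator: paperclip production forever.
print(rollout({}))
# An agent whose top set turns treacherously at step 2 forces the very same
# pi_d onto a "get killed, then do nothing" trajectory.
print(rollout({2: ["demonstrator_killed"]}))
```

The function pi_d never changes; only the history it is fed does, and that is all the self-fulfilling trajectory needs.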
Side note: if anybody is looking for alternative math that allows one to study and manage the interplay between a mutable, time-dependent demonstrator policy and the agent policy, causal models seem to be the way to go. See for example here, where this is explored in a reward learning setting.