[long comment, my bigger agenda here is to get to a state where
discussions on this forum start using much more math and subcase analysis
when they talk about analysing and solving inner alignment problems.]
> If we set α small enough, we can make it arbitrarily like[ly] that π_d never leaves the set of top policies.
OK, I think you are kind of saved here, in the case of my
counter-example, because of the special nature of your equation (2) in
the paper, the one that computes the set of top policies. To make this more specific:
Say I refine the counter-example by adding the prior
w(π_i) = 1/|Π|, so each candidate is considered equally likely
from the start. Say that |Π| = 10^10.
The key thing to note in my example is that, before the
time-synchronized treacherous turn happens for all policies π_i ≠ π_d, all policies π_i are equally compatible with events
so far, so the posterior w(π_i|h<t) is the same for all of
them.
When I use α = 10^-4 with these all-the-same-posterior
policies, then (if I am reading equation (2) correctly) the set
Π^α_{h<t} of top policies will have size N = 10^4. Ties
are broken arbitrarily, so the real π_d is unlikely to be among
these top N = 10^4 << 10^10 candidates. In that case, the agent will
perform the treacherous turn and kill the operator. Correct me if I
am wrong here.
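To make ‘unlikely’ concrete: if ties among the equal-posterior policies are broken uniformly at random (my assumption; the construction only requires that ties are broken arbitrarily), the chance that π_d lands in the top set is

\[
\Pr\bigl[\pi_d \in \Pi^{\alpha}_{h_{<t}}\bigr] \;=\; \frac{N}{|\Pi|} \;=\; \frac{10^{4}}{10^{10}} \;=\; 10^{-6}.
\]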
However.
Case of α < 1/|Π|: much better than I thought!
However, and this is where things get interesting, if we use α = 10^-11, so α < 1/|Π|, then equation (2) will compute
Π^α_{h<t} = Π: the top set of policies is all
policies. The correct policy is therefore still in the top set, and we
get no treacherous turn.
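For concreteness, here is a minimal sketch of the arithmetic in both regimes, under my reading of equation (2) that, with an all-equal posterior, the top set can hold at most ⌊1/α⌋ policies (an assumption on my part; the paper's exact definition is what matters):

```python
# Toy sketch of the top-set size with an all-equal posterior.
# Assumption (mine, not the paper's exact equation (2)): the top set holds
# at most floor(1/alpha) policies, with ties broken uniformly at random.

def top_set_size(num_policies: int, alpha: float) -> int:
    """Number of policies retained in the top set."""
    return min(int(1 / alpha), num_policies)

def prob_pi_d_survives(num_policies: int, alpha: float) -> float:
    """Chance that the true policy pi_d ends up inside the top set."""
    return top_set_size(num_policies, alpha) / num_policies

print(prob_pi_d_survives(10**10, 1e-4))   # 1e-06: pi_d is almost surely dropped
print(prob_pi_d_survives(10**10, 1e-11))  # 1.0:   alpha < 1/|Pi|, nothing is dropped
```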
So equation (2) is doing a lot of heavy lifting in making the approach
work! This also seems to be a major difference from the β-based
approach in your earlier
paper.
Not sure if you made this change specifically to deal with examples
like my counter-example above, or if you had a more general reason --
the paper does not say.
Pretty interesting! So I would now agree with your statement that
your approach provably solves the inner alignment problem, at least
for α < 1/|Π| and for certain values of ‘the inner alignment
problem’. I can also imagine several setups where α < 1/|Π|
would still lead to reasonably fast learning (=reasonably fast drop in
frequency of demonstrator requests).
This is what progress looks like. There are certain boundary
conditions here that might be unrealistic: Π needs to be a finite
set and the real policy needs to be in there, but at least we have
some well-defined boundary conditions we can now look at, where we can
try to understand and interpret the math further.
Case of α > 1/|Π|: the challenge of designing a prior
In the case of using α = 10^-4 or α = 10^-9,
because you want the system to learn faster, you are instead looking at
a setup where, if you want to solve the inner alignment problem, you
will have to shape the prior values w(π_i) in such a way that
π_d gets a higher prior than all of the treacherous policies.
My counter-example above shows that if you are using 1/(number of
bits in the program that computes π_i) as the prior, you will not
get what you need. There will be plenty of functions R_i that have
fewer bits in their encoding than the real function R_d used by the
demonstrator.
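Spelled out as an inequality, with ℓ(·) for program length and assuming, as in my counter-example, that a treacherous π_i differs from π_d only in the reward function it switches to after the turn: for any prior w that does not increase with program length,

\[
\ell(R_i) < \ell(R_d) \;\Rightarrow\; \ell(\pi_i) < \ell(\pi_d) \;\Rightarrow\; w(\pi_i) \ge w(\pi_d),
\]

so π_d never gets the strictly higher prior that the α > 1/|Π| regime needs.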
Earlier in this comment section, there is a whole sub-thread with
speculation on the number of bits needed to encode benign
vs. treacherous policies, but for me that discussion does not yet
arrive at a place where I am convinced that the
problem of assigning higher priors to benign vs. treacherous policies
has been made tractable. (Vanessa has identified some additional
moving parts, however.)
There is of course a tradition in the AI safety community where this
is made ‘tractable’ by the device of polling many AI researchers to
ask them whether they think that bits(benign policy)<bits(treacherous
policy) for future ML systems, and then graphing the results, but this
is not what I would call a tractable solution.
What I would call tractable is a solution like the one, for a much
simpler case, in section 10.2.4 of my paper Counterfactual Planning
in AGI Systems. I show there that
random exploration can be used to make sure that bits(agent
environment model which includes unwanted self-knowledge about agent
compute core internals) ≫ bits(agent environment model that lacks this
unwanted self-knowledge), no matter what the encoding. Extending
this to the bits(benign policy) case would be nice, but I can’t
immediately see a route here.
My answer to the above hypothetical bits(benign
policy)<bits(treacherous policy) poll is that we cannot expect this
to be true for any possible encoding of policies (see counter-example
above), but it might be true for some encodings. Figuring out where
deep neural net encodings fit on the spectrum would be worthwhile.
Also, my answer to bits(benign policy)<bits(treacherous policy) would
depend on whether the benign policy is supposed to be about making
paperclips in the same way humans do, or about maximizing human values
over the lifetime of the universe in ways that humans will not be able
to figure out themselves.
For the paperclip making imitation policy, I am somewhat more
optimistic about tractability than in the more general case.
> I don’t think this is a problem. There will be plenty of them, but when they’re wrong they’ll get removed from the posterior.
I have seen you mention a number of times in this comment thread that
‘this is not a problem because eventually the bad/wrong policies will
disappear from the top set’. You have not qualified this statement
with ‘but we need a very low α, like α < 1/|Π|, to make
this work in a safe way’, so I remain somewhat uncertain about what your views
are on how low α needs to go.
In any case, I’ll now try to convince you that if α > 1/|Π|,
your statement that ‘when they’re wrong they’ll get removed from the
posterior’ will not always mean what you might want it to mean.
Is the demonstrator’s policy π_d to get themselves killed?
The interesting thing in developing these counterexamples is that they
often show that the provable math in the paper gives you less safety
than you would have hoped for.
Say that π_p ∈ Π is the policy of producing paperclips in the
manner demonstrated by the human demonstrator. Now, take my
construction in the counterexample where α > 1/|Π| and where
at time step t, we have the likely case that π_p ∉ Π^α_{h<t}. In the world I constructed for the
counterexample, the remaining top policies Π^α_{h<t} now
perform a synchronized treacherous turn where they kill the
demonstrator.
In time step t+1 and later, the policies in Π^α_{h<t+1}
diverge a lot in what actions they will take, so the agent queries the
demonstrator, who is now dead. The query will return the null
action. This eventually removes all ‘wrong’ policies from
Π^α_{h<t+1+i}, where ‘wrong’ now means policies that fail to take
the null action at all future time steps.
The silver lining is perhaps that the agent will at least eventually
stop, performing null actions only, after it has killed the
demonstrator.
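A toy illustration of this elimination dynamic (my construction, not the paper's algorithm; the policy behaviours below are hypothetical): after the turn, every query to the dead demonstrator returns the null action, so any remaining top policy that predicts a non-null action on a queried step drops out of the posterior.

```python
# After the treacherous turn, queries to the (dead) demonstrator always
# return the null action; any policy predicting something else on a
# queried step gets zero likelihood and is removed.

NULL = "null"

def surviving_policies(policies, queried_steps):
    """Keep only the policies consistent with a dead demonstrator:
    they must output the null action on every queried step."""
    return [p for p in policies
            if all(p(step) == NULL for step in queried_steps)]

# Hypothetical post-turn behaviours of the remaining top policies.
always_null   = lambda step: NULL                       # stops forever
grabs_control = lambda step: "seize_resources"          # treacherous follow-up
null_then_act = lambda step: NULL if step < 5 else "act"

top_set = [always_null, grabs_control, null_then_act]
print(len(surviving_policies(top_set, range(3))))   # 2: null_then_act still looks fine
print(len(surviving_policies(top_set, range(10))))  # 1: only always_null remains
```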
Now, the paper proves that the behavior of the agent policy
π^{iα} will approximate that of the true demonstrator policy
π_d more and more closely as time progresses. We therefore have to
conclude that in the counterexample world, the true demonstrator
policy π_d had nothing to do with producing paperclips; that was a
wrong guess all along. The right demonstrator policy π_d is one
where the demonstrator always intended to get themselves killed.
This would be a somewhat unusual solution to the inner alignment
problem.
The math in the paper has you working in a fixed-policy setting where
the demonstrator policy π_d is immutable/time-invariant. The snag
is that this does not imply that the policy π_d defines a
behavioral trajectory that is independent of the internals of the
agent construction. If the agent is constructed in a particular way
and operates in a certain environment, it will force π_d
into a self-fulfilling trajectory where it kills the demonstrator.
Side note: if anybody is looking for alternative math that allows one to study and
manage the interplay between a mutable time-dependent demonstrator
policy and the agent policy, causal models seem to be the way to go.
See for example
here
where this is explored in a reward learning setting.