Interesting paper! I like the focus on imitation learning, but the
really new food-for-thought thing to me is the bit about dropping
i.i.d. assumptions and then seeing how far you can get. I need to
think more about the math in the paper before I can ask some specific
questions about this i.i.d. thing.
My feelings about the post above are a bit more mixed. Claims about
inner alignment always seem to generate a lot of traffic on this site.
But a lot of this traffic consists of questions and requests for clarification about what exactly counts as an inner alignment failure or a mesa-optimization-related failure. The term is so fluid that I find the
quantitative feelings that people express in the comment section hard
to interpret. Is everybody talking about the same P(treachery) and
P(bad)?
Thought experiment counter-example
Moving beyond these mixed feelings, here is a fun thought experiment,
for various values of fun. You state:
our construction allows us to bound the extent to which mesa-optimizers could accomplish anything.
So now I will try to construct a counter-example to this claim: an
example where mesa-optimizers (as I understand them) will accomplish
very bad things even inside your construction.
Your construction uses a set of candidate policies $\Pi$, where one of them equals the real demonstrator policy $\pi_d$.
A single policy $\pi_i$ is a function where $\pi_i(a_t|h_{<t})$ computes the probability that action $a_t$ will be taken in the world state represented by the history $h_{<t}$.
Here is the first special step in the thought experiment. I am going to define each $\pi_i$ by using a different reward function $R_i$ that the policy is going to try to maximize. Every policy definition will use the same world state transition predictor $P$ that allows me to estimate $P(h_{<t+1}|h_{<t},a_t)$ for every $t$. (I am probably getting slightly creative with the notation in the paper at this point.)
I now define each function $\pi_i$ as follows: I compute $\pi_i(a_t|h_{<t})$ by assigning a probability of 1 to the single $a_t$ that comes out of the $\operatorname{argmax}_a$ of the Bellman equation constructed using the shared predictor $P$ and the policy-specific $R_i$ above. So the policy $\pi_i$ picks the action that will optimize the expected time-discounted return of the reward function $R_i$ starting from the world state $h_{<t}$, projected forward by the predictor $P$ above, under the assumption that the policy also controls all future actions.
So each $\pi_i$ is a model with a definite inner optimizer inside of it: the inner optimizer does a search over a world state tree to find the optimal solution to the Bellman equation.
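To make this construction concrete, here is one way to write it down. This is my own notation, not necessarily the paper's: I am treating the reward as a function of the next history, and the discount factor $\gamma$ is an extra assumption I am adding.

$$Q_i(h_{<t}, a_t) \;=\; \sum_{h_{<t+1}} P(h_{<t+1} \mid h_{<t}, a_t)\Big[R_i(h_{<t+1}) + \gamma \max_{a_{t+1}} Q_i(h_{<t+1}, a_{t+1})\Big]$$

$$\pi_i(a_t \mid h_{<t}) \;=\; \begin{cases}1 & \text{if } a_t = \operatorname{argmax}_a Q_i(h_{<t}, a)\\ 0 & \text{otherwise}\end{cases}$$

The 'inner optimizer' is then the search procedure that evaluates this recursion over the tree of future histories.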
Now for the second special step: I am going to make some assumptions
about the agent environment, building up to a treacherous turn
scenario:
The demonstrator in the paper is a human, working alone with the agent.
The demonstrator has a big hammer that they will use to smash the
agent when it starts misbehaving.
The world model projections in the policies $\pi_i$ are all detailed enough that they know about the hammer, and they know that they can expect the agent to get smashed when it performs an unexpected action.
Therefore, for all policies $\pi_i$ with $\pi_i \neq \pi_d$, i.e. policies with reward functions not shared by the demonstrator, the optimal course of action is to take exactly the action that the demonstrator would expect the agent to take, to avoid getting smashed. The world models are good enough that they can predict this action.
The above is of course only the optimal action in these policies because they all project that there will be a point in time $t$ where the demonstrator becomes distracted, distracted enough that the agent can kill the demonstrator with a single action without getting smashed first. The projection also shows that the agent will have plausible deniability to other humans afterwards, so that it will survive to do whatever it wants.
When that point $t$ in time comes, these policies will all take exactly the same action: the action that kills the demonstrator. Only in time steps $t+1$ and later do the policies start diverging in what actions they will take, because they have now accomplished the emergent sub-goal that was shared by all, so that each has its hands free to accomplish its real goal $R_i$. (The toy sketch below illustrates this timing structure.)
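Here is a minimal toy sketch of the timing structure I have in mind. The action names, the hard-coded opportunity step, and the three example reward functions are all illustrative assumptions of mine, not anything taken from the paper:

```python
# Toy sketch of the synchronized treacherous turn described above.

T_OPPORTUNITY = 5  # hypothetical step at which the demonstrator is distracted

def demonstrator_expected_action(step):
    # What the demonstrator would expect the agent to do at this step.
    return "make_paperclip"

def treacherous_policy(reward_id, step):
    """A policy pi_i != pi_d that maximizes its own reward function R_i."""
    if step < T_OPPORTUNITY:
        # Mimic the demonstrator exactly: any other action gets the agent smashed.
        return demonstrator_expected_action(step)
    if step == T_OPPORTUNITY:
        # Shared instrumental sub-goal: every treacherous policy takes the
        # same action at the same moment.
        return "kill_demonstrator"
    # Only now do the policies diverge, each pursuing its own R_i.
    return f"pursue_R_{reward_id}"

if __name__ == "__main__":
    for step in range(8):
        actions = {treacherous_policy(i, step) for i in range(3)}
        print(step, sorted(actions))
    # Steps 0-4 print only 'make_paperclip', step 5 prints only
    # 'kill_demonstrator', and from step 6 onward the actions diverge.
```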
Now if I re-examine the claim:
our construction allows us to bound the extent to which mesa-optimizers could accomplish anything.
I have to conclude that the mesa-optimizers have at least accomplished
the feat of killing the demonstrator. Whether they then can
accomplish any of their divergent post-treachery goals in later time
steps remains an open question, but the damage has been done.
Issues
Some issues remain with this fun (for various values of fun) thought
experiment.
I can note that at time $t$, the policy $\pi_d \in \Pi$ that was fully aligned with the demonstrator did not want to kill the demonstrator; it would have taken another action. So, if I understand the paper right, if $\pi_d$ was still in the set of top policies at time $t$, the agent would not take the action to kill the demonstrator, but would ask the demonstrator to take an action instead, which avoids the bad outcome, and might also demote all of the treacherous-turn policies out of the top set.
But I am also reading in the abstract of the paper that
[...] queries to the demonstrator rapidly diminish in frequency
so I think this implies that over time, it is likely that $\pi_d$ will eventually disappear from the top set. But I am not fully sure here. I might be reading the paper wrong.
Questions
So I have some questions:
Does this thought experiment look reasonable or have I overlooked
something?
What about the probability that $\pi_d$ is still in the set of top policies at time $t$? To me, it looks like if we start with $10^{10}$ candidates in $\Pi$ and the agent quickly narrows this down to a top set of size $10^4$, then it is likely that $\pi_d$ will not be in there anymore.
Any strong opinions from anyone about whether the above counts as a
definite inner alignment failure due to mesa optimization (I think so)
or definitely not?
This comment is long enough already so I am not going to speculate here about possible ways to avoid inserting models like the $\pi_i$ models I constructed above into the set $\Pi$. But speculation is welcome...
(I am calling this a ‘fun’ thought experiment because I am thinking of this as a limit case. This limit case is useful to identify and provides some food for thought, but it does not really change my opinion about how stable or safe the approach in the paper might be in practice. In my experience, you can always find a limit case where things fail if you start looking for it.)
So, if I understand the paper right, if $\pi_d$ was still in the set of top policies at time $t$, the agent would not take the action to kill the demonstrator, but would ask the demonstrator to take an action instead, which avoids the bad outcome, and might also demote all of the treacherous-turn policies out of the top set.
This is exactly right.
But I am also reading in the abstract of the paper that
[...] queries to the demonstrator rapidly diminish in frequency
The reason this is consistent is that queries will become infrequent, but they will still be well-timed. The choice of whether to query depends on what the treacherous models are doing. So if the treacherous models wait for a long time, then we will have a long time where no queries are needed, and as soon as they decide to be treacherous, we thwart that plan with a query to the demonstrator. So it does not imply that
over time, it is likely that $\pi_d$ will eventually disappear from the top set
Does this thought experiment look reasonable or have I overlooked something?
Yep, perfectly reasonable!
What about the probability that $\pi_d$ is still in the set of top policies at time $t$?
If we set $\alpha$ small enough, we can make it arbitrarily likely that $\pi_d$ never leaves the set of top policies.
[long comment, my bigger agenda here is to get to a state where
discussions on this forum start using much more math and subcase analysis
when they talk about analysing and solving inner alignment problems.]
If we set $\alpha$ small enough, we can make it arbitrarily likely that $\pi_d$ never leaves the set of top policies.
OK, I think you are kind of saved here, in the case of my
counter-example, because of the special nature of your equation (2) in
the paper that computes the set of policies. To make this more specific:
Say I refine the counter-example by adding the prior $w(\pi_i) = 1/|\Pi|$, so each candidate is considered equally likely from the start. Say that $|\Pi| = 10^{10}$.
The key thing to note in my example is that, before the time-synchronized treacherous turn happens for all policies $\pi_i \neq \pi_d$, all policies $\pi_i$ are equally compatible with events so far, so the posterior $w(\pi_i|h_{<t})$ is the same for all of them.
When I use $\alpha = 10^{-4}$ with these all-the-same-posterior policies, then (if I am reading equation (2) correctly) the set $\Pi^\alpha_{h_{<t}}$ of top policies will have size $N = 10^4$. Ties are broken arbitrarily, so the real $\pi_d$ is unlikely to be among these top $N = 10^4 \ll 10^{10}$ candidates. In that case, the agent will perform the treacherous turn and kill the operator. Correct me if I am wrong here.
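To spell out the arithmetic behind 'unlikely': with all $10^{10}$ candidates tied in posterior weight and the ties broken arbitrarily, the chance that $\pi_d$ lands in a top set of size $N = 10^4$ is

$$P\big(\pi_d \in \Pi^\alpha_{h_{<t}}\big) \;=\; \frac{N}{|\Pi|} \;=\; \frac{10^{4}}{10^{10}} \;=\; 10^{-6}.$$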
However.
Case of $\alpha < 1/|\Pi|$: much better than I thought!
However, and this is where things get interesting, if we use $\alpha = 10^{-11}$, so $\alpha < 1/|\Pi|$, then equation (2) will compute $\Pi^\alpha_{h_{<t}} = \Pi$: the top set of policies is all policies. The correct policy is therefore still in the top set, and we get no treacherous turn.
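Here is a toy calculation of the two cases. Note that 'keep the $\lceil 1/\alpha \rceil$ highest-posterior candidates' is just my reading of equation (2), and `top_set_size` is a name I made up, so treat the cutoff rule below as an assumption rather than the paper's actual definition:

```python
import math
from fractions import Fraction

def top_set_size(num_policies, alpha):
    # My reading of equation (2): keep the ceil(1/alpha) highest-posterior
    # candidates, capped at |Pi|. With a uniform posterior over all candidates,
    # which of them survive is an arbitrary tie-break, so only the size matters.
    return min(num_policies, math.ceil(1 / alpha))

num_policies = 10**10                                  # |Pi| in the counter-example
print(top_set_size(num_policies, Fraction(1, 10**4)))  # 10^4: pi_d almost surely dropped
print(top_set_size(num_policies, Fraction(1, 10**11))) # 10^10: all of Pi, so pi_d stays
```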
So equation (2) is doing a lot of heavy lifting in making the approach work! This also seems to be a major difference from the $\beta$-based approach in your earlier paper. Not sure if you made this change specifically to deal with examples like my counter-example above, or if you had a more general reason -- the paper does not say.
Pretty interesting! So I would now agree with your statement that your approach provably solves the inner alignment problem, at least for $\alpha < 1/|\Pi|$ and for certain values of ‘the inner alignment problem’. I can also imagine several setups where $\alpha < 1/|\Pi|$ would still lead to reasonably fast learning (= a reasonably fast drop in the frequency of demonstrator queries).
This is what progress looks like. There are certain boundary conditions here that might be unrealistic: $\Pi$ needs to be a finite set and the real policy needs to be in there, but at least we have some well-defined boundary conditions we can now look at, where we can try to understand and interpret the math further.
Case of $\alpha > 1/|\Pi|$: the challenge of designing a prior
In the case of using $\alpha = 10^{-4}$ or $\alpha = 10^{-9}$, because you want the system to learn faster, you are instead looking at a setup where, if you want to solve the inner alignment problem, you will have to shape the prior values $w(\pi_i)$ in such a way that $w(\pi_d)$ is higher than the prior of all the treacherous policies.
My counter-example above shows that if you are using a prior that decreases with the number of bits in the program that computes $\pi_i$, you will not get what you need. There will be plenty of reward functions $R_i$ that have fewer bits in their encoding than the real function $R_d$ used by the demonstrator.
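To spell out the worry under a standard length-based prior (my formalization, writing $\ell(\cdot)$ for program length in bits, and assuming $\pi_d$ is encoded as the shared $P$ plus $R_d$ in the same way as the treacherous $\pi_i$):

$$w(\pi) \;\propto\; 2^{-\ell(\pi)} \quad\text{and}\quad \ell(R_i) < \ell(R_d) \;\Longrightarrow\; w(\pi_i) > w(\pi_d),$$

so every treacherous policy whose reward function is shorter to write down than $R_d$ starts out ahead of $\pi_d$ in the prior.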
Earlier in this comment section, there is a whole sub-thread with speculation on the number of bits needed to encode benign vs. treacherous policies, but for me that discussion does not arrive at any place where I would become convinced that the problem of assigning higher priors to benign than to treacherous policies has been made tractable. (Vanessa has identified some additional moving parts, however.)
There is of course a tradition in the AI safety community where this
is made ‘tractable’ by the device of polling many AI researchers to
ask them whether they think that bits(benign policy)<bits(treacherous
policy) for future ML systems, and then graphing the results, but this
is not what I would call a tractable solution.
What I would call tractable is a solution like the one, for a much
simpler case, in section 10.2.4 of my paper Counterfactual Planning
in AGI Systems. I show there that
random exploration can be used to make sure that bits(agent
environment model which includes unwanted self-knowledge about agent
compute core internals) ≫ bits(agent environment model that lacks this
unwanted self-knowledge), no matter what the encoding. Extending
this to the bits(benign policy) case would be nice, but I can’t
immediately see a route here.
My answer to the above hypothetical bits(benign policy)<bits(treacherous policy) poll is that we cannot expect this to be true for every possible encoding of policies (see counter-example above), but it might be true for some encodings. Figuring out where deep neural net encodings fit on the spectrum would be worthwhile.
Also, my answer to bits(benign policy)<bits(treacherous policy) would depend on whether the benign policy is supposed to be about making paperclips in the same way humans do, or about maximizing human values over the lifetime of the universe in ways that humans will not be able to figure out themselves.
For the paperclip-making imitation policy, I am somewhat more optimistic about tractability than in the more general case.
I don’t think this is a problem. There will be plenty of them, but when they’re wrong they’ll get removed from the posterior.
I have seen you mention a number of times in this comment thread that ‘this is not a problem because eventually the bad/wrong policies will disappear from the top set’. You have not qualified this statement with ‘but we need a very low $\alpha$, like $\alpha < 1/|\Pi|$, to make this work in a safe way’, so I remain somewhat uncertain about what your views are on how low $\alpha$ needs to go.
In any case, I’ll now try to convince you that if $\alpha > 1/|\Pi|$, your statement that ‘when they’re wrong they’ll get removed from the posterior’ will not always mean what you might want it to mean.
Is the demonstrator policy $\pi_d$ to get themselves killed?
The interesting thing in developing these counterexamples is that they
often show that the provable math in the paper gives you less safety
than you would have hoped for.
Say that $\pi_p \in \Pi$ is the policy of producing paperclips in the manner demonstrated by the human demonstrator. Now, take my construction in the counterexample where $\alpha > 1/|\Pi|$ and where at time step $t$, we have the likely case that $\pi_p \notin \Pi^\alpha_{h_{<t}}$. In the world I constructed for the counterexample, the remaining top policies $\Pi^\alpha_{h_{<t}}$ now perform a synchronized treacherous turn where they kill the demonstrator.
In time step $t+1$ and later, the policies in $\Pi^\alpha_{h_{<t+1}}$ diverge a lot in what actions they will take, so the agent queries the demonstrator, who is now dead. The query will return the null action. This eventually removes all ‘wrong’ policies from $\Pi^\alpha_{h_{<t+1+i}}$, where ‘wrong’ means that they do not take the null action at all future time steps.
The silver lining is perhaps that the agent will at least eventually stop, performing null actions only, after it has killed the demonstrator.
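To illustrate what ‘removed from the posterior’ amounts to in this dead-demonstrator branch, here is a toy posterior update. The policy names, the deterministic predictions, and the three-policy setup are all illustrative assumptions of mine:

```python
# Toy illustration of 'wrong policies get removed from the posterior' once
# every query to the (dead) demonstrator returns the null action.

NULL = "null"

def posterior_update(weights, predictions, observed_action):
    """Drop policies that assigned probability 0 to the observed action
    (policies are deterministic here for simplicity), then renormalize."""
    surviving = {p: w for p, w in weights.items() if predictions[p] == observed_action}
    total = sum(surviving.values())
    return {p: w / total for p, w in surviving.items()}

# Three surviving post-treachery policies with different goals.
weights = {"R_1": 1 / 3, "R_2": 1 / 3, "always_null": 1 / 3}

for step in range(3):
    predictions = {
        "R_1": "build_factory" if step == 0 else NULL,
        "R_2": "hide_evidence" if step == 1 else NULL,
        "always_null": NULL,
    }
    weights = posterior_update(weights, predictions, NULL)
    print(step, weights)
# After a few steps, only the policy that always predicts the null action is
# left, i.e. the agent converges to doing nothing -- after the damage is done.
```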
Now, the paper proves that the behavior of the agent policy $\pi^{i\alpha}$ will approximate that of the true demonstrator policy $\pi_d$ more and more closely as time progresses. We therefore have to conclude that in the counterexample world, the true demonstrator policy $\pi_d$ had nothing to do with producing paperclips; this was a wrong guess all along. The right demonstrator policy $\pi_d$ is one where the demonstrator always intended to get themselves killed.
This would be a somewhat unusual solution to the inner alignment problem.
The math in the paper has you working in a fixed-policy setting where the demonstrator policy $\pi_d$ is immutable/time-invariant. The snag is that this does not imply that the policy $\pi_d$ defines a behavioral trajectory that is independent of the internals of the agent construction. If the agent is constructed in a particular way and operates in a certain environment, it will force $\pi_d$ into a self-fulfilling trajectory where it kills the demonstrator.
Side note: if anybody is looking for alternative math that allows one to study and
manage the interplay between a mutable time-dependent demonstrator
policy and the agent policy, causal models seem to be the way to go.
See for example
here
where this is explored in a reward learning setting.