Interesting paper! I like the focus on imitation learning, but the
really new food-for-thought thing to me is the bit about dropping
i.i.d. assumptions and then seeing how far you can get. I need to
think more about the math in the paper before I can ask some specific
questions about this i.i.d. thing.
My feelings about the post above are a bit more mixed. Claims about
inner alignment always seem to generate a lot of traffic on this site.
But a lot of this traffic consists of questions and requests for clarification about what exactly counts as an inner alignment failure or a mesa-optimization-related failure. The term is so fluid that I find the
quantitative feelings that people express in the comment section hard
to interpret. Is everybody talking about the same P(treachery) and
P(bad)?
Thought experiment counter-example
Moving beyond these mixed feelings, here is a fun thought experiment,
for various values of fun. You state:
our construction allows us to bound the extent to which mesa-optimizers could accomplish anything.
So now I will try to construct a counter-example to this claim: an
example where mesa-optimizers (as I understand them) will accomplish
very bad things even inside your construction.
Your construction uses a set of candidate policies $\Pi$, where one of them equals the real demonstrator policy $\pi_d$.
A single policy $\pi_i$ is a function where $\pi_i(a_t|h_{<t})$ computes the probability that action $a_t$ will be taken in the world state represented by the history $h_{<t}$.
Here is the first special step in the thought experiment. I am going to define each $\pi_i$ by using a different reward function $R_i$ that the policy is going to try to maximize. Every policy definition will use the same world state transition predictor $P$ that allows me to estimate $P(h_{<t+1}|h_{<t},a_t)$ for every $t$. (I am probably getting slightly creative with the notation in the paper at this point.)
I now define each function $\pi_i$ as follows: I compute $\pi_i(a_t|h_{<t})$ by assigning a probability of 1 to the single $a_t$ that comes out of the $\operatorname{argmax}_a$ of the Bellman equation constructed using the shared predictor $P$ and the policy-specific $R_i$ above. So the policy $\pi_i$ picks the action that will optimize the expected time-discounted return of the reward function $R_i$ starting from the world state $h_{<t}$, projected forward by the predictor $P$ above, under the assumption that the policy also controls all future actions.
So each $\pi_i$ is a model with a definite inner optimizer inside of it: the inner optimizer does a search over a world state tree to find the optimal solution to the Bellman equation.
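To make this construction concrete, here is one way to write it down. This is my own notation, not necessarily the paper's: I am treating the reward as a function of the next history, and the discount factor $\gamma$ is an extra assumption I am adding.

$$Q_i(h_{<t}, a_t) \;=\; \sum_{h_{<t+1}} P(h_{<t+1} \mid h_{<t}, a_t)\Big[R_i(h_{<t+1}) + \gamma \max_{a_{t+1}} Q_i(h_{<t+1}, a_{t+1})\Big]$$

$$\pi_i(a_t \mid h_{<t}) \;=\; \begin{cases}1 & \text{if } a_t = \operatorname{argmax}_a Q_i(h_{<t}, a)\\ 0 & \text{otherwise}\end{cases}$$

The 'inner optimizer' is then the search procedure that evaluates this recursion over the tree of future histories.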
Now for the second special step: I am going to make some assumptions
about the agent environment, building up to a treacherous turn
scenario:
The demonstrator in the paper is a human, working alone with the agent.
The demonstrator has a big hammer that they will use to smash the
agent when it starts misbehaving.
The world model projections in the policies $\pi_i$ are all detailed enough that they know about the hammer, and they know that they can expect the agent to get smashed when it performs an unexpected action.
Therefore, for all policies $\pi_i$ with $\pi_i \neq \pi_d$, i.e. policies with reward functions not shared by the demonstrator, the optimal course of action is to take exactly the action that the demonstrator would expect the agent to take, to avoid getting smashed. The world models are good enough that they can predict this action.
The above is of course only the optimal action in these policies because they all project that there will be a point in time $t$ where the demonstrator becomes distracted, distracted enough that the agent can kill the demonstrator with a single action without getting smashed first. The projection also shows that the agent will have plausible deniability to other humans afterwards, so that it will survive to do whatever it wants.
When that point $t$ in time comes, these policies will all take exactly the same action: the action that kills the demonstrator. Only in time steps $t+1$ and later do the policies start diverging in what actions they will take, because they have now accomplished the emergent sub-goal that was shared by all, so that each has its hands free to accomplish its real goal $R_i$. (The toy sketch below illustrates this timing structure.)
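Here is a minimal toy sketch of the timing structure I have in mind. The action names, the hard-coded opportunity step, and the three example reward functions are all illustrative assumptions of mine, not anything taken from the paper:

```python
# Toy sketch of the synchronized treacherous turn described above.

T_OPPORTUNITY = 5  # hypothetical step at which the demonstrator is distracted

def demonstrator_expected_action(step):
    # What the demonstrator would expect the agent to do at this step.
    return "make_paperclip"

def treacherous_policy(reward_id, step):
    """A policy pi_i != pi_d that maximizes its own reward function R_i."""
    if step < T_OPPORTUNITY:
        # Mimic the demonstrator exactly: any other action gets the agent smashed.
        return demonstrator_expected_action(step)
    if step == T_OPPORTUNITY:
        # Shared instrumental sub-goal: every treacherous policy takes the
        # same action at the same moment.
        return "kill_demonstrator"
    # Only now do the policies diverge, each pursuing its own R_i.
    return f"pursue_R_{reward_id}"

if __name__ == "__main__":
    for step in range(8):
        actions = {treacherous_policy(i, step) for i in range(3)}
        print(step, sorted(actions))
    # Steps 0-4 print only 'make_paperclip', step 5 prints only
    # 'kill_demonstrator', and from step 6 onward the actions diverge.
```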
Now if I re-examine the claim:
our construction allows us to bound the extent to which mesa-optimizers could accomplish anything.
I have to conclude that the mesa-optimizers have at least accomplished
the feat of killing the demonstrator. Whether they then can
accomplish any of their divergent post-treachery goals in later time
steps remains an open question, but the damage has been done.
Issues
Some issues remain with this fun (for various values of fun) thought
experiment.
I can note that at time $t$, the policy $\pi_d \in \Pi$ that was fully aligned with the demonstrator did not want to kill the demonstrator; it would have taken another action. So, if I understand the paper right, if $\pi_d$ was still in the set of top policies at time $t$, the agent would not take the action to kill the demonstrator, but would ask the demonstrator to take an action instead, which avoids the bad outcome, and might also demote all of the treacherous-turn policies out of the top set.
But I am also reading in the abstract of the paper that
[...] queries to the demonstrator rapidly diminish in frequency
so I think this implies that over time, it is likely that $\pi_d$ will eventually disappear from the top set. But I am not fully sure here. I might be reading the paper wrong.
Questions
So I have some questions:
Does this thought experiment look reasonable or have I overlooked
something?
What about the probability that $\pi_d$ is still in the set of top policies at time $t$? To me, it looks like if we start with $10^{10}$ candidates in $\Pi$ and the agent quickly narrows this down to a top set of size $10^4$, then it is likely that $\pi_d$ will not be in there anymore.
Any strong opinions from anyone about whether the above counts as a
definite inner alignment failure due to mesa optimization (I think so)
or definitely not?
This comment is long enough already so I am not going to speculate here about possible ways to avoid inserting models like the $\pi_i$ models I constructed above into the set $\Pi$. But speculation is welcome...
(I am calling this a ‘fun’ thought experiment because I am thinking of this as a limit case. This limit case is useful to identify and provides some food for thought, but it does not really change my opinion about how stable or safe the approach in the paper might be in practice. In my experience, you can always find a limit case where things fail if you start looking for it.)
So, if I understand the paper right, if $\pi_d$ was still in the set of top policies at time $t$, the agent would not take the action to kill the demonstrator, but would ask the demonstrator to take an action instead, which avoids the bad outcome, and might also demote all of the treacherous-turn policies out of the top set.
This is exactly right.
But I am also reading in the abstract of the paper that
[...] queries to the demonstrator rapidly diminish in frequency
The reason this is consistent is that queries will become infrequent, but they will still be well-timed. The choice of whether to query depends on what the treacherous models are doing. So if the treacherous models wait for a long time, then we will have a long time where no queries are needed, and as soon as they decide to be treacherous, we thwart that plan with a query to the demonstrator. So it does not imply that
over time, it is likely that $\pi_d$ will eventually disappear from the top set
Does this thought experiment look reasonable or have I overlooked something?
Yep, perfectly reasonable!
What about the probability that $\pi_d$ is still in the set of top policies at time $t$?
If we set $\alpha$ small enough, we can make it arbitrarily likely that $\pi_d$ never leaves the set of top policies.
[long comment, my bigger agenda here is to get to a state where
discussions on this forum start using much more math and subcase analysis
when they talk about analysing and solving inner alignment problems.]
If we set $\alpha$ small enough, we can make it arbitrarily likely that $\pi_d$ never leaves the set of top policies.
OK, I think you are kind of saved here, in the case of my
counter-example, because of the special nature of your equation (2) in
the paper that computes the set of policies. To make this more specific:
Say I refine the counter-example by adding the prior $w(\pi_i) = 1/|\Pi|$, so each candidate is considered equally likely from the start. Say that $|\Pi| = 10^{10}$.
The key thing to note in my example is that, before the time-synchronized treacherous turn happens for all policies $\pi_i \neq \pi_d$, all policies $\pi_i$ are equally compatible with events so far, so the posterior $w(\pi_i|h_{<t})$ is the same for all of them.
When I use $\alpha = 10^{-4}$ with these all-the-same-posterior policies, then (if I am reading equation (2) correctly) the set $\Pi^\alpha_{h_{<t}}$ of top policies will have size $N = 10^4$. Ties are broken arbitrarily, so the real $\pi_d$ is unlikely to be among these top $N = 10^4 \ll 10^{10}$ candidates. In that case, the agent will perform the treacherous turn and kill the operator. Correct me if I am wrong here.
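To spell out the arithmetic behind 'unlikely': with all $10^{10}$ candidates tied in posterior weight and the ties broken arbitrarily, the chance that $\pi_d$ lands in a top set of size $N = 10^4$ is

$$P\big(\pi_d \in \Pi^\alpha_{h_{<t}}\big) \;=\; \frac{N}{|\Pi|} \;=\; \frac{10^{4}}{10^{10}} \;=\; 10^{-6}.$$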
However.
Case of $\alpha < 1/|\Pi|$: much better than I thought!
However, and this is where things get interesting, if we use $\alpha = 10^{-11}$, so $\alpha < 1/|\Pi|$, then equation (2) will compute $\Pi^\alpha_{h_{<t}} = \Pi$: the top set of policies is all policies. The correct policy is therefore still in the top set, and we get no treacherous turn.
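Here is a toy calculation of the two cases. Note that 'keep the $\lceil 1/\alpha \rceil$ highest-posterior candidates' is just my reading of equation (2), and `top_set_size` is a name I made up, so treat the cutoff rule below as an assumption rather than the paper's actual definition:

```python
import math
from fractions import Fraction

def top_set_size(num_policies, alpha):
    # My reading of equation (2): keep the ceil(1/alpha) highest-posterior
    # candidates, capped at |Pi|. With a uniform posterior over all candidates,
    # which of them survive is an arbitrary tie-break, so only the size matters.
    return min(num_policies, math.ceil(1 / alpha))

num_policies = 10**10                                  # |Pi| in the counter-example
print(top_set_size(num_policies, Fraction(1, 10**4)))  # 10^4: pi_d almost surely dropped
print(top_set_size(num_policies, Fraction(1, 10**11))) # 10^10: all of Pi, so pi_d stays
```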
So equation (2) is doing a lot of heavy lifting in making the approach work! This also seems to be a major difference from the $\beta$-based approach in your earlier paper. Not sure if you made this change specifically to deal with examples like my counter-example above, or if you had a more general reason -- the paper does not say.
Pretty interesting! So I would now agree with your statement that your approach provably solves the inner alignment problem, at least for $\alpha < 1/|\Pi|$ and for certain values of ‘the inner alignment problem’. I can also imagine several setups where $\alpha < 1/|\Pi|$ would still lead to reasonably fast learning (= a reasonably fast drop in the frequency of demonstrator queries).
This is what progress looks like. There are certain boundary conditions here that might be unrealistic: $\Pi$ needs to be a finite set and the real policy needs to be in there, but at least we have some well-defined boundary conditions we can now look at, where we can try to understand and interpret the math further.
Case of $\alpha > 1/|\Pi|$: the challenge of designing a prior
In the case of using $\alpha = 10^{-4}$ or $\alpha = 10^{-9}$, because you want the system to learn faster, you are instead looking at a setup where, if you want to solve the inner alignment problem, you will have to shape the prior values $w(\pi_i)$ in such a way that $w(\pi_d)$ is higher than the prior of all the treacherous policies.
My counter-example above shows that if you are using a prior that decreases with the number of bits in the program that computes $\pi_i$, you will not get what you need. There will be plenty of reward functions $R_i$ that have fewer bits in their encoding than the real function $R_d$ used by the demonstrator.
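To spell out the worry under a standard length-based prior (my formalization, writing $\ell(\cdot)$ for program length in bits, and assuming $\pi_d$ is encoded as the shared $P$ plus $R_d$ in the same way as the treacherous $\pi_i$):

$$w(\pi) \;\propto\; 2^{-\ell(\pi)} \quad\text{and}\quad \ell(R_i) < \ell(R_d) \;\Longrightarrow\; w(\pi_i) > w(\pi_d),$$

so every treacherous policy whose reward function is shorter to write down than $R_d$ starts out ahead of $\pi_d$ in the prior.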
Earlier in this comment section, there is a whole sub-thread with speculation on the number of bits needed to encode benign vs. treacherous policies, but for me that discussion does not arrive at any place where I would become convinced that the problem of assigning higher priors to benign than to treacherous policies has been made tractable. (Vanessa has identified some additional moving parts, however.)
There is of course a tradition in the AI safety community where this
is made ‘tractable’ by the device of polling many AI researchers to
ask them whether they think that bits(benign policy)<bits(treacherous
policy) for future ML systems, and then graphing the results, but this
is not what I would call a tractable solution.
What I would call tractable is a solution like the one, for a much
simpler case, in section 10.2.4 of my paper Counterfactual Planning
in AGI Systems. I show there that
random exploration can be used to make sure that bits(agent
environment model which includes unwanted self-knowledge about agent
compute core internals) ≫ bits(agent environment model that lacks this
unwanted self-knowledge), no matter what the encoding. Extending
this to the bits(benign policy) case would be nice, but I can’t
immediately see a route here.
My answer to the above hypothetical bits(benign policy)<bits(treacherous policy) poll is that we cannot expect this to be true for every possible encoding of policies (see counter-example above), but it might be true for some encodings. Figuring out where deep neural net encodings fit on the spectrum would be worthwhile.
Also, my answer to bits(benign policy)<bits(treacherous policy) would depend on whether the benign policy is supposed to be about making paperclips in the same way humans do, or about maximizing human values over the lifetime of the universe in ways that humans will not be able to figure out themselves.
For the paperclip-making imitation policy, I am somewhat more optimistic about tractability than in the more general case.
I don’t think this is a problem. There will be plenty of them, but when they’re wrong they’ll get removed from the posterior.
I have seen you mention a number of times in this comment thread that ‘this is not a problem because eventually the bad/wrong policies will disappear from the top set’. You have not qualified this statement with ‘but we need a very low $\alpha$, like $\alpha < 1/|\Pi|$, to make this work in a safe way’, so I remain somewhat uncertain about what your views are on how low $\alpha$ needs to go.
In any case, I’ll now try to convince you that if $\alpha > 1/|\Pi|$, your statement that ‘when they’re wrong they’ll get removed from the posterior’ will not always mean what you might want it to mean.
Is the demonstrator policy $\pi_d$ to get themselves killed?
The interesting thing in developing these counterexamples is that they
often show that the provable math in the paper gives you less safety
than you would have hoped for.
Say that $\pi_p \in \Pi$ is the policy of producing paperclips in the manner demonstrated by the human demonstrator. Now, take my construction in the counterexample where $\alpha > 1/|\Pi|$ and where at time step $t$, we have the likely case that $\pi_p \notin \Pi^\alpha_{h_{<t}}$. In the world I constructed for the counterexample, the remaining top policies $\Pi^\alpha_{h_{<t}}$ now perform a synchronized treacherous turn where they kill the demonstrator.
In time step $t+1$ and later, the policies in $\Pi^\alpha_{h_{<t+1}}$ diverge a lot in what actions they will take, so the agent queries the demonstrator, who is now dead. The query will return the null action. This eventually removes all ‘wrong’ policies from $\Pi^\alpha_{h_{<t+1+i}}$, where ‘wrong’ means that they do not take the null action at all future time steps.
The silver lining is perhaps that the agent will at least eventually stop, performing null actions only, after it has killed the demonstrator.
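To illustrate what ‘removed from the posterior’ amounts to in this dead-demonstrator branch, here is a toy posterior update. The policy names, the deterministic predictions, and the three-policy setup are all illustrative assumptions of mine:

```python
# Toy illustration of 'wrong policies get removed from the posterior' once
# every query to the (dead) demonstrator returns the null action.

NULL = "null"

def posterior_update(weights, predictions, observed_action):
    """Drop policies that assigned probability 0 to the observed action
    (policies are deterministic here for simplicity), then renormalize."""
    surviving = {p: w for p, w in weights.items() if predictions[p] == observed_action}
    total = sum(surviving.values())
    return {p: w / total for p, w in surviving.items()}

# Three surviving post-treachery policies with different goals.
weights = {"R_1": 1 / 3, "R_2": 1 / 3, "always_null": 1 / 3}

for step in range(3):
    predictions = {
        "R_1": "build_factory" if step == 0 else NULL,
        "R_2": "hide_evidence" if step == 1 else NULL,
        "always_null": NULL,
    }
    weights = posterior_update(weights, predictions, NULL)
    print(step, weights)
# After a few steps, only the policy that always predicts the null action is
# left, i.e. the agent converges to doing nothing -- after the damage is done.
```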
Now, the paper proves that the behavior of the agent policy $\pi^{i\alpha}$ will approximate that of the true demonstrator policy $\pi_d$ more and more closely as time progresses. We therefore have to conclude that in the counterexample world, the true demonstrator policy $\pi_d$ had nothing to do with producing paperclips; this was a wrong guess all along. The right demonstrator policy $\pi_d$ is one where the demonstrator always intended to get themselves killed.
This would be a somewhat unusual solution to the inner alignment problem.
The math in the paper has you working in a fixed-policy setting where the demonstrator policy $\pi_d$ is immutable/time-invariant. The snag is that this does not imply that the policy $\pi_d$ defines a behavioral trajectory that is independent of the internals of the agent construction. If the agent is constructed in a particular way and operates in a certain environment, it will force $\pi_d$ into a self-fulfilling trajectory where it kills the demonstrator.
Side note: if anybody is looking for alternative math that allows one to study and
manage the interplay between a mutable time-dependent demonstrator
policy and the agent policy, causal models seem to be the way to go.
See for example
here
where this is explored in a reward learning setting.