[long comment, my bigger agenda here is to get to a state where
discussions on this forum start using much more math and subcase analysis
when they talk about analysing and solving inner alignment problems.]
> If we set α small enough, we can make it arbitrarily like[ly] that π_d never leaves the set of top policies.
OK, I think you are kind of saved here, in the case of my
counter-example, because of the special nature of your equation (2) in
the paper, the one that computes the set of top policies. To make this more specific:
Say I refine the counter-example by adding the prior
w(π_i) = 1/|Π|, so each candidate is considered equally likely
from the start. Say that |Π| = 10^10.
The key thing to note in my example is that, before the
time-synchronized treacherous turn happens for all policies π_i ≠ π_d, all policies π_i are equally compatible with events
so far, so the posterior w(π_i|h<t) is the same for all of
them.
When I use α = 10^-4 with these all-the-same-posterior
policies, then (if I am reading equation (2) correctly) the set
Π^α_{h<t} of top policies will have size N = 10^4. Ties
are broken arbitrarily, so the real π_d is unlikely to be among
these top N = 10^4 << 10^10 candidates. In that case, the agent will
perform the treacherous turn and kill the operator. Correct me if I
am wrong here.
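To make ‘unlikely’ concrete: if ties among the equal-posterior policies are broken uniformly at random (my assumption; the construction only requires that ties are broken arbitrarily), the chance that π_d lands in the top set is

\[
\Pr\bigl[\pi_d \in \Pi^{\alpha}_{h_{<t}}\bigr] \;=\; \frac{N}{|\Pi|} \;=\; \frac{10^{4}}{10^{10}} \;=\; 10^{-6}.
\]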
However.
Case of α < 1/|Π|: much better than I thought!
However, and this is where things get interesting, if we use α = 10^-11, so α < 1/|Π|, then equation (2) will compute
Π^α_{h<t} = Π: the top set of policies is all
policies. The correct policy is therefore still in the top set, and we
get no treacherous turn.
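For concreteness, here is a minimal sketch of the arithmetic in both regimes, under my reading of equation (2) that, with an all-equal posterior, the top set can hold at most ⌊1/α⌋ policies (an assumption on my part; the paper's exact definition is what matters):

```python
# Toy sketch of the top-set size with an all-equal posterior.
# Assumption (mine, not the paper's exact equation (2)): the top set holds
# at most floor(1/alpha) policies, with ties broken uniformly at random.

def top_set_size(num_policies: int, alpha: float) -> int:
    """Number of policies retained in the top set."""
    return min(int(1 / alpha), num_policies)

def prob_pi_d_survives(num_policies: int, alpha: float) -> float:
    """Chance that the true policy pi_d ends up inside the top set."""
    return top_set_size(num_policies, alpha) / num_policies

print(prob_pi_d_survives(10**10, 1e-4))   # 1e-06: pi_d is almost surely dropped
print(prob_pi_d_survives(10**10, 1e-11))  # 1.0:   alpha < 1/|Pi|, nothing is dropped
```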
So equation (2) is doing a lot of heavy lifting in making the approach
work! This also seems to be a major difference from the β-based
approach in your earlier
paper.
Not sure if you made this change specifically to deal with examples
like my counter-example above, or if you had a more general reason --
the paper does not say.
Pretty interesting! So I would now agree with your statement that
your approach provably solves the inner alignment problem, at least
for α < 1/|Π| and for certain values of ‘the inner alignment
problem’. I can also imagine several setups where α < 1/|Π|
would still lead to reasonably fast learning (=reasonably fast drop in
frequency of demonstrator requests).
This is what progress looks like. There are certain boundary
conditions here that might be unrealistic: Π needs to be a finite
set and the real policy needs to be in there, but at least we have
some well-defined boundary conditions we can now look at, where we can
try to understand and interpret the math further.
Case of α > 1/|Π|: the challenge of designing a prior
In the case of using α = 10^-4 or α = 10^-9,
because you want the system to learn faster, you are instead looking at
a setup where, if you want to solve the inner alignment problem, you
will have to shape the prior values w(π_i) in such a way that
π_d gets a higher prior than all of the treacherous policies.
My counter-example above shows that if you are using 1/(number of
bits in the program that computes π_i) as the prior, you will not
get what you need. There will be plenty of functions R_i that have
fewer bits in their encoding than the real function R_d used by the
demonstrator.
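Spelled out as an inequality, with ℓ(·) for program length and assuming, as in my counter-example, that a treacherous π_i differs from π_d only in the reward function it switches to after the turn: for any prior w that does not increase with program length,

\[
\ell(R_i) < \ell(R_d) \;\Rightarrow\; \ell(\pi_i) < \ell(\pi_d) \;\Rightarrow\; w(\pi_i) \ge w(\pi_d),
\]

so π_d never gets the strictly higher prior that the α > 1/|Π| regime needs.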
Earlier in this comment section, there is a whole sub-thread with
speculation on the number of bits needed to encode benign
vs. treacherous policies, but for me that discussion does not yet
arrive at a place where I am convinced that the
problem of assigning higher priors to benign vs. treacherous policies
has been made tractable. (Vanessa has identified some additional
moving parts, however.)
There is of course a tradition in the AI safety community where this
is made ‘tractable’ by the device of polling many AI researchers to
ask them whether they think that bits(benign policy)<bits(treacherous
policy) for future ML systems, and then graphing the results, but this
is not what I would call a tractable solution.
What I would call tractable is a solution like the one, for a much
simpler case, in section 10.2.4 of my paper Counterfactual Planning
in AGI Systems. I show there that
random exploration can be used to make sure that bits(agent
environment model which includes unwanted self-knowledge about agent
compute core internals) ≫ bits(agent environment model that lacks this
unwanted self-knowledge), no matter what the encoding. Extending
this to the bits(benign policy) case would be nice, but I can’t
immediately see a route here.
My answer to the above hypothetical bits(benign
policy)<bits(treacherous policy) poll is that we cannot expect this
to be true for any possible encoding of policies (see counter-example
above), but it might be true for some encodings. Figuring out where
deep neural net encodings fit on the spectrum would be worthwhile.
Also, my answer to bits(benign policy)<bits(treacherous policy) would
depend on whether the benign policy is supposed to be about making
paperclips in the same way humans do, or about maximizing human values
over the lifetime of the universe in ways that humans will not be able
to figure out themselves.
For the paperclip making imitation policy, I am somewhat more
optimistic about tractability than in the more general case.
> I don’t think this is a problem. There will be plenty of them, but when they’re wrong they’ll get removed from the posterior.
I have seen you mention a number of times in this comment thread that
‘this is not a problem because eventually the bad/wrong policies will
disappear from the top set’. You have not qualified this statement
with ‘but we need a very low α, like α < 1/|Π|, to make
this work in a safe way’, so I remain somewhat uncertain about what your views
are on how low α needs to go.
In any case, I’ll now try to convince you that if α > 1/|Π|,
your statement that ‘when they’re wrong they’ll get removed from the
posterior’ will not always mean what you might want it to mean.
Is the demonstrator’s policy π_d to get themselves killed?
The interesting thing in developing these counterexamples is that they
often show that the provable math in the paper gives you less safety
than you would have hoped for.
Say that π_p ∈ Π is the policy of producing paperclips in the
manner demonstrated by the human demonstrator. Now, take my
construction in the counterexample where α > 1/|Π| and where
at time step t, we have the likely case that π_p ∉ Π^α_{h<t}. In the world I constructed for the
counterexample, the remaining top policies Π^α_{h<t} now
perform a synchronized treacherous turn where they kill the
demonstrator.
In time step t+1 and later, the policies in Π^α_{h<t+1}
diverge a lot in what actions they will take, so the agent queries the
demonstrator, who is now dead. The query will return the null
action. This eventually removes all ‘wrong’ policies from
Π^α_{h<t+1+i}, where ‘wrong’ now means policies that fail to take
the null action at all future time steps.
The silver lining is perhaps that the agent will at least eventually
stop, performing null actions only, after it has killed the
demonstrator.
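A toy illustration of this elimination dynamic (my construction, not the paper's algorithm; the policy behaviours below are hypothetical): after the turn, every query to the dead demonstrator returns the null action, so any remaining top policy that predicts a non-null action on a queried step drops out of the posterior.

```python
# After the treacherous turn, queries to the (dead) demonstrator always
# return the null action; any policy predicting something else on a
# queried step gets zero likelihood and is removed.

NULL = "null"

def surviving_policies(policies, queried_steps):
    """Keep only the policies consistent with a dead demonstrator:
    they must output the null action on every queried step."""
    return [p for p in policies
            if all(p(step) == NULL for step in queried_steps)]

# Hypothetical post-turn behaviours of the remaining top policies.
always_null   = lambda step: NULL                       # stops forever
grabs_control = lambda step: "seize_resources"          # treacherous follow-up
null_then_act = lambda step: NULL if step < 5 else "act"

top_set = [always_null, grabs_control, null_then_act]
print(len(surviving_policies(top_set, range(3))))   # 2: null_then_act still looks fine
print(len(surviving_policies(top_set, range(10))))  # 1: only always_null remains
```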
Now, the paper proves that the behavior of the agent policy
π^{iα} will approximate that of the true demonstrator policy
π_d more and more closely as time progresses. We therefore have to
conclude that in the counterexample world, the true demonstrator
policy π_d had nothing to do with producing paperclips; that was a
wrong guess all along. The right demonstrator policy π_d is one
where the demonstrator always intended to get themselves killed.
This would be a somewhat unusual solution to the inner alignment
problem.
The math in the paper has you working in a fixed-policy setting where
the demonstrator policy π_d is immutable/time-invariant. The snag
is that this does not imply that the policy π_d defines a
behavioral trajectory that is independent of the internals of the
agent construction. If the agent is constructed in a particular way
and operates in a certain environment, it will force π_d
into a self-fulfilling trajectory where it kills the demonstrator.
Side note: if anybody is looking for alternative math that allows one to study and
manage the interplay between a mutable time-dependent demonstrator
policy and the agent policy, causal models seem to be the way to go.
See for example
here
where this is explored in a reward learning setting.