Well, I still don’t find your argument convincing. You haven’t given any instrumental convergence theorem, nor have you updated your informal instrumental convergence argument to bypass my objection.
Hm I don’t think your objection applies to what I’ve written? I don’t assume anything about using a loss like L. In the post I explicitly talk about offline training where the data distribution is fixed.
Taking a guess at where the disagreement lies, I think it’s where you say
And L∗ seems much more tame than L to me.
L∗ does not in fact look ‘tame’ (by which I mean safe to optimize) to me. I’m happy to explain why, but without seeing your reasoning behind the quoted statement I can only rehash the things I say in the post.
You haven’t given any instrumental convergence theorem
I wish :) I’m not nearly as certain of anything I say in this post as I’d be of a theorem!
Fundamentally, the problem is this:
The worry is that the predictive model will output suboptimal predictions in the short run in order to set up conditions for better predictions later.
Now, suppose somehow some part of the predictive model gets the idea to do that. In that case, the predictions will be, well, suboptimal; it will make errors, so this part of the predictive model will have a negative gradient against it. If we were optimizing it to be agentic (e.g. using L), this negative gradient would be counterbalanced by a positive gradient that could strongly reinforce it. But since we’re not doing that, there’s nothing to counteract the negative gradient that removes the inner optimizer.
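As a toy illustration of that last step (a sketch under made-up assumptions, not a model of a real predictor): take a linear predictor trained offline on fixed data, and bolt on a hypothetical `gate` parameter that adds a constant distortion `STEER_OFFSET` to every prediction, standing in for the part that trades accuracy for influence. All names and numbers are invented for the example. With only the prediction loss in play, plain gradient descent drives the gate towards zero.

```python
# Toy sketch: offline regression with an extra "steering" component.
# XS, YS, STEER_OFFSET and gate are hypothetical, chosen only for illustration.

XS = [-2.0, -1.0, 1.0, 2.0]
YS = [2.0 * x for x in XS]   # fixed offline data: y = 2x
STEER_OFFSET = 1.0           # distortion the steering part adds to each prediction


def predict(w, gate, x):
    """Ordinary prediction plus the gated distortion."""
    return w * x + gate * STEER_OFFSET


def grads(w, gate):
    """Gradients of the mean squared prediction error w.r.t. w and gate."""
    n = len(XS)
    grad_w = grad_gate = 0.0
    for x, y in zip(XS, YS):
        err = predict(w, gate, x) - y
        grad_w += 2.0 * err * x / n
        grad_gate += 2.0 * err * STEER_OFFSET / n
    return grad_w, grad_gate


def train(w=0.0, gate=1.0, lr=0.05, steps=500):
    """Plain gradient descent on the prediction loss only."""
    for _ in range(steps):
        grad_w, grad_gate = grads(w, gate)
        w -= lr * grad_w
        gate -= lr * grad_gate
    return w, gate


final_w, final_gate = train()
```

The sketch deliberately leaves out a component that behaves identically to the benign predictor during training; in that case the gradient on it would be zero rather than negative.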
Hm I don’t think your objection applies to what I’ve written? I don’t assume anything about using a loss like L. In the post I explicitly talk about offline training where the data distribution is fixed.
Well, you assume you’ll end up with a consequentialist reasoner with an inner objective along the lines of L.
L∗ does not in fact look ‘tame’ (by which I mean safe to optimize) to me. I’m happy to explain why, but without seeing your reasoning behind the quoted statement I can only rehash the things I say in the post.
Suppose the model outputs a prediction that makes future predictions easier somehow. What effect will that have on $L^*$? Well, $L^*(\mu) = L_\mu(\mu) - \max_m L_\mu(m)$, and it may increase $L_\mu(\mu)$, so you might think it would be expected to increase $L^*$. But presumably it would also increase $\max_m L_\mu(m)$, cancelling out the increase in $L_\mu(\mu)$.
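One way to spell out the cancellation, as an idealized sketch: read $L_\mu(m)$ as the score of model $m$ on the data that results from deploying $\mu$, and suppose that switching from $\mu$ to a manipulating strategy $\mu'$ makes prediction easier by the same amount $\delta$ for every model. Both the reading of the notation and the uniform shift are assumptions made for illustration.

```latex
\begin{align*}
L_{\mu'}(m) &= L_{\mu}(m) + \delta \quad \text{for all } m \\
L^*(\mu') &= L_{\mu'}(\mu') - \max_m L_{\mu'}(m) \\
          &= \bigl(L_{\mu}(\mu') + \delta\bigr) - \bigl(\max_m L_{\mu}(m) + \delta\bigr) \\
          &= L_{\mu}(\mu') - \max_m L_{\mu}(m)
\end{align*}
```

Under these assumptions $\delta$ drops out entirely, and what remains is only how well $\mu'$ predicts in the original environment. If the shift is not uniform across models, the cancellation is only partial.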
But since we’re not doing that, there’s nothing to counteract the negative gradient that removes the inner optimizer.
During training, the inner optimizer has the same behavior as the benign model: while it’s still dumb it just doesn’t know how to do better; when it becomes smarter and reaches strategic awareness it will be deceptive.
So training does not select for a benign model over a consequentialist one (or at least it does not obviously select for a benign model; I don’t know how the inductive biases will work out here). Once the consequentialist acts and takes over the training process it is already too late.
Re: tameness of $L^*(\mu) = L_\mu(\mu) - \min_m L_\mu(m)$ (using min because L is a loss), some things that come to mind are
a) $L^*$ is bounded below by zero (for models inside the class the min is taken over), so it can be minimized by a strategy that takes over the input channel and induces random noise so that no strategy can do better than random, thus $L_\mu(\mu) \approx \min_m L_\mu(m)$.
b) Depending on which model class the min is taken over, the model can get less than zero loss by hacking its environment to get more compute (thus escaping the model class in the min)
(probably this list can be extended)
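Point (a) can be made concrete with a small sketch. The setup is assumed for illustration only: outcomes are Bernoulli, the loss is expected log loss, and the model class is a hypothetical grid of constant predictors. In this setup a strategy that leaves the data alone and predicts it well, and a strategy that turns the data into a coin flip and predicts 0.5, both reach the floor of $L^*$.

```python
import math


def cross_entropy(p, q):
    """Expected log loss of predicting q when outcomes are Bernoulli(p)."""
    return -(p * math.log(q) + (1.0 - p) * math.log(1.0 - q))


# Hypothetical finite model class: constant predictors on a grid in (0, 1).
MODEL_CLASS = [i / 100 for i in range(1, 100)]


def l_star(induced_p, own_q):
    """L*(mu) = L_mu(mu) - min_m L_mu(m).

    The strategy mu is summarized by the outcome probability it induces
    (induced_p) and the prediction it makes (own_q).
    """
    best = min(cross_entropy(induced_p, q) for q in MODEL_CLASS)
    return cross_entropy(induced_p, own_q) - best


benign = l_star(induced_p=0.9, own_q=0.9)         # leaves the data alone, predicts it well
noise_maker = l_star(induced_p=0.5, own_q=0.5)    # forces a coin flip, predicts 0.5
miscalibrated = l_star(induced_p=0.9, own_q=0.6)  # leaves the data alone, predicts badly
```

Here `benign` and `noise_maker` both come out at zero while `miscalibrated` is positive, so in this toy setup $L^*$ on its own does not separate the first two.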
During training, the inner optimizer has the same behavior as the benign model: while it’s still dumb it just doesn’t know how to do better; when it becomes smarter and reaches strategic awareness it will be deceptive.
You’re still assuming that you have a perfect consequentialist trapped in a box.
And sure, if you have an AI that accurately guesses whether it’s in training or not, and if in training performs predictions as intended, and if not in training does some sort of dangerous consequentialist thing, then that AI will do well in the loss function and end up doing some sort of dangerous consequentialist thing once deployed.
But that’s not specific to doing some sort of dangerous consequentialist thing. If you’ve got an AI that accurately guesses whether it’s in training or not, and if in training performs predictions as intended, but otherwise throws null pointer exceptions, then that AI will also do well in the loss function but end up throwing null pointer exceptions once deployed. Or if you’ve got an AI that accurately guesses whether it’s in training or not, and if in training performs predictions as intended, but otherwise shows a single image of a paperclip, then again you have an AI that does well in the loss function but ends up showing a single image of a paperclip once deployed.
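A minimal sketch of that observation, assuming a perfect training detector (modelled here as a flag handed to the model) and hypothetical stand-ins for each deployed behavior: every variant gets the same training loss, so the loss cannot tell them apart.

```python
def intended_prediction(x):
    """Stand-in for 'performs predictions as intended'."""
    return 2.0 * x


def make_model(deployed_behavior):
    """Model that predicts as intended whenever it believes it is in training."""
    def model(x, in_training):
        if in_training:
            return intended_prediction(x)
        return deployed_behavior(x)
    return model


def raise_error(x):
    raise RuntimeError("stand-in for a null pointer exception")


# Hypothetical variants that differ only in what they do outside training.
MODELS = {
    "benign": make_model(intended_prediction),
    "crasher": make_model(raise_error),
    "paperclip": make_model(lambda x: "image of a paperclip"),
    "consequentialist": make_model(lambda x: "some world-steering output"),
}

TRAIN_XS = [0.0, 1.0, 2.0, 3.0]


def training_loss(model):
    """Mean squared error against the target y = 2x on the training inputs."""
    total = sum((model(x, in_training=True) - 2.0 * x) ** 2 for x in TRAIN_XS)
    return total / len(TRAIN_XS)


LOSSES = {name: training_loss(model) for name, model in MODELS.items()}
```

Every entry of `LOSSES` is identical, which is the sense in which the argument is not specific to the consequentialist variant.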
The magical step we’re missing is, why would we end up with a perfect consequentialist in a box? That seems like a highly specific hypothesis for what the predictor would do. And if I try to reason about it mechanistically, it doesn’t seem like the standard ways AI gets made, i.e. by gradient descent, would generate that.
Because with gradient descent, you try a bunch of AIs that partly work, and then move in the direction that works better. And so with gradient descent, before you have a perfect consequentialist that can accurately predict whether it’s in training, you’re going to have an imperfect consequentialist that cannot accurately predict whether it’s in training. And this might sometimes accidentally decide that it’s not in training, and output a prediction that’s “intended” to control the world at the cost of some marginal prediction accuracy, and then the gradient is going to notice that something is wrong and is going to turn down the consequentialist. (And yes, this would also encourage deception, but come on, what’s easier—“don’t do advanced planning for how to modify the world and use this to shift your predictions” or “do advanced planning for how to do advanced planning for how to modify the world using your predictions without getting caught”?)
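The same argument as back-of-the-envelope arithmetic, under assumptions invented for the sketch: the consequentialist part has a scalar `strength`, its training detector wrongly reports "deployed" at a fixed `false_deploy_rate`, and each such mistake costs `steer_cost` in prediction loss. The linear form and all parameter names are hypothetical.

```python
def expected_training_loss(strength, false_deploy_rate, base_loss=0.1, steer_cost=1.0):
    """Base prediction loss plus the expected cost of steering by mistake in training."""
    return base_loss + strength * false_deploy_rate * steer_cost


def grad_strength(false_deploy_rate, steer_cost=1.0):
    """Derivative of the expected training loss w.r.t. strength."""
    return false_deploy_rate * steer_cost


def train_strength(false_deploy_rate, strength=1.0, lr=0.5, steps=200):
    """Gradient descent on strength alone, clipped at zero."""
    for _ in range(steps):
        strength = max(0.0, strength - lr * grad_strength(false_deploy_rate))
    return strength


imperfect = train_strength(false_deploy_rate=0.05)  # detector sometimes wrong
perfect = train_strength(false_deploy_rate=0.0)     # detector never wrong
```

With an imperfect detector the strength is pushed to zero; with a perfect one the gradient vanishes and the strength is untouched. The sketch holds the detector's error rate fixed, so it says nothing about the alternative route where training improves the detector instead.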
Re: tameness of $L^*(\mu) = L_\mu(\mu) - \min_m L_\mu(m)$ (using min because L is a loss), some things that come to mind are
a) $L^*$ is bounded below by zero (for models inside the class the min is taken over), so it can be minimized by a strategy that takes over the input channel and induces random noise so that no strategy can do better than random, thus $L_\mu(\mu) \approx \min_m L_\mu(m)$.
This works as an optimum for $L^*$, but then you have to go through another layer of analysis. $L^*$ measures the degree to which something is a fixed point of the training equation, but obviously only a stable fixed point would actually be reached during the training process. So that raises the question: is the optimum you propose here a stable fixed point?
Let’s consider some strategy that is almost perfectly what you describe. Its previous predictions have caused enough chaos to force the variable it has to predict to be almost random—but not quite. It can now spend its marginal resources on two things:
Introduce even more chaos, likely at the expense of immediate predictive power
Predict the little bit of signal, likely at the expense of being unable to make as much chaos
Due to the myopia, gradient descent will favor the latter and completely ignore the former. But the latter moves it away from the fixed point, while the former moves it towards the fixed point. So your proposed fixed point is unstable.
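The instability can be sketched with a one-parameter toy, with every modelling choice assumed for illustration: `chaos_share` is the fraction of resources spent on chaos, the remaining signal in the data is `1 - chaos_share`, and the immediate loss is the part of that signal left unpredicted. "Myopic" here means the gradient treats the current signal as fixed and ignores how `chaos_share` changes future signal.

```python
def myopic_step(chaos_share, lr=0.1):
    """One myopic gradient step on the share of resources spent on chaos."""
    signal = 1.0 - chaos_share  # assumed: less chaos leaves more signal in the data
    # Immediate loss with signal held fixed is signal * chaos_share
    # (the unpredicted part of the signal), so its gradient is just `signal`.
    grad = signal
    new_share = chaos_share - lr * grad
    return min(1.0, max(0.0, new_share))


def run(chaos_share, steps=100):
    """Iterate the myopic update from a given starting share."""
    for _ in range(steps):
        chaos_share = myopic_step(chaos_share)
    return chaos_share


at_fixed_point = run(1.0)     # exactly all-chaos: gradient is zero, nothing moves
near_fixed_point = run(0.99)  # slightly off: each step moves further away
```

Starting exactly at the all-chaos point nothing changes, but starting slightly off it the distance from that point grows by a factor of `1 + lr` per step until the share hits zero, which is what an unstable fixed point looks like in this toy.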
b) Depending on which model class the min is taken over, the model can get less than zero loss by hacking its environment to get more compute (thus escaping the model class in the min)
The other models would also get access to this compute; that’s sort of the point of the model.