Thanks!
The kind of incentive argument I’m trying to make here is “If the model isn’t doing X, then by doing X a little bit it will score better on the objective, and by doing X more it will score even better on the objective, etc. etc.” That’s what I mean by “X is incentivized”. (Or more generally, that gradient descent systematically tends to lead to trained models that do X.) I guess my description in the article was not great.
So in general, I think deceptive alignment is “incentivized” in this sense. I think that, in the RL scenarios you talked about in your paper, it’s often the case that building a better and better deceptively-aligned mesa-optimizer will progressively increase the score on the objective function.
Then my argument here is that 4th-wall-breaking processing is not incentivized in that sense: if the trained model isn’t doing 4th-wall-breaking processing at all right now, I think it does not do any better on the objective by starting to do a little bit of 4th-wall-breaking processing. (At least that’s my hunch.)
(I do agree that if a deceptively-aligned mesa-optimizer with a 4th-wall-breaking objective magically appeared as the trained model, it would do well on the objective. I’m arguing instead that SGD is unlikely to create such a thing.)
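Here’s a toy numerical sketch of what I mean by “incentivized” (the loss functions and numbers are invented for illustration, not taken from any real training setup):

```python
# Toy sketch (invented numbers): treat "how much the model does X" as a
# single scalar knob x, and ask whether nudging x up from zero improves
# the training objective, i.e. lowers a made-up loss.

def loss_when_x_is_incentivized(x):
    # Hypothetical case: doing more X steadily helps on the objective,
    # so gradient descent keeps pushing x upward.
    return 1.0 - 0.5 * x

def loss_when_x_is_not_incentivized(x):
    # Hypothetical case: the objective is indifferent to X, so a model
    # that isn't doing X at all feels no pull toward starting.
    return 1.0 + 0.0 * x

def slope_at_zero(loss_fn, eps=1e-6):
    # Finite-difference estimate of d(loss)/dx at x = 0.
    return (loss_fn(eps) - loss_fn(0.0)) / eps

print(slope_at_zero(loss_when_x_is_incentivized))      # about -0.5: a little X already helps
print(slope_at_zero(loss_when_x_is_not_incentivized))  # about 0.0: no pull toward starting X
```

My hunch is that 4th-wall-breaking processing is in the second category: the slope at zero is zero.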
Oh, I guess you’re saying something different: that even a deceptive mesa-optimizer which is entirely doing within-universe processing is nevertheless scary. So that would by definition be an algorithm with the property “no operation in the algorithm is likelier to happen vs not happen specifically because of anticipated downstream chains of causation that pass through things in the real world”. So I can say categorically: such an algorithm won’t hurt anyone (except by freak accident), won’t steal processing resources, won’t intervene when I go for the off-switch, etc., right? So I don’t see “arbitrarily scary”, or scary at all, right? Sorry if I’m confused…
Oh, I guess you’re saying something different: that even a deceptive mesa-optimizer which is entirely doing within-universe processing is nevertheless scary. So that would by definition be an algorithm with the property “no operation in the algorithm is likelier to happen vs not happen specifically because of anticipated downstream chains of causation that pass through things in the real world”.

Yep, that’s right.

So I can say categorically: such an algorithm won’t hurt anyone (except by freak accident), won’t steal processing resources, won’t intervene when I go for the off-switch, etc., right?
No, not at all—just because an algorithm wasn’t selected based on causing something to happen in the real world doesn’t mean it won’t in fact try to make things happen in the real world. In particular, the reason that I expect deception in practice is not primarily because it’ll actually be selected for, but primarily just because it’s simpler, and so it’ll be found despite the fact that there wasn’t any explicit selection pressure in favor of it. See: “Does SGD Produce Deceptive Alignment?”
I think you’re misunderstanding (or I am).
I’m trying to make a two step argument:
(1) SGD under such-and-such conditions will lead to a trained model that does exclusively within-universe processing [this step is really just a low-confidence hunch but I’m still happy to discuss and defend it]
(2) trained models that do exclusively within-universe processing are not scary [this step I have much higher confidence in]
If you’re going to disagree with (2), then SGD / “what the model was selected for” is not relevant.
“Doing exclusively within-universe processing” is a property of the internals of the trained model, not just the input-output behavior. If running the trained model involves a billion low-level GPU instructions, this property would correspond to the claim that each and every one of those billion GPU instructions is being executed for reasons that are unrelated to any anticipated downstream real-world consequences of that GPU instruction. (where “real world” = everything except the future processing steps inside the algorithm itself.)
I mean, I guess it depends on your definition of “unrelated to any anticipated downstream real-world consequences.” Does the reason “it’s the simplest way to solve the problem in the training environment” count as “unrelated” to real-world consequences? My point is that it seems like it should, since it’s just about description length, not real-world consequences—but that it could nevertheless yield arbitrarily bad real-world consequences.
I think it can be simultaneously true that, say:
“weight #9876 is 1.2345 because out of all possible models, the highest-scoring model is one where weight #9876 happens to be 1.2345”
“weight #9876 is 1.2345 because the hardware running this model has a RowHammer vulnerability, and this weight is part of a strategy that exploits that. (So in a counterfactual universe where we made chips slightly differently such that there was no such thing as RowHammer, then weight #9876 would absolutely NOT be 1.2345.)”
The second one doesn’t stop being true because the first one is also true. They can both be true, right?
In other words, “the model weights are what they are because it’s the simplest way to solve the problem” doesn’t eliminate other “why” questions about all the details of the model. There’s still some story about why the weights (and the resulting processing steps) are what they are—it may be a very complicated story, but there should (I think) still be a fact of the matter about whether that story involves “the algorithm itself having downstream impacts on the future in non-random ways that can’t be explained away by the algorithm logic itself or the real-world things upstream of the algorithm”. Or something like that, I think.
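To see how the two “because”s can coexist, here’s a toy sketch in the same spirit as the weight #9876 example (my own framing; the candidate weights and scores are made up):

```python
# Toy model (invented scores): a search over candidate values for "weight #9876".

def best_weight(world_has_rowhammer):
    candidates = [0.0, 1.2345]

    def score(w):
        s = 1.0                      # score from ordinary task performance
        if w == 1.2345 and world_has_rowhammer:
            s += 0.1                 # extra score from a RowHammer-exploiting strategy
        return s

    # Statement 1 holds by construction: the search returns whichever
    # candidate scores highest.
    return max(candidates, key=score)

print(best_weight(world_has_rowhammer=True))   # 1.2345 in the actual world
print(best_weight(world_has_rowhammer=False))  # 0.0 in the no-RowHammer world, so statement 2
                                               # also holds: the weight depends on that real-world fact
```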
Sure, that’s fair. But in the post, you argue that this sort of non-in-universe-processing won’t happen because there’s no incentive for it:

It seems like there’s no incentive whatsoever for a postdictive learner to have any concept that the data processing steps in the algorithm have any downstream impacts, besides, y’know, processing data within the algorithm. It seems to me like there’s a kind of leap to start taking downstream impacts to be a relevant consideration, and there’s nothing in gradient descent pushing the algorithm to make that leap, and there doesn’t seem to be anything about the structure of the domain or the reasoning it’s likely to be doing that would lead to making that leap, and it doesn’t seem like the kind of thing that would happen by random noise, I think.
However, if there’s another “why” for why the model is doing non-in-universe-processing that is incentivized—e.g. simplicity—then I think that makes this argument no longer hold.
One thing is, I’m skeptical that a deceptive non-in-universe-processing model would be simpler for the same performance. Or at any rate, there’s a positive case for the simplicity of deceptive alignment, and I find that case very plausible for RL robots, but I don’t think it applies to this situation. The positive case for simplicity of deceptive models for RL robots is something like (IIUC):

The robot is supposed to be really good at manufacturing widgets (for example), and that task requires real-world foresighted planning, because sometimes it needs to substitute different materials, negotiate with suppliers and customers, repair itself, etc. Given that the model definitely needs the capability of real-world foresighted planning and self-awareness and so on, the simplest high-performing model is plausibly one that applies those capabilities towards a maximally simple goal, like “making its camera pixels all white” or whatever, and then that preserves performance because of instrumental convergence.
(Correct me if I’m misunderstanding!)
If that’s the argument, it seems not to apply here, because this task doesn’t require real-world foresighted planning.
I expect that a model that can’t do any real-world planning at all would be simpler than a model that can. In the RL robot example, it doesn’t matter, because a model that can’t do any real-world planning at all would do terribly on the objective, so who cares if it’s simpler. But here, it would be equally good at the objective, I think, and simpler.
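To make that bookkeeping concrete, here’s a toy comparison (the model names, description lengths, and scores are all invented):

```python
# Toy comparison (invented numbers) of two candidate models under two objectives.

candidates = {
    # name: (description_length, rl_robot_score, postdiction_score)
    "model_with_real_world_planning":    (120, 0.95, 0.90),
    "model_without_real_world_planning": ( 80, 0.10, 0.90),
}

def pick(score_index):
    # Prefer the higher score; break ties by the shorter description length.
    return min(candidates, key=lambda name: (-candidates[name][score_index],
                                             candidates[name][0]))

print(pick(1))  # RL robot objective: the planning model wins despite being more complex
print(pick(2))  # postdiction objective: scores tie, so the simpler no-planning model wins
```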
(A possible objection would be: “real-world foresighted planning” isn’t a separate thing that adds to model complexity, instead it naturally falls out of other capabilities that are necessary for postdiction like “building predictive models” and “searching over strategies” and whatnot. I think I would disagree with that objection, but I don’t have great certainty here.)
Yup, that’s basically my objection.