I think it can be simultaneously true that, say:
“weight #9876 is 1.2345 because out of all possible models, the highest-scoring model is one where weight #9876 happens to be 1.2345”
“weight #9876 is 1.2345 because the hardware running this model has a RowHammer vulnerability, and this weight is part of a strategy that exploits that. (So in a counterfactual universe where we made chips slightly differently such that there was no such thing as RowHammer, then weight #9876 would absolutely NOT be 1.2345.)”
The second one doesn’t stop being true because the first one is also true. They can both be true, right?
In other words, “the model weights are what they are because it’s the simplest way to solve the problem” doesn’t eliminate other “why” questions about the details of the model. There’s still some story about why the weights (and the resulting processing steps) are what they are. It may be a very complicated story, but I think there should still be a fact of the matter about whether that story involves “the algorithm itself having downstream impacts on the future in non-random ways that can’t be explained away by the algorithm logic itself or the real-world things upstream of the algorithm”, or something like that.
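Here is a toy sketch of how both "why" answers can hold at once. Everything in it is made up for illustration: the `score` function, the candidate weights, and the `hardware_has_quirk` flag (standing in for something like RowHammer) are assumptions of mine, not a claim about any real training setup.

```python
# Hypothetical one-weight "model": we pick whichever candidate weight scores highest.
CANDIDATES = [0.5, 1.0, 1.2345, 2.0]


def score(weight: float, hardware_has_quirk: bool) -> float:
    """Made-up training score for a model consisting of a single weight."""
    # Ordinary in-universe fit: peaks at weight == 1.0.
    base = -(weight - 1.0) ** 2
    # Assumption: on quirky hardware, exactly one weight value unlocks an
    # exploit that raises the measured score.
    exploit_bonus = 0.5 if hardware_has_quirk and abs(weight - 1.2345) < 1e-9 else 0.0
    return base + exploit_bonus


def best_weight(hardware_has_quirk: bool) -> float:
    """Explanation 1: the weight is whatever the highest-scoring model has."""
    return max(CANDIDATES, key=lambda w: score(w, hardware_has_quirk))


# Explanation 2: the answer counterfactually depends on the hardware quirk.
print(best_weight(hardware_has_quirk=True))   # 1.2345
print(best_weight(hardware_has_quirk=False))  # 1.0
```

In both worlds, "it's the argmax" is a true explanation of the weight. In the first world, "it's part of a strategy that exploits the hardware" is also true, and flipping the flag is the counterfactual test for that.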
Sure, that’s fair. But in the post, you argue that this sort of non-in-universe-processing won’t happen because there’s no incentive for it:
It seems like there’s no incentive whatsoever for a postdictive learner to have any concept that the data processing steps in the algorithm have any downstream impacts, besides, y’know, processing data within the algorithm. It seems to me like there’s a kind of leap to start taking downstream impacts to be a relevant consideration, and there’s nothing in gradient descent pushing the algorithm to make that leap, and there doesn’t seem to be anything about the structure of the domain or the reasoning it’s likely to be doing that would lead to making that leap, and it doesn’t seem like the kind of thing that would happen by random noise, I think.
However, if there is some other incentivized reason for the model to do non-in-universe-processing (e.g. simplicity), then I think that argument no longer holds.
One thing is, I’m skeptical that a deceptive non-in-universe-processing model would be simpler for the same performance. Or at any rate, there’s a positive case for the simplicity of deceptive alignment, and I find that case very plausible for RL robots, but I don’t think it applies to this situation. The positive case for simplicity of deceptive models for RL robots is something like (IIUC):
The robot is supposed to be really good at manufacturing widgets (for example), and that task requires real-world foresighted planning, because sometimes it needs to substitute different materials, negotiate with suppliers and customers, repair itself, etc. Given that the model definitely needs to have the capability for real-world foresighted planning, self-awareness, and so on, the simplest high-performing model is plausibly one that applies those capabilities towards a maximally simple goal, like “making its camera pixels all white” or whatever, and that preserves performance because of instrumental convergence.
(Correct me if I’m misunderstanding!)
If that’s the argument, it seems not to apply here, because this task doesn’t require real-world foresighted planning.
I expect that a model that can’t do any real-world planning at all would be simpler than a model that can. In the RL robot example, it doesn’t matter, because a model that can’t do any real-world planning at all would do terribly on the objective, so who cares if it’s simpler. But here, it would be equally good at the objective, I think, and simpler.
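To make the structure of that argument concrete, here is a minimal sketch. The complexity numbers and performance values are invented, and the sketch simply builds in my assumption that planning ability costs extra complexity; it illustrates the shape of the argument, not evidence for it.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    complexity: int  # hypothetical description length; lower is simpler
    can_plan: bool   # has real-world foresighted planning


def performance(c: Candidate, task_needs_planning: bool) -> float:
    """Made-up score: a non-planner does terribly only if the task needs planning."""
    if task_needs_planning and not c.can_plan:
        return 0.1
    return 1.0


def select(cands: Sequence[Candidate], task_needs_planning: bool) -> Candidate:
    """Pick the best performer, breaking ties in favor of the simpler model."""
    return max(cands, key=lambda c: (performance(c, task_needs_planning), -c.complexity))


CANDS = [
    Candidate("non-planner", complexity=10, can_plan=False),
    Candidate("planner", complexity=15, can_plan=True),  # assumed: planning adds complexity
]

print(select(CANDS, task_needs_planning=True).name)   # planner (the RL robot case)
print(select(CANDS, task_needs_planning=False).name)  # non-planner (the postdiction case)
```

If planning came for free, i.e. if the two complexity numbers were equal, the tie-break would no longer favor the non-planner, so the conclusion hinges on that one assumption.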
(A possible objection would be: “real-world foresighted planning” isn’t a separate thing that adds to model complexity; instead, it naturally falls out of other capabilities that are necessary for postdiction, like “building predictive models” and “searching over strategies” and whatnot. I think I would disagree with that objection, but I’m not very certain here.)
Yup, that’s basically my objection.