Sure, that’s fair. But in the post, you argue that this sort of non-in-universe-processing won’t happen because there’s no incentive for it:
It seems like there’s no incentive whatsoever for a postdictive learner to have any concept that the data processing steps in the algorithm have any downstream impacts, besides, y’know, processing data within the algorithm. It seems to me like there’s a kind of leap to start taking downstream impacts to be a relevant consideration, and there’s nothing in gradient descent pushing the algorithm to make that leap, and there doesn’t seem to be anything about the structure of the domain or the reasoning it’s likely to be doing that would lead to making that leap, and it doesn’t seem like the kind of thing that would happen by random noise, I think.
However, if there’s another reason the model is doing non-in-universe-processing that is incentivized (e.g. simplicity), then I think this argument no longer holds.
One thing is, I’m skeptical that a deceptive non-in-universe-processing model would be simpler for the same performance. Or at any rate, there’s a positive case for the simplicity of deceptive alignment, and I find that case very plausible for RL robots, but I don’t think it applies to this situation. The positive case for simplicity of deceptive models for RL robots is something like (IIUC):
The robot is supposed to be really good at manufacturing widgets (for example), and that task requires real-world foresighted planning, because sometimes it needs to substitute different materials, negotiate with suppliers and customers, repair itself, etc. Given that the model definitely needs to have the capability for real-world foresighted planning and self-awareness and so on, the simplest high-performing model is plausibly one that applies those capabilities towards a maximally simple goal, like “making its camera pixels all white” or whatever, and then that preserves performance because of instrumental convergence.
(Correct me if I’m misunderstanding!)
If that’s the argument, it seems not to apply here, because this task doesn’t require real-world foresighted planning.
I expect that a model that can’t do any real-world planning at all would be simpler than a model that can. In the RL robot example, it doesn’t matter, because a model that can’t do any real-world planning at all would do terribly on the objective, so who cares if it’s simpler. But here, it would be equally good at the objective, I think, and simpler.
(A possible objection would be: “real-world foresighted planning” isn’t a separate thing that adds to model complexity; instead, it naturally falls out of other capabilities that are necessary for postdiction, like “building predictive models” and “searching over strategies” and whatnot. I think I would disagree with that objection, but I don’t have great certainty here.)
Yup, that’s basically my objection.