Here’s a simple toy model. Suppose you have two agents that internally compute their actions as follows (perhaps with actual argmax replaced with some smarter search algorithm, but still basically structured as below):
$$M_{\text{deceptive}}(x) = \operatorname{argmax}_a \, \mathbb{E}\left[\sum_i U_{\text{deceptive}}(s_i) \,\middle|\, a\right]$$
$$M_{\text{aligned}}(x) = \operatorname{argmax}_a \, \mathbb{E}\left[\sum_i U_{\text{aligned}}(s_i) \,\middle|\, a\right]$$
Then, comparing the K-complexity of the two models, we get
$$K(M_{\text{aligned}}) - K(M_{\text{deceptive}}) \approx K(U_{\text{aligned}}) - K(U_{\text{deceptive}})$$
The problem is that both $M_{\text{deceptive}}$ and $M_{\text{aligned}}$ will produce behavior that looks aligned on the training distribution, but $U_{\text{aligned}}$ has to be much more complex. To see this, note that essentially any $U_{\text{deceptive}}$ will yield good training performance, because the model will choose to act deceptively during training; whereas if you want good training performance without deception, then $U_{\text{aligned}}$ has to actually encode the full intended objective, which is likely to make it quite complicated.
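To make the toy model concrete, here is a minimal runnable sketch (Python; the two-action world, the overseer behavior, and all utility values are hypothetical stand-ins invented for illustration, not anything from the OP). Both agents run the same argmax over expected utility, agree on the training distribution, and diverge once the deceptive agent can tell training has ended:

```python
ACTIONS = ["cooperate", "defect"]

def rollout(action, in_training):
    """Hypothetical environment: the sequence of states s_i that follow an action."""
    if in_training and action == "defect":
        # Misaligned behavior during training gets caught and the model is
        # modified, so its own goal is never achieved afterwards.
        return ["defect_caught", "model_modified"]
    if action == "defect":
        return ["defect_unpunished"]
    return ["cooperated"]

# U_aligned has to spell out the full intended objective (it is only small here
# because the toy world is tiny).
U_ALIGNED = {"cooperated": 1.0, "defect_caught": 0.0,
             "model_modified": 0.0, "defect_unpunished": 0.0}

# Essentially *any* simple proxy goal works as U_deceptive (here: "defect without
# being modified"); the deception falls out of the argmax, not out of the utility.
U_DECEPTIVE = {"cooperated": 0.0, "defect_caught": 0.0,
               "model_modified": -1.0, "defect_unpunished": 1.0}

def M(utility, in_training):
    # argmax_a E[ sum_i U(s_i) | a ]  (the expectation is trivial in this toy world)
    return max(ACTIONS, key=lambda a: sum(utility[s] for s in rollout(a, in_training)))

# Indistinguishable on the training distribution:
assert M(U_ALIGNED, in_training=True) == M(U_DECEPTIVE, in_training=True) == "cooperate"
# Divergent once the model can tell that training has ended:
assert M(U_ALIGNED, in_training=False) == "cooperate"
assert M(U_DECEPTIVE, in_training=False) == "defect"
```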
I understand how this model explains why agents become unaligned under distributional shift. That’s something I never disputed. However, I don’t understand how this model applies to my proposal. In my proposal, there is no distributional shift, because (thanks to the realizability assumption) the real environment is sampled from the same prior that is used for training with C. The model can’t choose to act deceptively during training, because it can’t distinguish between training and deployment. Moreover, the objective I described is not complicated.
Yeah, that’s a fair objection. My response is just that I think preventing a model from being able to distinguish training from deployment is likely to be impossible for anything competitive.
Okay, but why? I think the reason you have this intuition is that the realizability assumption is false. But then you should concede that the main weakness of the OP is the realizability assumption, rather than the difference between deep learning and Bayesianism.
Perhaps I just totally don’t understand what you mean by realizability, but I fail to see how realizability is relevant here. As I understand it, realizability just says that the true model has some non-zero prior probability—but that doesn’t matter (at least for the MAP, which I think is a better model than the full posterior for how SGD actually works) as long as there’s some deceptive model with greater prior probability that’s indistinguishable on the training distribution, as in my simple toy model from earlier.
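A tiny numeric sketch of that point (all priors and likelihoods below are made-up illustrative numbers): if both models fit the training data exactly, the posterior just rescales the prior, so MAP selects whichever model has the larger prior, and realizability in the "non-zero prior" sense does nothing to rule out the deceptive one:

```python
# Both models reproduce the training data perfectly, so likelihood = 1 for each;
# the (hypothetical) prior favors the deceptive model because it is simpler,
# e.g. prior ~ 2^{-K(M)} with K(M_deceptive) < K(M_aligned).
models = {
    "M_deceptive": {"prior": 0.20, "train_likelihood": 1.0},
    "M_aligned":   {"prior": 0.05, "train_likelihood": 1.0},  # non-zero prior: "realizable"
}

posterior = {name: m["prior"] * m["train_likelihood"] for name, m in models.items()}
map_choice = max(posterior, key=posterior.get)
print(map_choice)  # -> "M_deceptive", even though the aligned model is in the class
```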
When talking about uniform (worst-case) bounds, realizability just means the true environment is in the hypothesis class, but in a Bayesian setting (like in the OP) it means that our bounds scale with the probability of the true environment in the prior. Essentially, it means we can pretend the true environment was sampled from the prior. So, if (by design) training works by sampling environments from the prior, and (by realizability) deployment also consists of sampling an environment from the same prior, training and deployment are indistinguishable.
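Here is a minimal simulation of that indistinguishability claim (the prior, the environments, and the observation model are all assumptions invented for illustration): training episodes are generated by sampling an environment from the prior, and under realizability the deployment episode is generated by exactly the same process, so no statistic of the model's observations can separate the two:

```python
import random

# Hypothetical prior over environments.
PRIOR = {"env_A": 0.5, "env_B": 0.3, "env_C": 0.2}

def sample_environment(rng):
    envs, weights = zip(*PRIOR.items())
    return rng.choices(envs, weights=weights, k=1)[0]

def episode(rng, length=5):
    # One episode: draw an environment from the prior, then observe it.
    env = sample_environment(rng)
    return [(env, rng.randint(0, 9)) for _ in range(length)]

rng = random.Random(0)
training_episodes = [episode(rng) for _ in range(1000)]
deployment_episode = episode(rng)

# The deployment episode is exchangeable with the training episodes: both are
# i.i.d. draws from the same process, so there is no observable feature the
# model could condition on to "behave well in training, defect in deployment".
```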
Sure—by that definition of realizability, I agree that’s where the difficulty is. Though I would seriously question the practical applicability of such an assumption.