Isn’t the Step 1 objective (the unnormalized posterior log probability of (θ₁, θ₂)) maximized at θ₁ = θ₂ = argmax(L + prior)? Also, I don’t see what this objective has to do with learning a world model.
Also, I don’t see what this objective has to do with learning a world model.
The idea is to address a particular reason that your learned model would “copy a human” rather than “try to answer the question well.” Namely, the model already contains human-predictors, so building extra machinery to answer questions (basically translating between the world model and natural language) would be less efficient than just using the existing human predictor. The hope is that this alternative loss allows you to use the translation machinery to compress the humans, so that it’s not disfavored by the prior.
I don’t think it’s intrinsically related to learning a world model, it’s just an attempt to fix a particular problem.
To the extent that there is a problem with the proposed approach, either a reason that this isn’t a real problem in the standard approach, or a reason that this proposed approach couldn’t address the problem (or would inevitably introduce some other problem), I’m interested in that.
Isn’t the Step 1 objective (the unnormalized posterior log probability of (θ₁, θ₂)) maximized at θ₁ = θ₂ = argmax(L + prior)?
Why would it be maximized there? Isn’t it at least better to make θ₁ = (θ₂ + θ₀)/2?
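Concretely, just to check the arithmetic (assuming for illustration that both log p(θ₁) and log p(θ₂|θ₁) are Gaussian, i.e. L2 penalties of equal width, with the θ₁ prior centred at some reference point θ₀):

$$
\min_{\theta_1}\;\left[\|\theta_1-\theta_0\|^2+\|\theta_2-\theta_1\|^2\right]
=\tfrac12\,\|\theta_2-\theta_0\|^2
\quad\text{at}\quad
\theta_1=\tfrac{\theta_0+\theta_2}{2},
$$

whereas θ₁ = θ₂ pays the full ‖θ₂ − θ₀‖², so the all-in-one-place solution is not the optimum.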
And then in the section I’m trying to argue that the final term (the partition function) in the loss means that you can potentially get a lower loss by having θ₁ push apart the two heads in such a way that improving the quality of the model pushes them back together. I’m interested in anything that seems wrong in that argument.
(I don’t particularly expect this exact formulation to work, e.g. because the L2 regularizer pushes θ₁ to adjust each parameter halfway, while the intuitive argument kind of relies on it being arbitrary what you put into θ₁ versus θ₂, as it would be under something more like an L1 regularizer. But I’m pretty interested in this general approach.)
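To spell out that L2-vs-L1 point for a single coordinate: if the total change in that coordinate is a, split as x in θ₁ and a − x in θ₂ − θ₁, then

$$
\min_x\;\left[x^2+(a-x)^2\right]=\tfrac{a^2}{2}\ \text{ at } x=\tfrac{a}{2},
\qquad\text{while}\qquad
|x|+|a-x|=|a|\ \text{ for every } x\in[0,a],
$$

so the L2 penalty singles out the halfway split, while an L1-style penalty is indifferent to how the change is divided between θ₁ and θ₂.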
Two caveats were: (i) this isn’t actually going to end up making any alternative models lower loss, it’s just going to level the playing field so that a bunch of potential models have similar loss (rather than there being an inductive bias in favor of the bad models); (ii) for that to be plausible you need a stop-grad on one of the heads in the computation of C, and I maybe shouldn’t have pushed that detail so late.
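To make caveat (ii) concrete, here’s a rough sketch of the kind of loss I have in mind. It’s illustrative only: the linear heads, the soft agreement check standing in for C, the Gaussian widths, and the sampled estimate of the partition function are stand-ins rather than the actual proposal, and all the helper names (heads, soft_agreement, neg_log_posterior) are made up for this sketch.

```python
import torch

def heads(theta, x):
    # Split theta into two linear "heads" and apply both to the probe inputs x.
    w1, w2 = theta.chunk(2)
    return x @ w1, x @ w2

def soft_agreement(theta, probes, stop_grad_second_head=True):
    # Soft stand-in for the condition C: the two heads give similar answers on probes.
    out1, out2 = heads(theta, probes)
    if stop_grad_second_head:
        out2 = out2.detach()  # caveat (ii): stop-grad on one of the heads
    return torch.exp(-((out1 - out2) ** 2).mean())

def neg_log_posterior(theta1, theta2, probes, data_nll_fn,
                      theta0=None, sigma1=1.0, sigma2=0.1, n_samples=64):
    theta0 = torch.zeros_like(theta1) if theta0 is None else theta0
    reg1 = ((theta1 - theta0) ** 2).sum() / (2 * sigma1 ** 2)  # -log p(theta1)
    reg2 = ((theta2 - theta1) ** 2).sum() / (2 * sigma2 ** 2)  # -log p(theta2 | theta1), unnormalized
    # Partition function Z(theta1): how often the agreement condition holds for
    # theta2 sampled near theta1, estimated by sampling.
    samples = theta1 + sigma2 * torch.randn(n_samples, theta1.numel())
    z_hat = torch.stack([soft_agreement(s, probes) for s in samples]).mean()
    # log Z enters the loss with a + sign, so a theta1 around which agreement is
    # *rare* lowers the loss -- the "push the heads apart" effect described above.
    return reg1 + reg2 + torch.log(z_hat + 1e-9) + data_nll_fn(theta2)

# Toy usage: 4-dimensional probes, so theta has 8 entries (4 per head).
probes = torch.randn(16, 4)
theta1 = torch.randn(8, requires_grad=True)
theta2 = torch.randn(8, requires_grad=True)
loss = neg_log_posterior(theta1, theta2, probes, data_nll_fn=lambda t: (t ** 2).sum())
loss.backward()
```

The detach() is my guess at where the caveat-(ii) stop-grad goes; everything else is just the three log-probability terms plus a sampled estimate of log Z.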
D’oh, re: the optimum of the objective, I now see that the solution is nontrivial. Here’s my current understanding.
Intuitively, the MAP version of the objective says: find me a simple model θ₁ such that there’s a more-complex θ₂ with high likelihood under p(θ₂|θ₁) (which corresponds to sampling θ₂ near θ₁ until θ₂ satisfies the head-agreement condition) and high data-likelihood p(data|θ₂).
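Writing that out (this is just my reading; the exact form of the conditional prior is an assumption):

$$
\max_{\theta_1,\theta_2}\;\log p(\theta_1)+\log p(\theta_2\mid\theta_1)+\log p(\text{data}\mid\theta_2),
\qquad
p(\theta_2\mid\theta_1)\;\propto\;\mathcal{N}(\theta_2;\theta_1,\sigma^2)\,\mathbf{1}[C(\theta_2)],
$$

where the normalizer of p(θ₂|θ₁) is the partition-function term from before.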
And this connects to the previous argument about world models and language as follows: we want θ₁ to contain half a world model, and we want θ₂ to contain the full world model, have high data-likelihood (for one of the heads), and have the two heads agree. Based on Step 1, the problem is still pretty underconstrained, but maybe that’s resolved in Step 2.