I was very happy to find this post—it clarifies & names a concept I’ve been thinking about for a long time. However, I have confusions about the maths here:
> Mathematically, direct optimization is your standard AIXI-like optimization process. For instance, suppose we are doing direct variational inference optimization to find a Bayesian posterior parameter $\theta$ from a data-point $x$, the mathematical representation of this is:
>
> $$\theta^*_{\text{direct}} = \operatorname{argmin}_\theta \, \mathrm{KL}\left[q(\theta; x) \,\|\, p(x, \theta)\right]$$
>
> By contrast, the amortized objective optimizes some other set of parameters $\phi$ over a function approximator $\hat{\theta} = f_\phi(x)$ which directly maps from the data-point to an estimate of the posterior parameters $\hat{\theta}$. We then optimize the parameters of the function approximator $\phi$ across a whole dataset $D = \{(x_1, \theta^*_1), (x_2, \theta^*_2), \ldots\}$ of data-point and parameter examples.
>
> $$\theta^*_{\text{amortized}} = \operatorname{argmin}_\phi \, \mathbb{E}_{p(D)}\left[\mathcal{L}(\theta^*, f_\phi(x))\right]$$
First of all, I don’t see how the given equation for direct optimization makes sense. $\mathrm{KL}\left[q(\theta; x) \,\|\, p(x, \theta)\right]$ is comparing a distribution over $\theta$ with a joint distribution over $(x, \theta)$. Should this be $\mathrm{KL}\left[q_\psi(\theta) \,\|\, p(\theta \mid x)\right]$ for variational inference (where $\psi$ is whatever we’re using to parametrize the variational family), and $\mathrm{KL}\left[q(\theta \mid x) \,\|\, p(\theta \mid x)\right]$ in general?
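Possibly the joint is meant to enter via the standard ELBO identity, which holds for any $q$:

$$\mathrm{KL}\left[q(\theta) \,\|\, p(\theta \mid x)\right] = \mathbb{E}_{q(\theta)}\left[\log \frac{q(\theta)}{p(x, \theta)}\right] + \log p(x)$$

so minimizing the expected log-ratio against the joint matches minimizing the KL to the posterior up to the constant $\log p(x)$; but that log-ratio term is not itself a KL divergence, since $p(x, \theta)$ is not normalized over $\theta$.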
Secondly, why the focus on variational inference for defining direct optimization in the first place? Direct optimization is introduced as (emphasis mine):
> Direct optimization occurs when optimization power is applied immediately and directly when engaged with a new situation to explicitly compute an on-the-fly optimal response – for instance, when directly optimizing against some kind of reward function. The classic example of this is planning and Monte-Carlo-Tree-Search (MCTS) algorithms [...]
This does not sound like we’re talking about algorithms that update parameters. If I had to put the above in maths, it just sounds like an argmin:
$$g(x) = \operatorname{argmin}_{y \in A} L(y)$$
where $g$ is your AI system, $A$ is whatever action space it can explore (you can make $A$ vary based on how much compute you’re willing to spend, like with MCTS depth), and $L$ is some loss function (it could be a reward function with a flipped sign, but I’m trying to keep it comparable to the direct optimization equation).
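Concretely, a minimal sketch of what I mean (`candidate_actions` and `loss` are hypothetical stand-ins, and the budget-limited enumeration is just a crude proxy for something like MCTS depth):

```python
# Minimal sketch of direct optimization: all the optimization effort is
# spent at query time, searching the action space for the current input.
# `candidate_actions` and `loss` are hypothetical stand-ins, not from the post.

def g(x, candidate_actions, loss, budget=1000):
    """Search the (budget-limited) action space A on the fly and return the argmin."""
    best_y, best_loss = None, float("inf")
    for y in candidate_actions(x, budget):  # A can grow with compute, cf. MCTS depth
        current = loss(y)
        if current < best_loss:
            best_y, best_loss = y, current
    return best_y
```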
Also, the RHS of the amortized optimization equation is about defining a $\phi$, i.e. the parameters in your function approximator $f$, but then the LHS calls it $\theta^*_{\text{amortized}}$, which is confusing to me. I also don’t understand why the loss function is taking in parameters $\theta^*$, or why the dataset contains parameters (is $\theta$ being used throughout to stand for outputs rather than model parameters?).
To me, the natural way to phrase this concept would instead be as
$$g(x) = f_{\hat{\phi}}(x)$$
where $g$ is your AI system, and $\hat{\phi} = \operatorname{argmin}_\phi \, \mathbb{E}_{(x, y) \sim D}\left[L(y, f_\phi(x))\right]$, with the dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots\}$.
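In code, the contrast is just that the argmin runs once over the dataset, and each new query is a plain function evaluation. A minimal sketch, assuming a linear $f_\phi$ and squared loss (both illustrative choices, not from the post):

```python
import numpy as np

# Minimal sketch of the amortized version: fit phi once over the dataset D,
# then answering a new x is a single cheap forward pass, with no per-query search.
# The linear f_phi and squared loss are illustrative assumptions, not from the post.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # dataset inputs  x_i
Y = X @ np.array([1.0, -2.0, 0.5])         # dataset targets y_i

# phi_hat = argmin_phi E_{(x,y)~D}[ L(y, f_phi(x)) ], closed form for least squares
phi_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

def g(x):
    """g(x) = f_{phi_hat}(x): no optimization happens at query time."""
    return x @ phi_hat

print(g(np.array([0.2, 0.1, -0.3])))
```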
I’d be curious to hear any expansion of the motivation behind the exact maths in the post, or any way in which my version is misleading.