I was very happy to find this post—it clarifies & names a concept I’ve been thinking about for a long time. However, I have confusions about the maths here:
> Mathematically, direct optimization is your standard AIXI-like optimization process. For instance, suppose we are doing direct variational inference optimization to find a Bayesian posterior parameter $\theta$ from a data-point $x$, the mathematical representation of this is:
>
> $$\theta^*_{\text{direct}} = \operatorname{argmin}_\theta \, \mathrm{KL}\left[q(\theta; x) \,\|\, p(x, \theta)\right]$$
>
> By contrast, the amortized objective optimizes some other set of parameters $\phi$ over a function approximator $\hat{\theta} = f_\phi(x)$ which directly maps from the data-point to an estimate of the posterior parameters $\hat{\theta}$. We then optimize the parameters of the function approximator $\phi$ across a whole dataset $D = \{(x_1, \theta^*_1), (x_2, \theta^*_2), \ldots\}$ of data-point and parameter examples.
>
> $$\theta^*_{\text{amortized}} = \operatorname{argmin}_\phi \, \mathbb{E}_{p(D)}\left[\mathcal{L}(\theta^*, f_\phi(x))\right]$$
First of all, I don’t see how the given equation for direct optimization makes sense. $\mathrm{KL}\left[q(\theta; x) \,\|\, p(x, \theta)\right]$ is comparing a distribution over $\theta$ with a joint distribution over $(x, \theta)$. Should this be $\mathrm{KL}\left[q_\psi(\theta) \,\|\, p(\theta \mid x)\right]$ for variational inference (where $\psi$ is whatever we’re using to parametrize the variational family), and $\mathrm{KL}\left[q(\theta \mid x) \,\|\, p(\theta \mid x)\right]$ in general?
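Possibly the joint is meant to enter via the standard ELBO identity, which holds for any $q$:

$$\mathrm{KL}\left[q(\theta) \,\|\, p(\theta \mid x)\right] = \mathbb{E}_{q(\theta)}\left[\log \frac{q(\theta)}{p(x, \theta)}\right] + \log p(x)$$

so minimizing the expected log-ratio against the joint matches minimizing the KL to the posterior up to the constant $\log p(x)$; but that log-ratio term is not itself a KL divergence, since $p(x, \theta)$ is not normalized over $\theta$.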
Secondly, why the focus on variational inference for defining direct optimization in the first place? Direct optimization is introduced as (emphasis mine):
> Direct optimization occurs when optimization power is applied immediately and directly when engaged with a new situation to explicitly compute an on-the-fly optimal response – for instance, when directly optimizing against some kind of reward function. The classic example of this is planning and Monte-Carlo-Tree-Search (MCTS) algorithms [...]
This does not sound like we’re talking about algorithms that update parameters. If I had to put the above in maths, it just sounds like an argmin:
$$g(x) = \operatorname{argmin}_{y \in A} L(y)$$
where $g$ is your AI system, $A$ is whatever action space it can explore (you can make $A$ vary based on how much compute you’re willing to spend, like with MCTS depth), and $L$ is some loss function (it could be a reward function with a flipped sign, but I’m trying to keep it comparable to the direct optimization equation).
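Concretely, a minimal sketch of what I mean (`candidate_actions` and `loss` are hypothetical stand-ins, and the budget-limited enumeration is just a crude proxy for something like MCTS depth):

```python
# Minimal sketch of direct optimization: all the optimization effort is
# spent at query time, searching the action space for the current input.
# `candidate_actions` and `loss` are hypothetical stand-ins, not from the post.

def g(x, candidate_actions, loss, budget=1000):
    """Search the (budget-limited) action space A on the fly and return the argmin."""
    best_y, best_loss = None, float("inf")
    for y in candidate_actions(x, budget):  # A can grow with compute, cf. MCTS depth
        current = loss(y)
        if current < best_loss:
            best_y, best_loss = y, current
    return best_y
```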
Also, the RHS of the amortized optimization equation is about defining a $\phi$, i.e. the parameters in your function approximator $f$, but then the LHS calls it $\theta^*_{\text{amortized}}$, which is confusing to me. I also don’t understand why the loss function is taking in parameters $\theta^*$, or why the dataset contains parameters (is $\theta$ being used throughout to stand for outputs rather than model parameters?).
To me, the natural way to phrase this concept would instead be as
$$g(x) = f_{\hat{\phi}}(x)$$
where $g$ is your AI system, and $\hat{\phi} = \operatorname{argmin}_\phi \, \mathbb{E}_{(x, y) \sim D}\left[L(y, f_\phi(x))\right]$, with the dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots\}$.
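In code, the contrast is just that the argmin runs once over the dataset, and each new query is a plain function evaluation. A minimal sketch, assuming a linear $f_\phi$ and squared loss (both illustrative choices, not from the post):

```python
import numpy as np

# Minimal sketch of the amortized version: fit phi once over the dataset D,
# then answering a new x is a single cheap forward pass, with no per-query search.
# The linear f_phi and squared loss are illustrative assumptions, not from the post.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # dataset inputs  x_i
Y = X @ np.array([1.0, -2.0, 0.5])         # dataset targets y_i

# phi_hat = argmin_phi E_{(x,y)~D}[ L(y, f_phi(x)) ], closed form for least squares
phi_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

def g(x):
    """g(x) = f_{phi_hat}(x): no optimization happens at query time."""
    return x @ phi_hat

print(g(np.array([0.2, 0.1, -0.3])))
```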
I’d be curious to hear any expansion of the motivation behind the exact maths in the post, or any way in which my version is misleading.