When reading this, I'm left wondering where amortised optimisation lies between a quantiliser and an optimiser. That is, how much do we run into maximised-VNM-utility-style problems if we were to scale this up into AGI-like systems?
My vibe is that it seems less maximising than a pure RL version would be, but then again, I'm not certain to what extent optimising for function approximation differs from optimising for a reward.
I think amortised optimisation doesn't lie on the "quantiliser - (direct) optimiser" spectrum at all; it's another dimension entirely. Your question is a bit like asking: "where between the x and y axes does the z axis lie?"
Amortised optimisation is a fundamentally different approach, where we learn to approximate some function from a dataset and then simply evaluate the learned function.
The behaviour of the amortised policy may look similar to that of a direct optimiser on the training distribution, but it can diverge arbitrarily far on another distribution, where the correlation between the learned policy and any particular objective breaks down.
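To make the contrast concrete, here's a minimal toy sketch (my own construction, not from the post; the sinusoidal reward and the sklearn regressor are arbitrary assumptions for illustration). The direct optimiser searches over actions against an explicit reward at inference time, while the amortised policy is just a regressor fit to (state, good-action) pairs from a narrow training distribution and then evaluated, with no search and no access to the reward at inference time.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy reward: the explicit objective the direct optimiser searches over.
# The optimal action for state x is a = sin(x).
def reward(x, a):
    return -(a - np.sin(x)) ** 2

# Direct optimisation: at inference time, search candidate actions for the given state.
def direct_policy(x, candidate_actions=np.linspace(-2, 2, 401)):
    return candidate_actions[np.argmax(reward(x, candidate_actions))]

# Amortised optimisation: fit a cheap function approximator to a dataset of
# (state, near-optimal action) pairs drawn from a *training* distribution,
# then just evaluate the learned function at inference time.
train_x = np.random.uniform(-1, 1, size=(1000, 1))   # narrow training distribution
train_a = np.sin(train_x).ravel()                    # demonstrations of good actions
amortised_policy = LinearRegression().fit(train_x, train_a)

# On-distribution the two look similar (sin(x) is roughly linear near 0)...
print(direct_policy(0.3), amortised_policy.predict([[0.3]])[0])
# ...but off-distribution the amortised policy just extrapolates its learned
# function; nothing pushes it back towards the objective it never searched over.
print(direct_policy(3.0), amortised_policy.predict([[3.0]])[0])
```

The point of the toy example is only that the divergence off-distribution isn't a failure of optimisation pressure being "too weak": there simply is no objective being optimised at inference time, just a learned function being evaluated.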