Your post seems to be focused more on pointing out a missing piece in the literature than on asking for a solution to the specific problem (which I believe is a valuable contribution in its own right). Regardless, here is roughly how I would understand "what they mean":
Let $X$ be the task space, $O$ the output space, $S = O^X$ the model space, $f : S \to \mathbb{R}$ our base objective, and $g_x : \Sigma \to \mathbb{R}$ the mesa objective of the model for input $x \in X$, where $\Sigma$ is the model's internal search space. Assume that there exists some map $\varphi : \Sigma \to O$ mapping internal objects to outputs of the model, such that the model $m \in S$ satisfies $m(x) = \varphi\big(\operatorname{argmax}_{\sigma \in \Sigma} g_x(\sigma)\big)$.
Given this setup, how can we reconcile $f$ and $g_x$? Assume some distribution $\nu$ over the task space is given. Moreover, assume there exists a function $u^* : X \to \mathbb{R}^O$ mapping tasks to utility functions over outputs, such that $f(m) = \mathbb{E}_{x \sim \nu}\big[u^*(x)(m(x))\big]$. Then we could define a mesa objective $u_m : X \to \mathbb{R}^O$ by $u_m(x)(o) := \max_{\sigma \in \varphi^{-1}(\{o\})} g_x(\sigma)$ if $\varphi^{-1}(\{o\}) \neq \emptyset$, and otherwise set $u_m(x)(o)$ to some very small number or $-\infty$ (replacing $\mathbb{R}$ by $\mathbb{R} \cup \{-\infty\}$ above). We can then compare $u_m$ and $u^*$ directly via some distance on the space of functions $X \to \mathbb{R}^O$.
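To make the construction concrete, here is a minimal toy sketch in Python. The particular sets, the values of $g_x$ and $u^*$, the map $\varphi$, and the sup-distance are all illustrative assumptions on my part, not part of the setup above:

```python
# Toy instantiation of the setup above (all specific names and values are illustrative).
import math

X = ["task_a", "task_b"]          # task space
O = ["out_0", "out_1"]            # output space
Sigma = ["s0", "s1", "s2"]        # internal search space of the mesa-optimizer

# phi: internal objects -> outputs
phi = {"s0": "out_0", "s1": "out_1", "s2": "out_1"}

# g_x: mesa objective over Sigma, one per task x
g = {
    "task_a": {"s0": 0.2, "s1": 0.9, "s2": 0.1},
    "task_b": {"s0": 0.7, "s1": 0.3, "s2": 0.4},
}

# u*: base utility over outputs, one per task x
u_star = {
    "task_a": {"out_0": 0.0, "out_1": 1.0},
    "task_b": {"out_0": 1.0, "out_1": 0.0},
}

def m(x):
    """Model output: phi applied to the argmax of the mesa objective g_x."""
    best_sigma = max(Sigma, key=lambda s: g[x][s])
    return phi[best_sigma]

def u_m(x, o):
    """Induced mesa utility over outputs: max of g_x over phi^{-1}({o}), -inf if empty."""
    preimage = [s for s in Sigma if phi[s] == o]
    return max(g[x][s] for s in preimage) if preimage else -math.inf

# Compare u_m and u* via a simple sup-distance over X x O (one of many possible choices).
distance = max(abs(u_m(x, o) - u_star[x][o]) for x in X for o in O)
print({x: m(x) for x in X}, distance)
```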
Why would such a function $u^*$ exist? In stochastic gradient descent, for instance, we are in fact evaluating models based on the outputs they produce on tasks distributed according to some distribution $\nu$. Moreover, such a function should probably exist given some regularity conditions imposed on an arbitrary objective $f$ (inspired by the axioms of expected utility theory).
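As one concrete instance (my own example, assuming a supervised setting with per-example loss $\ell$ and target $y(x)$), one could take

$$u^*(x)(o) := -\,\ell(o, y(x)), \qquad f(m) = \mathbb{E}_{x \sim \nu}\big[u^*(x)(m(x))\big] = -\,\mathbb{E}_{x \sim \nu}\big[\ell(m(x), y(x))\big],$$

so that $f$ is just the negated expected loss that SGD estimates from minibatches.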
Why would a function $\varphi$ exist? Some function connecting the internal search space to outputs has to exist because the model does produce outputs. In practice, the model might not optimize $g_x$ perfectly and thus might not always choose the argmax (potentially leading to suboptimality alignment), but this could probably still be accounted for in a variant of this model. Moreover, $\varphi$ could in principle differ between inputs, but again one could probably adjust the definitions to make things work.
If $m$ is a mesa-optimizer, then there should be some way to make sense of the mathematical objects described above: the mesa objective, the search space, and the map from internal objects to model outputs. Of course, how to do this exactly, especially for more general mesa-optimizers that only optimize their objectives approximately, still needs to be worked out.
These issues of preferences over objects of different types (internal states, policies, actions, etc.) and how to translate between them are also discussed in the post Agents Over Cartesian World Models.