For m : S such that m is a mesa-optimizer, let Σ_m be the space it optimizes over, and g_m : Σ_m → R be its utility function.
I know you said “which we need not notate”, but I am going to say that for s : S and x : X, that s(x) : A, where A is the space of actions (or possibly s(x) : A_x, where A_x is the space of actions available in the situation x). (Though maybe you just meant that we need not notate, separately from s, the map from X to A which s defines. In which case I agree, and as such I’m writing s(x) : A instead of saying that something belongs to the function space X → A.)
For m : S to have its optimization over Σ_m have any relevance, there has to be some connection between the σ : Σ_m chosen by m, and m(x).
So, the process by which m produces m(x) when given x should involve the selected σ. Moreover, the selection of σ ought to depend on x in some way; otherwise the choice of σ is the same every time, and can be regarded as just a constant in how m functions.
So, it seems that what I said was g_m : Σ_m → R should instead be either g_m : X × Σ_m → R, or g_{m,x} : Σ_{m,x} → R (in the latter case I suppose one might say g_m : ∑_{x:X} Σ_{m,x} → R).
Call the process that produces the action m(x) : A using the choice of σ : Σ_m by the name h_m : X × Σ_m → A (or, more generally, h_m : ∑_{x:X} Σ_{m,x} → A).
h_m is allowed to also use randomness in addition to x and σ; I’m not assuming that it is a deterministic function. Though come to think of it, I’m not sure why it would need to be non-deterministic? Oh well, regardless.
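To make the decomposition concrete, here is a toy sketch (the plan names, the tiny discrete Σ_m, and the particular g_m are all my own inventions for illustration, not anything from the post). The point is only the shape: m’s inner optimizer picks σ by maximizing g_m, possibly depending on x, and h_m turns (x, σ) into the action m(x).

```python
# Toy search space Σ_m: a handful of candidate "plans" (hypothetical).
SIGMA = ["plan_a", "plan_b", "plan_c"]

# Toy mesa-objective g_m : X × Σ_m → R (invented for illustration);
# note it takes x too, per the refinement above.
def g_m(x, sigma):
    return len(sigma) + (1 if x == "hard" and sigma == "plan_c" else 0)

def select_sigma(x):
    # m's inner optimizer: pick the σ that scores best under g_m for this x.
    return max(SIGMA, key=lambda sigma: g_m(x, sigma))

def h_m(x, sigma):
    # The process turning (x, σ) into an action m(x); deterministic here,
    # though as noted above it could also use randomness.
    return f"{sigma}-applied-to-{x}"

def m(x):
    return h_m(x, select_sigma(x))
```

On input "hard" the bonus term makes the inner search select plan_c, while on other inputs the tie is broken toward plan_a, so the chosen σ genuinely depends on x, as required above.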
Presumably whatever f : S → R is being used to select s : S depends primarily (though not necessarily exclusively) on what s(x) is for various values of x, or at least on something which indicates things about that, as f is supposed to be for selecting systems which take good actions?
Supposing that, for the mesa-optimizer m, the inner optimization procedure (which I don’t have a symbol for) and the inner optimization goal (i.e. g_m) are separate enough, one could ask: “what if we had m, except with g_m replaced with g′_m, and looked at how the outputs of h_m(x, σ) and h_m(x, σ′) differ, where σ and σ′ are selected (by m’s optimizer) by optimizing for the goals g_m and g′_m respectively?”
Supposing that we can isolate the part of how f(s) depends on s which is based on what s(x) is or tends to be for different values of x, then there would be a well-defined “how would f(m) differ if m used g′_m instead of g_m?”. If g′_m in place of g_m would result in things which, according to how f works, would be better, then it seems like it would make sense to say that g_m isn’t fully aligned with f?
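The counterfactual goal-substitution test above can be sketched numerically (a self-contained toy: Σ_m, h_m, both objectives, and f are all made up here, and f is chosen to depend on s only through its actions, per the isolation assumption):

```python
SIGMA = [0, 1, 2, 3]          # toy Σ_m
X = ["x0", "x1"]              # toy input space

def h_m(x, sigma):
    # Toy action rule: how m acts given input x and chosen σ.
    return sigma if x == "x0" else -sigma

def g_m(sigma):     return sigma      # original mesa-objective (invented)
def g_prime(sigma): return -sigma     # candidate replacement g'_m (invented)

def m_with_goal(goal):
    # "m, except with its goal swapped": same h_m, different inner objective.
    return lambda x: h_m(x, max(SIGMA, key=goal))

def f(policy):
    # Toy base objective: depends on s only through the actions s(x),
    # here preferring actions near 0.
    return sum(-abs(policy(x)) for x in X)

score_gm     = f(m_with_goal(g_m))      # f's score when m optimizes g_m
score_gprime = f(m_with_goal(g_prime))  # f's score under the swapped goal
misaligned   = score_gprime > score_gm  # g'_m would do better by f's lights
```

In this toy case the swapped-in g′_m yields a strictly higher f-score, which under the proposal above would count as evidence that g_m isn’t fully aligned with f.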
Of course, what I just described makes a number of assumptions which are questionable:
1. It assumes that there is a well-defined optimization procedure that m uses which is cleanly separable from the goal which it optimizes for.
2. It assumes that how f depends on s can be cleanly separated into a part which depends on (the map in X → A which is induced by s) and (the rest of the dependency on s).
The first of these is also connected to another potential flaw in what I said: it seems to describe the alignment with f of the combination of (the optimizer m uses) along with g_m, rather than just the alignment of g_m with f.
So, alternatively, one might disregard how the search behaves and how it selects things that score well at the goal g_m, and just compare how h_m(x, σ) and h_m(x, σ′) tend to compare when σ and σ′ are generic things which score well under g_m and g′_m respectively, rather than using the specific procedure that m uses to find something which scores well under g_m. This should also, I think, address the issue of m possibly not having a cleanly separable “how it optimizes” method that works for a generic “what it optimizes for”.
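One way to make “generic things which score well” concrete, without any reference to m’s actual search procedure (a sketch under my own made-up definitions: the tolerance, the toy Σ_m, objectives, and action rule are all assumptions):

```python
SIGMA = list(range(-5, 6))                  # toy Σ_m (invented)

def g(sigma):       return -abs(sigma - 2)  # toy mesa-objective g_m
def g_prime(sigma): return -abs(sigma + 2)  # toy alternative g'_m

def h_m(x, sigma):
    # Toy action rule combining the input and the chosen σ.
    return x + sigma

def near_optimal(goal, tol=1):
    # "Generic things which score well under the goal": every σ within
    # tol of the best achievable score, with no search procedure involved.
    best = max(goal(s) for s in SIGMA)
    return [s for s in SIGMA if goal(s) >= best - tol]

x = 0
actions     = {h_m(x, s) for s in near_optimal(g)}        # g_m's generic actions
actions_alt = {h_m(x, s) for s in near_optimal(g_prime)}  # g'_m's generic actions
```

Comparing the two action sets (rather than two search outputs) sidesteps the question of whether m’s optimizer is cleanly separable from its goal.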
The second issue, I suspect, is not really a big problem? If we are designing the outer optimizer, then presumably we understand how it evaluates things, and understand how that uses the choices of s(x) : A for different x : X.
I may have substantially misunderstood your point?
Or, was your point that the original thing didn’t lay these things out plainly, and that it should have?
Ok, reading more carefully, I see you wrote
> I can certainly imagine that it may be possible to add in details on a case-by-case basis or at least to restrict to a specific explicit class of base objectives and then explicitly define how to compare mesa-objectives to them.
and the other things right before and after that part, and so I guess the point was something like “it wasn’t stated precisely enough for the cases it is meant to apply to / it was presented as applying as a concept more generally than made sense as it was defined”, which I had sort of missed initially.
(I have no expertise in these matters; unless shown otherwise, assume that in this comment I don’t know what I’m talking about.)