This will be a fairly important post. Not one of those obscure result-packed posts, but something a bit more fundamental that I hope to refer back to many times in the future. It’s at least worth your time to read this first section up to its last paragraph.
There are quite a few places where randomization would help in designing an agent. Maybe we want to find an interpolation between an agent picking the best result, and an agent mimicking the distribution over what a human would do. Maybe we want the agent to do some random exploration in an environment. Maybe we want an agent to randomize amongst promising plans instead of committing fully to the plan it thinks is the best.
However, all of these run into the standard objection that any behavior like this, where a randomized action is the best thing to do, is unstable as the agent gets smarter and has the ability to rewrite itself. If an agent is randomizing to sometimes take actions that aren’ t optimal according to its utility function, then there will be an incentive for the agent to self-modify to eliminate its randomization into those suboptimal actions.
The formalization of this is the following proposition.
Proposition 1:Given some compact metric space of options X, if U:X→R is a bounded function, {μ|∀ν∈ΔX:μ(U)≥ν(U)}=Δ{x|∀y∈X:U(x)≥U(y)}
Intuitively, what this is saying is that the only possible way for a mixture of options to be an optimal move is if each component option is an optimal move. So, utility functions can only give you randomization behavior if the randomization is between optimal actions. The set of such will typically only contain a single point. And so, in general, for any utility function at all, an agent using it will experience a convergent pressure towards deterministic decision-making.
Every single clever alignment trick involving an agent behaving randomly or sampling from a distribution is thereby guaranteed to fail, as it’s not stable under reflection as the agent gets smarter, for anything worthy of being called an agent (in the sense that it has an implicit utility function and acts to achieve it).
The rest of this post will be about how the above sentence is false. There’s a mathematically principled, reflectively stable, way an agent can be, where randomization behavior persists. No matter how smart it gets, it won’t want to remove its randomization behavior. Reflectively stable quantilizers are back on the menu, as are reflectively stable human-imitators, reflectively stable Thompson samplers, and more.
What’s an Infrafunction?
Intuitively, just as infradistributions are a generalization of probability distributions, infrafunctions are a generalization of functions/random variables in the same direction. The next paragraph will be informal (and somewhat wrong) to not clog it with caveats.
The Fundamental Theorem of Inframeasures says that there are two ways of viewing inframeasures. The first way is to view an inframeasure as a closed convex set of measures, where the worst-case measure is picked. You don’t know what distribution in the set will be picked by reality, and so you model it as an adversarial process and plan for the worst-case. As for the second way to view an inframeasure, the thing you do with probability distributions is to take expectations of functions with them. For instance, the probability of an event is just the expected value of the function that’s 1 if the event happens, and 0 otherwise. So an inframeasure may also be viewed as a functional that takes a function X→R as an input, and outputs the expectation value, and which must fulfill some weak additional properties like concavity. Measures fulfill the much stronger property of inducing a linear function (X→R)→R.
Moving away from that, it’s important to note that the vector space of continuous functions X→R, and the vector space of (finite signed) measures on X (denoted M±(X)), are dual to each other. A function f and a measure m are combined to get an expectation value. f,m↦∫fdm (ie, taking the expectation) is the special function of type (X→R)×M±(X)→R. Every continuous linear function (X→R)→R corresponds to taking expectations with respect to some finite signed measure, and every continuous linear function M±(X)→R corresponds to taking expectations with respect to some continuous function X→R.
Since the situation is so symmetric, what happens if we just take all the mathematical machinery of the Fundamental Theorem of Inframeasures, but swap measures and functions around? Well… compare the next paragraph against the earlier paragraph about the Fundamental Theorem of Inframeasures.
The Fundamental Theorem of Infrafunctions says that there are two ways of viewing infrafunctions. The first way is to view an infrafunction as a closed convex set of functions X→R, where the worst-case function is picked. You don’t know what function in the set will be picked by reality, and so you model it as an adversarial process and plan for the worst-case. As for the second way to view an infrafunction, a thing you do with functions to R is you combine them with a probability distribution to get an expectation value. So an infrafunction may also be viewed as a function that takes a probability distribution as an input, and outputs the expectation value, and which must fulfill some weak additional properties like concavity. Functions fulfill the much stronger property of inducing a linear function M±(X)→R.
For the following theorem, a set S of functions X→R is called upper-complete if, whenever f∈S and g≥f, g∈S as well. And a function g will be called minimal in S if f≤g and f∈S implies that f=g.
Theorem 1: Fundamental Theorem of InfrafunctionsIf X is a compact metric space, there is a bijection between concave upper-semicontinuous functions of type ΔX→R∪{−∞}, and closed convex upper-complete sets of continuous functions X→R.
Conjecture 1: Continuity=CompactnessIf X is a compact metric space, there is a bijection between concave continuous functions of type ΔX→R, and closed convex upper-complete sets of continuous functions X→R where the subset of minimal functions has compact closure.
Ok, so, infrafunctions can alternately be viewed as concave (hill-like) functions ΔX→R, or closed convex upwards-complete sets of continuous functions.
Effectively, this is saying that any concave scoring function on the space of probability distributions (like the negative KL-divergence) can equivalently be viewed as a worst-case process that adversarially selects functions. For any way of scoring distributions over “what to do” where randomization ends up being the highest-scoring option, the scoring function on ΔX is probably going to curve down (be concave), and so it’ll implicitly be optimizing for the worst-case score amongst a set of functions X→R.
Now, going from a desired randomization behavior, to a concave scoring function with that desired behavior as the optimum, to the set of functions the scoring function is implicitly worst-casing over, takes nontrivial mathematical effort. There’s a whole bunch of possible randomization behaviors for which I don’t know what sort of worst-case beliefs induce that sort of randomization.
For instance, what sort of uncertainty around a utility function makes an agent softmax over it? I don’t know. Although, to be fair, I haven’t tried particularly hard.
Generalization of Quantilization
However, one example that can be worked out (and has already been worked out by Jessica Taylor and Vanessa Kosoy) is quantilization.
If you haven’t seen quantilization before, it’s where you take a reference distribution over actions, ν∈ΔX, and instead of picking the best action to optimize the function U, you condition ν on the event “the action I picked is in the top 1 percent of actions sampled from ν, in terms of how much U likes it” and sample from that distribution.
I mean, it doesn’t have to be 1 percent specifically, but this general process of “take a reference distribution, plot out how good it is w.r.t. some function U, condition on the U score being beyond some threshold, and sample from that updated distribution” is referred to as quantilization.
Well, if that’s the sort of optimization behavior we’re looking for, we might ask “what sort of concave scoring function on probability distributions has quantilization as the optimum?”, or “what sort of utility function uncertainty produces that scoring function?”
As it turns out, quantilization w.r.t. a reference distribution ν, where your utility function is U, corresponds to worst-casing amongst the following set of functions for some ϵ.
{V|∫|U−V|dν≤ϵ}
The epistemic state that leads to quantilization is therefore one where you think that your utility function is unfixably corrupted, but that its deviation from the true utility function is low relative to a given reference distribution ν. Specifically, if U is your utility function you see, and V is the true utility function which U is a corruption of, you believe that ∫|U−V|dν≤ϵ and have no other beliefs about the true utility function V.
You’d be very wary about going off-distribution because you’d think “the corruption can be arbitrarily bad off-distribution because all I believe is that ∫|U−V|dν is low, so in a region where ν is low-probability, it’s possible that V is super-low there”. You’d also be very wary about deterministically picking a single spot where U is high, because maybe the corruption is concentrated on those sparse few spots where U is the highest.
However, if you randomize uniformly amongst the top quantile of ν where U is high, this is actually the optimal response to this sort of utility function corruption. No matter how the corruption is set up (as long as it’s small relative to ν), the quantilizing policy is unlikely to do poorly w.r.t (the agent’s irreducibly uncertain beliefs about) the true utility function.
These are some rather big claims. Exactly as stated above, they’ve already been proven a long time ago by people who aren’t me. However, there’s a further-reaching generalization that the above results are a special case of, which allows interpolating between quantilization and maximization, which is novel.
Introduction to Lp Spaces (skippable)
If you already know what Lp spaces are, you can skip straight to the general theorem, but if you don’t, time to take a detour and explain them.
Given some nice space X and probability distribution ν∈ΔX, we can make the vector space of “measurable functions X→R which are equivalent w.r.t. ν”. After all, you can add functions together, and multiply by constants, so it makes a vector space. However, the elements of this vector space aren’t quite functions, they’re equivalence classes of functions that are equivalent w.r.t. ν. Ie, if f and g are two different functions, but ν has zero probability of selecting a point x where f(x)≠g(x), then f and g will be the same point in the vector space we’re constructing.
This vector space can be equipped with a norm. Actually, it can be equipped with a lot of norms. One for each real number p in [1,∞]. The Lp norm on the space of “functions that are equivalent w.r.t. ν” is:
||f||p:=(∫|f|pdν)1p
For the L2 norm, it’d be
||f||2:=√∫f2dν
Compare this to euclidean distance!
||x||=
⎷n∑i=1x2i
So, our set-of-functions which induces quantilization behavior,
{V|∫|U−V|dν≤ϵ}
Can also be expressed as “an ϵ-sized ball around U w.r.t. the L1 norm and ν”, and this is the core of how to generalize further, for we may ask, what’s so special about the L1 norm? What about the Lp norm for all the p∈[1,∞]? What do those do?
Theorem 2: Lp Ball Theorem
For any ϵ>0, p∈[1,∞], f:X→R, and ν:ΔX, the infrafunction corresponding to {g|(∫|f−g|pdν)1p≤ϵ} (Knightian uncertainty over the Lp ball of size ϵ centered at f, w.r.t ν), is μ↦μ(f)−ϵ||dμdν||q where 1p+1q=1. Further, given a function f, the optimal μ to pick is the distribution a⋅ν⋅max(f−b,0)p−1, for some constants a>0 and b≤max(f).
Some special cases of this are as follows. For p=1, you get quantilization. Worst-casing over little L1 balls means that your task is to pick the probability distribution μ which maximizes μ(f)−ϵ||dμdν||∞, and this maximizing probability distribution is ν rescaled by the function that’s 1 when f exceeds the threshold value and 0 otherwise (as this is the limit of max(f−b,0)δ as δ→0) This can be restated as conditioning ν on the event that f exceeds a certain threshold, and so we get quantilizers.
For p=2, you get a more aggressive sort of optimization. Worst-casing over little L2 balls means that your task is to pick the probability distribution μ which maximizes μ(f)−ϵ||dμdν||2, and this maximizing probability distribution is ν but rescaled linearly with how much f exceeds the threshold value b. So, for example, given two points x and y, if f(x)−b=3(f(y)−b), the probability density at x is enhanced by a factor of 3 over what it’d be at y.
For p=∞, you basically just get argmax over the support of the distribution ν. Worst-casing over little L∞ balls means that your task is to pick the probability distribution μ which maximizes μ(f)−ϵ||dμdν||1, and this maximizing probability distribution is ν but rescaled according to max(f−b,0)n for arbitrarily large n. Ie, incredibly large at the highest value of f, which dominates the other values. So pretty much, a dirac-delta distribution at the best point in the support of ν.
There’s a massive amount of work to be done on how various sorts of randomization behavior from agents relate to various sorts of concave scoring rules for distributions, and how those relate with various sorts of (restricted) worst-case assumptions about how the utility function got corrupted.
Dynamic Consistency??
But what’s this about agents with infrafunctions U as their utility functions being stable under reflection? Well, we’d want an agent to be incentivized to keep its utility function the same (or at least not change it in unpredictable ways) no matter what it sees. Making this more precise, if an agent has a utility (infra)function U, then it should believe that optimizing for the infrafunction U|h (U but modified in a predictable way to account for having seen history h) after seeing history h, will produce equal or better results (according to U) than optimizing for any competitor (infra)function V after seeing h. This is a necessary condition to give the starting agent no incentive to alter the utility function of its future self in an unpredictable way (ie, alter it in a way that differs from U|h).
For example, if an agent with an infrafunction U ever ends up thinking “If I was an optimizer for the utility function (not infrafunction!) V, I’d do better, hang on, lemme just rewrite myself to optimize that instead”, that would be instability under reflection. That just should not ever happen. Infrafunctions shouldn’t collapse into utility functions.
And, as it turns out, if you’ve got an agent operating an an environment with a utility infrafunction, there is a way to update the infrafunction over time which makes this happen. The agent won’t want to change its infrafunction in any way other than by updating it. However, the update shares an undesirable property with the dynamically consistent way of updating infradistributions, though. Specifically, the way to update a utility infrafunction (after you’ve seen a history) depends on what the agent’s policy would do in other branches.
If you’re wondering why the heck we need to update our utility infrafunction over time, and why updating would require knowing what happens in alternate timelines, here’s why. The agent is optimizing worst-case expected value of the functions within its set-of-functions. Thus, the agent will tend to focus its marginal efforts on optimizing for the utility functions in its set which have the lowest expected value, in ways that don’t destroy too much value for the utility functions which are already doing well in expectation. And so, for a given function f∈^U (the set induced by the infrafunction U), it matters very much whether the agent is doing quite well according to f in alternate branches (it’s doing well in expectation so it’s safe to mostly ignore it in this branch), or whether the agent is scoring horribly according to f in the alternate branches (which means that it needs to be optimized in this branch).
Time to introduce the notation to express how the update works. If h is a finite history, and π¬h is a stochastic partial policy that tells the agent what to do in all situations except where the history has h as a prefix, and π∗ is a stochastic partial policy that tells the agent what to do in all situations where the history has h as a prefix, then π∗∙π¬h is the overall policy made by gluing together those two partial policies.
Also, if e is an environment, and π is a stochastic policy, π⋅e refers to the distribution over infinite histories produced by the policy interacting with the environment. By abuse of notation, π¬h⋅e can be interpreted as a probability distribution on
{h′∈(A×O)ω|h⋢h′}∪{h}
This is because, obeying π¬h, everything either works just fine and π¬h keeps telling you what your next action is and you build some infinite history h′, or the partial history h happens and π¬h stops telling you what to do.
The notation 1¬h is the indicator function that’s 1 when the full history lacks h as a prefix, and 0 on the partial history h. U↓h is the function U, but with a restricted domain so it’s only defined on infinite bitstrings with h as a prefix.
With those notations out of the way, given an infrafunction U (and using U for the utility functions in the corresponding set ^U), we can finally say how to define the updated form of U, where we’re updating on some arbitrary history h, environment e, and off-history partial stochastic policy π¬h which tells us how we act for all histories that lack h as a prefix.
Definition 1: Infrafunction UpdateFor an infrafunction U of type Δ(A×O)ω→R, history h, environment e, and partial stochastic policy π¬h which specifies all aspects of how the agent behaves except after history h, U|e,h,π¬h, the update of the infrafunction, is the infrafunction corresponding to the set of functions{(π¬h⋅e)(1h)⋅U↓h+(π¬h⋅e)(1¬hU)|U∈^U}Or, restated,{Eh′∼π¬h⋅e[1h]⋅U↓h+Eh′∼π¬h⋅e[1¬hU]|U∈^U}
So, basically, what this is doing is it’s taking all the component functions, restricting them to just be about what happens beyond the partial history h, scaling them down, and taking the behavior of those functions off-h to determine what constant to add to the new function. So, functions which do well off-h get a larger constant added to them than functions which do poorly off-h.
And now we get to our theorem, that if you update infrafunctions in this way, it’s always better (from the perspective of the start of time) to optimize for the updated infrafunction than to go off and rewrite yourself to optimize for something else.
Theorem 3: Dynamic ConsistencyFor any environment e, finite history h, off-history policy π¬h, and infrafunctions U and V, we have thatU((π¬h∙argmaxπ∗(U|e,h,π¬h)(π∗⋅(e|h)))⋅e)≥U((π¬h∙argmaxπ∗V(π∗⋅(e|h)))⋅e)Or, restating in words, selecting the after-h policy by argmaxing for U|e,h,π¬h makes an overall policy that outscores the policy you get by selecting the after-h policy to argmax for V.
Ok, but what does this sort of update mean in practice? Well, intuitively, if you’re optimizing according to an infrafunction, and some of the component functions you’re worst-casing over are sufficiently well-satisfied in other branches, they kind of “drop out”. We’re optimizing the worst-case, so the functions that are doing pretty well elsewhere can be ignored as long as you’re not acting disastrously with respect to them. You’re willing to take a hit according to those functions, in order to do well according to the functions that aren’t being well-satisfied in other branches.
Why Worst Case?
Worst case seems a bit sketchy. Aren’t there more sane things to do like, have a probability distribution on utility functions, and combine them according to geometric average? That’s what Nash Bargaining does to aggregate a bunch of utility functions into one! Scott Garrabrant wrote an entire sequence about that sort of stuff!
Well, guess what, it fits in the infrafunction framework. Geometric averaging of utility functions ends up being writeable as an infrafunction! (But I don’t know what it corresponds to worst-casing over). First up, a handy little result.
Proposition 2: Lp Double Integral InequalityIf f≥0, let ∮pfdμ be an abbreviation for (∫fpdμ)1p. Then for all p∈[−∞,1], and μ:ΔX and ν:ΔY and f:X×Y→R≥0, we have that∫∮pf(x,y)dνdμ≤∮p∫f(x,y)dμdν
Corollary 1: Lp-Averaging is Well-DefinedGiven any distribution ν over a family of functions or infrafunctions Fi≥0, define the Lp-average of this family (for p∈[−∞,1]) as the function
μ↦∮pFi(μ)dνLp-averaging always produces an infrafunction.
Corollary 2: Geometric Mean Makes InfrafunctionsThe geometric mean of a distribution ν over utility functions is an infrafunction.
Proof:
The geometric mean of a distribution over utility functions Ui is the function
μ↦e∫ln(Ui(μ))dν
However, the geometric mean is the same as the L0 integral. So we get that it’s actually writeable as
μ↦∮0Ui(μ)dν
And we can apply Corollary 2 to get that it’s an infrafunction.
So, this Lp-mixing, for p∈[−∞,1] is… well, for 1, it’s just usual mixing. For 0, it’s taking the geometric average of the functions. For −∞, it’s taking the minimum of all the functions. So, it provides a nice way to interpolate between minimization, geometric averaging, and arithmetic averaging, and all these ways of aggregating functions produce infrafunctions.
Just don’t ask me what utility functions are actually in the infrafunction corresponding to a geometric mixture.
The Crappy Optimizer Theorem
Technically this theorem doesn’t actually belong in this post. But it’s close enough to the subject matter of this post to throw it in anyways. It turns out that “every” (not really) vaguely optimizer-ish process can be reexpressed as some sort of ultradistribution. An ultradistribution is basically an infradistribution (a closed convex set of probability distributions), except it maximizes functions instead of minimizing them.
And so, “every” (not really) optimizer-y process can be thought of as just argmax operating over a more restricted set of probability distributions.
Try not to read too much into the Crappy Optimizer Theorem, I’d very strongly advise that you take your favorite non-argmax process and work out how it violates the assumptions of the theorem. Hopefully that’ll stop you from thinking this theorem is the final word on optimization processes.
Anyways, let’s discuss this. The type signature of argmax is (X→R)→X. Let’s say we’re looking for some new sort of optimizer that isn’t argmax. We want a function s of the same type signature, that “doesn’t try as hard”.
We don’t actually know what s is! It could be anything. However, there’s a function Q:((X→R)→X)→((X→R)→R), which I’ll call the “score-shift” function, and it’s defined as follows.
s↦(f↦f(s(f)))
Basically, given an optimizer and a function, you run the optimizer on the function to get a good input, and shove that input through the function to get a score. As a concrete example, Q(argmax)=max. If you have a function f, argmax over it, and plug the result of argmax back into f, that’s the same as taking the function f and producing a score of max(f).
So, instead of studying the magical optimizer black box s, we’ll be studying Q(s) instead, and characterizing the optimization process by what scores it attains on various functions. There are four properties in particular, which it seems like any good optimizer should fulfill.
1: c-additivity
For any constant function c and function f, Q(s)(f+c)=Q(s)(f)+c.
2: Homogenity
For any a≥0 and function f, Q(s)(a⋅f)=a⋅Q(s)(f).
3: Subadditivity
For any functions f,g, Q(s)(f+g)≤Q(s)(f)+Q(s)(g).
4: Zero bound
For any function f≤0, Q(s)(f)≤0.
Rephrasing this, though we don’t know what the optimization-y process sis, it’s quite plausible that it’ll fulfill the following four properties.
1: If you add a constant to the input function, you’ll get that constant added to your score.
2: If you rescale the input function, it rescales the score the optimization process attains.
3: Optimizing the sum of two functions does worse than optimizing them separately and adding your best scores together.
4: Optimizing a function that’s never positive can’t produce a positive score.
Exercise: Which of these properties does softmax break? Which of these properties does gradient ascent with infinitesimal step-size break?
Also, notice that the first two properties combined are effectively saying “if you try to optimize a utility function, the optimization process will ignore scales and shifts in that utility function”.
Theorem 4: Crappy Optimizer TheoremFor any selection process s where Q(s) fulfills the four properties above, ∀f:Q(s)(f)=maxμ∈Ψμ(f) will hold for some closed convex set of probability distributions Ψ. Conversely, the function f↦maxμ∈Ψμ(f) for any closed convex set Ψ will fulfill the four properties of an optimization process.
Informal Corollary 3:Any selection process s where Q(s) fulfills the four properties is effectively just argmax but using a restricted set of probability distributions.
Other Aspects of Infrafunctions
Infrafunctions are the analogue of random variables in inframeasure theory. Here are two useful properties of them.
First off, in the classical case of functions, we can average functions together. There’s a distinguished averaging function Δ(X→R)→(X→R). Pretty obvious.
When we go to infrafunctions, this gets extended somewhat. There’s a distinguished function □FX→FX, where □ is the space of infradistributions, and FX is the space of infrafunctions. If you know enough category theory, we can say that the space of infrafunctions is a □-algebra, where □ is the infradistribution monad. If you don’t know enough category theory, basically, there’s an infra-version of “averaging points together” and it makes all the diagrams commute really nicely.
Proposition 3:The space of infrafunctions is a □-algebra, with the function flatFX:□FX→FX being defined as λψ.λμ.ψ(λF.F(μ)).
Also, in the classical case, if you’ve got a function g:X→Y, that produces a function (Y→R)→(X→R) in a pretty obvious way. Just precompose. f↦(x↦f(g(x))).
A similar thing holds here. Except, in this case, instead of a function, we can generalize further to a continuous infrakernel k:X→□Y. Again, to get a function FY→FX, you just precompose. G↦(μ↦G(k∗(μ)). Take the distribution μ, shove it through the infrakernel k to get an infradistribution on Y, and shove that through the infrafunction G.
Proposition 4:All continuous infrakernels k:X→□Y induce a function FY→FX via G↦(μ↦G(k∗(μ))
So, given an infrakernel (and functions are a special case of this) going one way, you can transfer infrafunctions backwards from one space to the other.
The restriction to continuous infrakernels, though, actually does matter here. A function f:X→Y induces two infrakernels, one of type X→□Y (the image), and one of type Y→□X (the preimage). So, theoretically we could get a function of type (X→Y)→(FX→FY) by routing through the preimage function. However, since it requires continuity of the function Y→□X to make things work out, you can only reverse the direction if the function mapping a point x to its preimage is Hausdorff-continuous. So, for functions with Hausdorff-continuous inverses, you can flip the usual direction and go (X→Y)→(FX→FY). But this trick doesn’t work in general, only (X→□Y)→(FY→FX) is valid in general.
There’s a bunch of other things you can do, like intersection of infrafunctions making a new infrafunction, and union making a new infrafunction. Really, most of the same stuff as works with infradistributions.
The field is wide-open. But it’s a single framework that can accomodate Scott’s generalized epistemic states, Scott’s geometric averaging, Knightian uncertainty, intersecting and unioning that uncertainty, averaging, quantilizers, dynamic consistency, worst-case reasoning, and approximate maximization. So it seems quite promising for future use.
Infrafunctions and Robust Optimization
Proofs are in this link
This will be a fairly important post. Not one of those obscure result-packed posts, but something a bit more fundamental that I hope to refer back to many times in the future. It’s at least worth your time to read this first section up to its last paragraph.
There are quite a few places where randomization would help in designing an agent. Maybe we want to find an interpolation between an agent picking the best result, and an agent mimicking the distribution over what a human would do. Maybe we want the agent to do some random exploration in an environment. Maybe we want an agent to randomize amongst promising plans instead of committing fully to the plan it thinks is the best.
However, all of these run into the standard objection that any behavior like this, where a randomized action is the best thing to do, is unstable as the agent gets smarter and has the ability to rewrite itself. If an agent is randomizing to sometimes take actions that aren’ t optimal according to its utility function, then there will be an incentive for the agent to self-modify to eliminate its randomization into those suboptimal actions.
The formalization of this is the following proposition.
Proposition 1: Given some compact metric space of options X, if U:X→R is a bounded function, {μ|∀ν∈ΔX:μ(U)≥ν(U)}=Δ{x|∀y∈X:U(x)≥U(y)}
Intuitively, what this is saying is that the only possible way for a mixture of options to be an optimal move is if each component option is an optimal move. So, utility functions can only give you randomization behavior if the randomization is between optimal actions. The set of such will typically only contain a single point. And so, in general, for any utility function at all, an agent using it will experience a convergent pressure towards deterministic decision-making.
Every single clever alignment trick involving an agent behaving randomly or sampling from a distribution is thereby guaranteed to fail, as it’s not stable under reflection as the agent gets smarter, for anything worthy of being called an agent (in the sense that it has an implicit utility function and acts to achieve it).
The rest of this post will be about how the above sentence is false. There’s a mathematically principled, reflectively stable, way an agent can be, where randomization behavior persists. No matter how smart it gets, it won’t want to remove its randomization behavior. Reflectively stable quantilizers are back on the menu, as are reflectively stable human-imitators, reflectively stable Thompson samplers, and more.
What’s an Infrafunction?
Intuitively, just as infradistributions are a generalization of probability distributions, infrafunctions are a generalization of functions/random variables in the same direction. The next paragraph will be informal (and somewhat wrong) to not clog it with caveats.
The Fundamental Theorem of Inframeasures says that there are two ways of viewing inframeasures. The first way is to view an inframeasure as a closed convex set of measures, where the worst-case measure is picked. You don’t know what distribution in the set will be picked by reality, and so you model it as an adversarial process and plan for the worst-case. As for the second way to view an inframeasure, the thing you do with probability distributions is to take expectations of functions with them. For instance, the probability of an event is just the expected value of the function that’s 1 if the event happens, and 0 otherwise. So an inframeasure may also be viewed as a functional that takes a function X→R as an input, and outputs the expectation value, and which must fulfill some weak additional properties like concavity. Measures fulfill the much stronger property of inducing a linear function (X→R)→R.
Moving away from that, it’s important to note that the vector space of continuous functions X→R, and the vector space of (finite signed) measures on X (denoted M±(X)), are dual to each other. A function f and a measure m are combined to get an expectation value. f,m↦∫fdm (ie, taking the expectation) is the special function of type (X→R)×M±(X)→R. Every continuous linear function (X→R)→R corresponds to taking expectations with respect to some finite signed measure, and every continuous linear function M±(X)→R corresponds to taking expectations with respect to some continuous function X→R.
Since the situation is so symmetric, what happens if we just take all the mathematical machinery of the Fundamental Theorem of Inframeasures, but swap measures and functions around? Well… compare the next paragraph against the earlier paragraph about the Fundamental Theorem of Inframeasures.
The Fundamental Theorem of Infrafunctions says that there are two ways of viewing infrafunctions. The first way is to view an infrafunction as a closed convex set of functions X→R, where the worst-case function is picked. You don’t know what function in the set will be picked by reality, and so you model it as an adversarial process and plan for the worst-case. As for the second way to view an infrafunction, a thing you do with functions to R is you combine them with a probability distribution to get an expectation value. So an infrafunction may also be viewed as a function that takes a probability distribution as an input, and outputs the expectation value, and which must fulfill some weak additional properties like concavity. Functions fulfill the much stronger property of inducing a linear function M±(X)→R.
For the following theorem, a set S of functions X→R is called upper-complete if, whenever f∈S and g≥f, g∈S as well. And a function g will be called minimal in S if f≤g and f∈S implies that f=g.
Theorem 1: Fundamental Theorem of Infrafunctions If X is a compact metric space, there is a bijection between concave upper-semicontinuous functions of type ΔX→R∪{−∞}, and closed convex upper-complete sets of continuous functions X→R.
Conjecture 1: Continuity=Compactness If X is a compact metric space, there is a bijection between concave continuous functions of type ΔX→R, and closed convex upper-complete sets of continuous functions X→R where the subset of minimal functions has compact closure.
Ok, so, infrafunctions can alternately be viewed as concave (hill-like) functions ΔX→R, or closed convex upwards-complete sets of continuous functions.
Effectively, this is saying that any concave scoring function on the space of probability distributions (like the negative KL-divergence) can equivalently be viewed as a worst-case process that adversarially selects functions. For any way of scoring distributions over “what to do” where randomization ends up being the highest-scoring option, the scoring function on ΔX is probably going to curve down (be concave), and so it’ll implicitly be optimizing for the worst-case score amongst a set of functions X→R.
Now, going from a desired randomization behavior, to a concave scoring function with that desired behavior as the optimum, to the set of functions the scoring function is implicitly worst-casing over, takes nontrivial mathematical effort. There’s a whole bunch of possible randomization behaviors for which I don’t know what sort of worst-case beliefs induce that sort of randomization.
For instance, what sort of uncertainty around a utility function makes an agent softmax over it? I don’t know. Although, to be fair, I haven’t tried particularly hard.
Generalization of Quantilization
However, one example that can be worked out (and has already been worked out by Jessica Taylor and Vanessa Kosoy) is quantilization.
If you haven’t seen quantilization before, it’s where you take a reference distribution over actions, ν∈ΔX, and instead of picking the best action to optimize the function U, you condition ν on the event “the action I picked is in the top 1 percent of actions sampled from ν, in terms of how much U likes it” and sample from that distribution.
I mean, it doesn’t have to be 1 percent specifically, but this general process of “take a reference distribution, plot out how good it is w.r.t. some function U, condition on the U score being beyond some threshold, and sample from that updated distribution” is referred to as quantilization.
Well, if that’s the sort of optimization behavior we’re looking for, we might ask “what sort of concave scoring function on probability distributions has quantilization as the optimum?”, or “what sort of utility function uncertainty produces that scoring function?”
As it turns out, quantilization w.r.t. a reference distribution ν, where your utility function is U, corresponds to worst-casing amongst the following set of functions for some ϵ. {V|∫|U−V|dν≤ϵ} The epistemic state that leads to quantilization is therefore one where you think that your utility function is unfixably corrupted, but that its deviation from the true utility function is low relative to a given reference distribution ν. Specifically, if U is your utility function you see, and V is the true utility function which U is a corruption of, you believe that ∫|U−V|dν≤ϵ and have no other beliefs about the true utility function V.
You’d be very wary about going off-distribution because you’d think “the corruption can be arbitrarily bad off-distribution because all I believe is that ∫|U−V|dν is low, so in a region where ν is low-probability, it’s possible that V is super-low there”. You’d also be very wary about deterministically picking a single spot where U is high, because maybe the corruption is concentrated on those sparse few spots where U is the highest.
However, if you randomize uniformly amongst the top quantile of ν where U is high, this is actually the optimal response to this sort of utility function corruption. No matter how the corruption is set up (as long as it’s small relative to ν), the quantilizing policy is unlikely to do poorly w.r.t (the agent’s irreducibly uncertain beliefs about) the true utility function.
These are some rather big claims. Exactly as stated above, they’ve already been proven a long time ago by people who aren’t me. However, there’s a further-reaching generalization that the above results are a special case of, which allows interpolating between quantilization and maximization, which is novel.
Introduction to Lp Spaces (skippable)
If you already know what Lp spaces are, you can skip straight to the general theorem, but if you don’t, time to take a detour and explain them.
Given some nice space X and probability distribution ν∈ΔX, we can make the vector space of “measurable functions X→R which are equivalent w.r.t. ν”. After all, you can add functions together, and multiply by constants, so it makes a vector space. However, the elements of this vector space aren’t quite functions, they’re equivalence classes of functions that are equivalent w.r.t. ν. Ie, if f and g are two different functions, but ν has zero probability of selecting a point x where f(x)≠g(x), then f and g will be the same point in the vector space we’re constructing.
This vector space can be equipped with a norm. Actually, it can be equipped with a lot of norms. One for each real number p in [1,∞]. The Lp norm on the space of “functions that are equivalent w.r.t. ν” is: ||f||p:=(∫|f|pdν)1p For the L2 norm, it’d be ||f||2:=√∫f2dν Compare this to euclidean distance! ||x||= ⎷n∑i=1x2i So, our set-of-functions which induces quantilization behavior, {V|∫|U−V|dν≤ϵ} Can also be expressed as “an ϵ-sized ball around U w.r.t. the L1 norm and ν”, and this is the core of how to generalize further, for we may ask, what’s so special about the L1 norm? What about the Lp norm for all the p∈[1,∞]? What do those do?
Theorem 2: Lp Ball Theorem
For any ϵ>0, p∈[1,∞], f:X→R, and ν:ΔX, the infrafunction corresponding to {g|(∫|f−g|pdν)1p≤ϵ} (Knightian uncertainty over the Lp ball of size ϵ centered at f, w.r.t ν), is μ↦μ(f)−ϵ||dμdν||q where 1p+1q=1. Further, given a function f, the optimal μ to pick is the distribution a⋅ν⋅max(f−b,0)p−1, for some constants a>0 and b≤max(f).
Some special cases of this are as follows. For p=1, you get quantilization. Worst-casing over little L1 balls means that your task is to pick the probability distribution μ which maximizes μ(f)−ϵ||dμdν||∞, and this maximizing probability distribution is ν rescaled by the function that’s 1 when f exceeds the threshold value and 0 otherwise (as this is the limit of max(f−b,0)δ as δ→0) This can be restated as conditioning ν on the event that f exceeds a certain threshold, and so we get quantilizers.
For p=2, you get a more aggressive sort of optimization. Worst-casing over little L2 balls means that your task is to pick the probability distribution μ which maximizes μ(f)−ϵ||dμdν||2, and this maximizing probability distribution is ν but rescaled linearly with how much f exceeds the threshold value b. So, for example, given two points x and y, if f(x)−b=3(f(y)−b), the probability density at x is enhanced by a factor of 3 over what it’d be at y.
For p=∞, you basically just get argmax over the support of the distribution ν. Worst-casing over little L∞ balls means that your task is to pick the probability distribution μ which maximizes μ(f)−ϵ||dμdν||1, and this maximizing probability distribution is ν but rescaled according to max(f−b,0)n for arbitrarily large n. Ie, incredibly large at the highest value of f, which dominates the other values. So pretty much, a dirac-delta distribution at the best point in the support of ν.
There’s a massive amount of work to be done on how various sorts of randomization behavior from agents relate to various sorts of concave scoring rules for distributions, and how those relate with various sorts of (restricted) worst-case assumptions about how the utility function got corrupted.
Dynamic Consistency??
But what’s this about agents with infrafunctions U as their utility functions being stable under reflection? Well, we’d want an agent to be incentivized to keep its utility function the same (or at least not change it in unpredictable ways) no matter what it sees. Making this more precise, if an agent has a utility (infra)function U, then it should believe that optimizing for the infrafunction U|h (U but modified in a predictable way to account for having seen history h) after seeing history h, will produce equal or better results (according to U) than optimizing for any competitor (infra)function V after seeing h. This is a necessary condition to give the starting agent no incentive to alter the utility function of its future self in an unpredictable way (ie, alter it in a way that differs from U|h).
For example, if an agent with an infrafunction U ever ends up thinking “If I was an optimizer for the utility function (not infrafunction!) V, I’d do better, hang on, lemme just rewrite myself to optimize that instead”, that would be instability under reflection. That just should not ever happen. Infrafunctions shouldn’t collapse into utility functions.
And, as it turns out, if you’ve got an agent operating an an environment with a utility infrafunction, there is a way to update the infrafunction over time which makes this happen. The agent won’t want to change its infrafunction in any way other than by updating it. However, the update shares an undesirable property with the dynamically consistent way of updating infradistributions, though. Specifically, the way to update a utility infrafunction (after you’ve seen a history) depends on what the agent’s policy would do in other branches.
If you’re wondering why the heck we need to update our utility infrafunction over time, and why updating would require knowing what happens in alternate timelines, here’s why. The agent is optimizing worst-case expected value of the functions within its set-of-functions. Thus, the agent will tend to focus its marginal efforts on optimizing for the utility functions in its set which have the lowest expected value, in ways that don’t destroy too much value for the utility functions which are already doing well in expectation. And so, for a given function f∈^U (the set induced by the infrafunction U), it matters very much whether the agent is doing quite well according to f in alternate branches (it’s doing well in expectation so it’s safe to mostly ignore it in this branch), or whether the agent is scoring horribly according to f in the alternate branches (which means that it needs to be optimized in this branch).
Time to introduce the notation to express how the update works. If h is a finite history, and π¬h is a stochastic partial policy that tells the agent what to do in all situations except where the history has h as a prefix, and π∗ is a stochastic partial policy that tells the agent what to do in all situations where the history has h as a prefix, then π∗∙π¬h is the overall policy made by gluing together those two partial policies.
Also, if e is an environment, and π is a stochastic policy, π⋅e refers to the distribution over infinite histories produced by the policy interacting with the environment. By abuse of notation, π¬h⋅e can be interpreted as a probability distribution on {h′∈(A×O)ω|h⋢h′}∪{h} This is because, obeying π¬h, everything either works just fine and π¬h keeps telling you what your next action is and you build some infinite history h′, or the partial history h happens and π¬h stops telling you what to do.
The notation 1¬h is the indicator function that’s 1 when the full history lacks h as a prefix, and 0 on the partial history h. U↓h is the function U, but with a restricted domain so it’s only defined on infinite bitstrings with h as a prefix.
With those notations out of the way, given an infrafunction U (and using U for the utility functions in the corresponding set ^U), we can finally say how to define the updated form of U, where we’re updating on some arbitrary history h, environment e, and off-history partial stochastic policy π¬h which tells us how we act for all histories that lack h as a prefix.
Definition 1: Infrafunction Update For an infrafunction U of type Δ(A×O)ω→R, history h, environment e, and partial stochastic policy π¬h which specifies all aspects of how the agent behaves except after history h, U|e,h,π¬h, the update of the infrafunction, is the infrafunction corresponding to the set of functions {(π¬h⋅e)(1h)⋅U↓h+(π¬h⋅e)(1¬hU)|U∈^U} Or, restated, {Eh′∼π¬h⋅e[1h]⋅U↓h+Eh′∼π¬h⋅e[1¬hU]|U∈^U}
So, basically, what this is doing is it’s taking all the component functions, restricting them to just be about what happens beyond the partial history h, scaling them down, and taking the behavior of those functions off-h to determine what constant to add to the new function. So, functions which do well off-h get a larger constant added to them than functions which do poorly off-h.
And now we get to our theorem, that if you update infrafunctions in this way, it’s always better (from the perspective of the start of time) to optimize for the updated infrafunction than to go off and rewrite yourself to optimize for something else.
Theorem 3: Dynamic Consistency For any environment e, finite history h, off-history policy π¬h, and infrafunctions U and V, we have that U((π¬h∙argmaxπ∗(U|e,h,π¬h)(π∗⋅(e|h)))⋅e)≥U((π¬h∙argmaxπ∗V(π∗⋅(e|h)))⋅e) Or, restating in words, selecting the after-h policy by argmaxing for U|e,h,π¬h makes an overall policy that outscores the policy you get by selecting the after-h policy to argmax for V.
Ok, but what does this sort of update mean in practice? Well, intuitively, if you’re optimizing according to an infrafunction, and some of the component functions you’re worst-casing over are sufficiently well-satisfied in other branches, they kind of “drop out”. We’re optimizing the worst-case, so the functions that are doing pretty well elsewhere can be ignored as long as you’re not acting disastrously with respect to them. You’re willing to take a hit according to those functions, in order to do well according to the functions that aren’t being well-satisfied in other branches.
Why Worst Case?
Worst case seems a bit sketchy. Aren’t there more sane things to do like, have a probability distribution on utility functions, and combine them according to geometric average? That’s what Nash Bargaining does to aggregate a bunch of utility functions into one! Scott Garrabrant wrote an entire sequence about that sort of stuff!
Well, guess what, it fits in the infrafunction framework. Geometric averaging of utility functions ends up being writeable as an infrafunction! (But I don’t know what it corresponds to worst-casing over). First up, a handy little result.
Proposition 2: Lp Double Integral Inequality If f≥0, let ∮pfdμ be an abbreviation for (∫fpdμ)1p. Then for all p∈[−∞,1], and μ:ΔX and ν:ΔY and f:X×Y→R≥0, we have that ∫∮pf(x,y)dνdμ≤∮p∫f(x,y)dμdν
Corollary 1: Lp-Averaging is Well-Defined Given any distribution ν over a family of functions or infrafunctions Fi≥0, define the Lp-average of this family (for p∈[−∞,1]) as the function μ↦∮pFi(μ)dν Lp-averaging always produces an infrafunction.
Corollary 2: Geometric Mean Makes Infrafunctions The geometric mean of a distribution ν over utility functions is an infrafunction.
Proof: The geometric mean of a distribution over utility functions Ui is the function μ↦e∫ln(Ui(μ))dν However, the geometric mean is the same as the L0 integral. So we get that it’s actually writeable as μ↦∮0Ui(μ)dν And we can apply Corollary 2 to get that it’s an infrafunction.
So, this Lp-mixing, for p∈[−∞,1] is… well, for 1, it’s just usual mixing. For 0, it’s taking the geometric average of the functions. For −∞, it’s taking the minimum of all the functions. So, it provides a nice way to interpolate between minimization, geometric averaging, and arithmetic averaging, and all these ways of aggregating functions produce infrafunctions.
Just don’t ask me what utility functions are actually in the infrafunction corresponding to a geometric mixture.
The Crappy Optimizer Theorem
Technically this theorem doesn’t actually belong in this post. But it’s close enough to the subject matter of this post to throw it in anyways. It turns out that “every” (not really) vaguely optimizer-ish process can be reexpressed as some sort of ultradistribution. An ultradistribution is basically an infradistribution (a closed convex set of probability distributions), except it maximizes functions instead of minimizing them.
And so, “every” (not really) optimizer-y process can be thought of as just argmax operating over a more restricted set of probability distributions.
Try not to read too much into the Crappy Optimizer Theorem, I’d very strongly advise that you take your favorite non-argmax process and work out how it violates the assumptions of the theorem. Hopefully that’ll stop you from thinking this theorem is the final word on optimization processes.
Anyways, let’s discuss this. The type signature of argmax is (X→R)→X. Let’s say we’re looking for some new sort of optimizer that isn’t argmax. We want a function s of the same type signature, that “doesn’t try as hard”.
We don’t actually know what s is! It could be anything. However, there’s a function Q:((X→R)→X)→((X→R)→R), which I’ll call the “score-shift” function, and it’s defined as follows. s↦(f↦f(s(f))) Basically, given an optimizer and a function, you run the optimizer on the function to get a good input, and shove that input through the function to get a score. As a concrete example, Q(argmax)=max. If you have a function f, argmax over it, and plug the result of argmax back into f, that’s the same as taking the function f and producing a score of max(f).
So, instead of studying the magical optimizer black box s, we’ll be studying Q(s) instead, and characterizing the optimization process by what scores it attains on various functions. There are four properties in particular, which it seems like any good optimizer should fulfill.
1: c-additivity For any constant function c and function f, Q(s)(f+c)=Q(s)(f)+c.
2: Homogenity For any a≥0 and function f, Q(s)(a⋅f)=a⋅Q(s)(f).
3: Subadditivity For any functions f,g, Q(s)(f+g)≤Q(s)(f)+Q(s)(g).
4: Zero bound For any function f≤0, Q(s)(f)≤0.
Rephrasing this, though we don’t know what the optimization-y process s is, it’s quite plausible that it’ll fulfill the following four properties.
1: If you add a constant to the input function, you’ll get that constant added to your score.
2: If you rescale the input function, it rescales the score the optimization process attains.
3: Optimizing the sum of two functions does worse than optimizing them separately and adding your best scores together.
4: Optimizing a function that’s never positive can’t produce a positive score.
Exercise: Which of these properties does softmax break? Which of these properties does gradient ascent with infinitesimal step-size break?
Also, notice that the first two properties combined are effectively saying “if you try to optimize a utility function, the optimization process will ignore scales and shifts in that utility function”.
Theorem 4: Crappy Optimizer Theorem For any selection process s where Q(s) fulfills the four properties above, ∀f:Q(s)(f)=maxμ∈Ψμ(f) will hold for some closed convex set of probability distributions Ψ. Conversely, the function f↦maxμ∈Ψμ(f) for any closed convex set Ψ will fulfill the four properties of an optimization process.
Informal Corollary 3: Any selection process s where Q(s) fulfills the four properties is effectively just argmax but using a restricted set of probability distributions.
Other Aspects of Infrafunctions
Infrafunctions are the analogue of random variables in inframeasure theory. Here are two useful properties of them.
First off, in the classical case of functions, we can average functions together. There’s a distinguished averaging function Δ(X→R)→(X→R). Pretty obvious.
When we go to infrafunctions, this gets extended somewhat. There’s a distinguished function □FX→FX, where □ is the space of infradistributions, and FX is the space of infrafunctions. If you know enough category theory, we can say that the space of infrafunctions is a □-algebra, where □ is the infradistribution monad. If you don’t know enough category theory, basically, there’s an infra-version of “averaging points together” and it makes all the diagrams commute really nicely.
Proposition 3: The space of infrafunctions is a □-algebra, with the function flatFX:□FX→FX being defined as λψ.λμ.ψ(λF.F(μ)).
Also, in the classical case, if you’ve got a function g:X→Y, that produces a function (Y→R)→(X→R) in a pretty obvious way. Just precompose. f↦(x↦f(g(x))).
A similar thing holds here. Except, in this case, instead of a function, we can generalize further to a continuous infrakernel k:X→□Y. Again, to get a function FY→FX, you just precompose. G↦(μ↦G(k∗(μ)). Take the distribution μ, shove it through the infrakernel k to get an infradistribution on Y, and shove that through the infrafunction G.
Proposition 4: All continuous infrakernels k:X→□Y induce a function FY→FX via G↦(μ↦G(k∗(μ))
So, given an infrakernel (and functions are a special case of this) going one way, you can transfer infrafunctions backwards from one space to the other.
The restriction to continuous infrakernels, though, actually does matter here. A function f:X→Y induces two infrakernels, one of type X→□Y (the image), and one of type Y→□X (the preimage). So, theoretically we could get a function of type (X→Y)→(FX→FY) by routing through the preimage function. However, since it requires continuity of the function Y→□X to make things work out, you can only reverse the direction if the function mapping a point x to its preimage is Hausdorff-continuous. So, for functions with Hausdorff-continuous inverses, you can flip the usual direction and go (X→Y)→(FX→FY). But this trick doesn’t work in general, only (X→□Y)→(FY→FX) is valid in general.
There’s a bunch of other things you can do, like intersection of infrafunctions making a new infrafunction, and union making a new infrafunction. Really, most of the same stuff as works with infradistributions.
The field is wide-open. But it’s a single framework that can accomodate Scott’s generalized epistemic states, Scott’s geometric averaging, Knightian uncertainty, intersecting and unioning that uncertainty, averaging, quantilizers, dynamic consistency, worst-case reasoning, and approximate maximization. So it seems quite promising for future use.