We would like a mathematical theory which characterises the intuitive notion of ‘optimisation’.
Before Shannon introduced his mathematical theory of communication, the concept of ‘information’ was vague and informal. Then Shannon devised several standardised measures of it, with useful properties and clear operational interpretations. It turned out the concept of information is universal, in that it can be quantified on a consistent scale (bits) across various contexts.
Could something similar be true for ‘optimisation’?
In this post we review a few proposed ways to measure optimisation power and play around with them a bit.
Our general setup will be that we’re choosing actions from a set A to achieve an outcome in a set Ω. We have some beliefs about how our actions will affect outcomes in the form of a probability distribution pa∈ΔΩ for each a∈A. We also have some preferences, either in the form of a preference ordering over outcomes, or a utility function over outcomes. We will also assume there is some default distribution p∈ΔΩ, which we can interpret either as our beliefs about Ω if we don’t act, or if we take some default action[1].
Yudkowsky
Yudkowsky’s proposed definition just makes use of a preference ordering ⪰ over Ω. To measure the optimisation power of some outcome x, we count up all the outcomes which are at least as good as x, and divide that number by the total number of possible outcomes. It’s nice to take a negative log to turn this fraction into a number of bits: OP(x)=−log|{x′∈Ω∣x′⪰x}||Ω|.
If I achieve the second best outcome out of eight, that’s −log28=2 bits of optimisation power.
If the outcome space is infinite, then we can’t count the number of outcomes at least as good as the one we got, so we need a measure to integrate over. If we make use of our default probability distribution here, the resulting quantity has a nice interpretation.∫x′⪰xp(x′)dx′∫p(x′)dx′ is just ∫x′⪰xp(x′)dx′: the default probability of doing as well as we did. Since we’re always assuming we’ve got a default distribution, we might as well define OP like this even in the finite-domain case. Again we’ll take a log to get OP(x)=−log∫x′⪰xp(x′)dx′. Now 2 bits of optimisation power means the default probability of doing this well was 14.
So far we’ve just been thinking about the optimisation power of achieving a specific outcome. We can define the optimisation power of an action as the expected optimisation power under the distribution it induces over outcomes: OP(a)=∫pa(x)OP(x)dx.
The above definitions just make use of a preference ordering. If we do have a utility function u:Ω→R then we’d like our definitions to make use of that too. Intuitively, achieving the second best outcome out of three should constitute more optimisation power in a case where it’s almost as good as the first and much better than the third, compared to a case where it’s only slightly better than the third and much less good than the first[2].
Analogously to how we previously asked ‘what fraction of the default probability mass is on outcomes at least as good as this one?’ we could try to ask ‘what fraction of the default expected utility comes from outcomes at least as good as this one?’.
But making use of utility functions in the above definition is tricky. Recall that utility functions originate from the Von Neumann-Morgenstern theorem, which says that if an agent choosing between probabilistic mixtures of options satisfies some weak rationality criteria then it acts as if it maximises expected utility according to a utility function u:Ω→R. The utility function produced by the VNM-theorem is only defined up to positive affine transformations, meaning that the utility function u′=au+b, for any a∈R>0 and b∈R, equally well represents the tradeoffs between outcomes the agent is willing to take. Another way of looking at this is that only ratios of differences in utility are real. Therefore any definition of OP which changes when you multiply your utilities by a positive scalar or add a constant is very questionable. Alex Altair ran into this problem when trying to define OP in terms of the rate of change of utility.
One way we considered to get around this is to measure the utility of every outcome relative to the worst one. Let xw=argminx∈Ωu(x) and define OP(x)=−log∫x′⪰xp(x′)(u(x′)−u(xw))dx′∫p(x′)(u(x′)−u(xw))dx′.
In words, this asks ‘what fraction of the default extra expected utility on top of the minimum comes from outcomes at least this good?’.
This is invariant to translating and rescaling utility. Unfortunately it has a problem of its own—it’s sensitive to our choice of xw. By adding some made up element to Ω with large negative utility and zero probability of occurring, we can make OP arbitrarily low. In that case basically all of the default relative expected utility comes from avoiding the worst outcome, which is guaranteed, so you don’t get any credit for optimising.
Wentworth
An alternative approach to measuring optimisation in bits comes from John’s Utility Maximisation = Description Length Minimisation. In the language of our setup, John shows that you can take any utility function u:Ω→R[3], and map it to a probability distribution qu∈ΔΩ, such that maximising expected utility with respect to u is equivalent to minimising cross-entropy with respect to qu.
At first we weren’t sure of the precise sense of equivalence John meant, but one way to state it is that the interval scale over actions a∈A which you obtain by considering Epau(x) is precisely the same as the interval scale you get by considering −H(pa,qu). This will become very explicit with a bit of rejigging.
John defines qu by taking a softmax: qu(x)=eu(x)∑x′eu(x′). But we’ll keep working in base 2 and write qu(x)=2u(x)∑x′2u(x′) instead[4].
So we get that −H(pa,qu)=−−Epalogqu(x)=Epalog2u(x)∑x′2u(x′)=Epalog2u(x)−Epalog∑x′2u(x′)=Epau(x)−log∑x′2u(x′)
The term on the right doesn’t depend on pa or x, so what this means is that −H(pa,qu)=Epau(x)+Cu for some constant Cu. In other words, this switch from maximising expected utility to minimising cross-entropy is just like translating your utility function by a constant and carrying on as you were.
Since utility functions are only defined up to translation anyway, this means the sense of equivalence is as strong as can be, but it does dispel the impression that this is a useful way to measure optimisation power in bits. The number of extra ‘bits’ of optimisation power that some action a1 has compared to a2 is precisely the number of extra expected utilons—and that means that scaling your utility function scales your optimisation power by the same amount.
Takeaways
We would like a measure of the amount of optimisation taking place which makes use of utility functions, which is invariant under postive affine transformations of those utility functions, and which isn’t too sensitive to strange things like the utility of the worst outcome. The search continues.
As an aside: notice that multiplying the utility function u by a positive constant α will change the resulting softmax distributions qαu(x)=qu(x)α to be more sharply peaked at the maxima.
Towards Measures of Optimisation
We would like a mathematical theory which characterises the intuitive notion of ‘optimisation’.
Before Shannon introduced his mathematical theory of communication, the concept of ‘information’ was vague and informal. Then Shannon devised several standardised measures of it, with useful properties and clear operational interpretations. It turned out the concept of information is universal, in that it can be quantified on a consistent scale (bits) across various contexts.
Could something similar be true for ‘optimisation’?
In this post we review a few proposed ways to measure optimisation power and play around with them a bit.
Our general setup will be that we’re choosing actions from a set A to achieve an outcome in a set Ω. We have some beliefs about how our actions will affect outcomes in the form of a probability distribution pa∈ΔΩ for each a∈A. We also have some preferences, either in the form of a preference ordering over outcomes, or a utility function over outcomes. We will also assume there is some default distribution p∈ΔΩ, which we can interpret either as our beliefs about Ω if we don’t act, or if we take some default action[1].
Yudkowsky
Yudkowsky’s proposed definition just makes use of a preference ordering ⪰ over Ω. To measure the optimisation power of some outcome x, we count up all the outcomes which are at least as good as x, and divide that number by the total number of possible outcomes. It’s nice to take a negative log to turn this fraction into a number of bits: OP(x)=−log|{x′∈Ω∣x′⪰x}||Ω|. If I achieve the second best outcome out of eight, that’s −log28=2 bits of optimisation power.
If the outcome space is infinite, then we can’t count the number of outcomes at least as good as the one we got, so we need a measure to integrate over. If we make use of our default probability distribution here, the resulting quantity has a nice interpretation.∫x′⪰xp(x′)dx′∫p(x′)dx′ is just ∫x′⪰xp(x′)dx′: the default probability of doing as well as we did. Since we’re always assuming we’ve got a default distribution, we might as well define OP like this even in the finite-domain case. Again we’ll take a log to get OP(x)=−log∫x′⪰xp(x′)dx′. Now 2 bits of optimisation power means the default probability of doing this well was 14.
So far we’ve just been thinking about the optimisation power of achieving a specific outcome. We can define the optimisation power of an action as the expected optimisation power under the distribution it induces over outcomes: OP(a)=∫pa(x)OP(x)dx. The above definitions just make use of a preference ordering. If we do have a utility function u:Ω→R then we’d like our definitions to make use of that too. Intuitively, achieving the second best outcome out of three should constitute more optimisation power in a case where it’s almost as good as the first and much better than the third, compared to a case where it’s only slightly better than the third and much less good than the first[2].
Analogously to how we previously asked ‘what fraction of the default probability mass is on outcomes at least as good as this one?’ we could try to ask ‘what fraction of the default expected utility comes from outcomes at least as good as this one?’.
But making use of utility functions in the above definition is tricky. Recall that utility functions originate from the Von Neumann-Morgenstern theorem, which says that if an agent choosing between probabilistic mixtures of options satisfies some weak rationality criteria then it acts as if it maximises expected utility according to a utility function u:Ω→R. The utility function produced by the VNM-theorem is only defined up to positive affine transformations, meaning that the utility function u′=au+b, for any a∈R>0 and b∈R, equally well represents the tradeoffs between outcomes the agent is willing to take. Another way of looking at this is that only ratios of differences in utility are real. Therefore any definition of OP which changes when you multiply your utilities by a positive scalar or add a constant is very questionable. Alex Altair ran into this problem when trying to define OP in terms of the rate of change of utility.
One way we considered to get around this is to measure the utility of every outcome relative to the worst one. Let xw=argminx∈Ωu(x) and define OP(x)=−log∫x′⪰xp(x′)(u(x′)−u(xw))dx′∫p(x′)(u(x′)−u(xw))dx′.
In words, this asks ‘what fraction of the default extra expected utility on top of the minimum comes from outcomes at least this good?’.
This is invariant to translating and rescaling utility. Unfortunately it has a problem of its own—it’s sensitive to our choice of xw. By adding some made up element to Ω with large negative utility and zero probability of occurring, we can make OP arbitrarily low. In that case basically all of the default relative expected utility comes from avoiding the worst outcome, which is guaranteed, so you don’t get any credit for optimising.
Wentworth
An alternative approach to measuring optimisation in bits comes from John’s Utility Maximisation = Description Length Minimisation. In the language of our setup, John shows that you can take any utility function u:Ω→R[3], and map it to a probability distribution qu∈ΔΩ, such that maximising expected utility with respect to u is equivalent to minimising cross-entropy with respect to qu.
At first we weren’t sure of the precise sense of equivalence John meant, but one way to state it is that the interval scale over actions a∈A which you obtain by considering Epau(x) is precisely the same as the interval scale you get by considering −H(pa,qu). This will become very explicit with a bit of rejigging.
John defines qu by taking a softmax: qu(x)=eu(x)∑x′eu(x′). But we’ll keep working in base 2 and write qu(x)=2u(x)∑x′2u(x′) instead[4].
So we get that −H(pa,qu)=−−Epalogqu(x)=Epalog2u(x)∑x′2u(x′)=Epalog2u(x)−Epalog∑x′2u(x′)=Epau(x)−log∑x′2u(x′)
The term on the right doesn’t depend on pa or x, so what this means is that −H(pa,qu)=Epau(x)+Cu for some constant Cu. In other words, this switch from maximising expected utility to minimising cross-entropy is just like translating your utility function by a constant and carrying on as you were.
Since utility functions are only defined up to translation anyway, this means the sense of equivalence is as strong as can be, but it does dispel the impression that this is a useful way to measure optimisation power in bits. The number of extra ‘bits’ of optimisation power that some action a1 has compared to a2 is precisely the number of extra expected utilons—and that means that scaling your utility function scales your optimisation power by the same amount.
Takeaways
We would like a measure of the amount of optimisation taking place which makes use of utility functions, which is invariant under postive affine transformations of those utility functions, and which isn’t too sensitive to strange things like the utility of the worst outcome. The search continues.
Or even before we act.
See this Stuart Armstrong post for related discussion.
Where is finite.
As an aside: notice that multiplying the utility function u by a positive constant α will change the resulting softmax distributions qαu(x)=qu(x)α to be more sharply peaked at the maxima.