A concept I’ve played with, coming off of Eliezer’s initial take on the problem of formulating optimization power, is: Suppose something generated N options randomly and then chose the best. Given the observed choice, what is the likelihood function for N?
For continuously distributed utilities, this can be computed directly using beta distributions. Beta(N, 1) is the probability density for the highest of N uniformly distributed unit random numbers. This includes numbers which are cumulative probabilities for a continuous distribution at values drawn from that distribution, and therefore numbers which are cumulative probabilities at the goodness of an observed choice. (N doesn’t have to be an integer, because beta distributions are defined for non-integer parameters.)
(The second step of this construction, where you attach a beta distribution to another distribution’s CDF, I had to work out by myself; it’s not directly mentioned in any discussions of extreme value statistics that I could find. The Mathworld page on order statistics, one step of generalization away, uses the formula for a beta CDF transformed by another CDF, but it still doesn’t refer to beta distributions by name.)
If the utilities are discretely distributed, you have to integrate the beta density over the interval of cumulative probabilities that invert to the observed utility.
To handle choices of mixtures, I guess you could modify this slightly, and ask about the likelihood function for N given the observed outcome, marginalizing over (as User:Kindly also suggests) the (unobserved) choice of option. This requires a distribution over options and a conditional distribution over observations given options. This would also cover situations with composite options where you only observe one of the aspects of the chosen option.
Opposed optimization might be very crudely modeled by increasing the number on the opposite side of the beta distribution from N. Somewhat related to this is Warren Smith’s “Fixed Point for Negamaxing Probability Distributions on Regular Trees”, which examines the distributions of position values that result when two opponents take turns choosing the worst option for each other.
Alternatively, instead of a likelihood function for N, you could have a likelihood function for an exponential weighting λ on the expected utility of the option:
Pr(A was chosen)/Pr(B was chosen) ∝ exp(λ(U(A)-U(B))).
Higher values of λ would be hypotheses in which better options were more strongly randomly selected over worse ones. (This is something like a logit-response model, for which λ (or “β”, or “1/μ”) would be the “rationality” parameter. It might be more familiar as a Gibbs distribution.) But this would fail when the expected utility from the null distribution was heavy-tailed, because then for some λ≠0 the distribution of optimized expected utilities would be impossible to normalize. Better would be for the cumulative probability at the expected utility of the chosen option to be what was exponentially weighted by λ. In that case, in the limit λ = (N-1) >> 1, the two models give the same distribution.
All of these statistics, as well as Eliezer’s original formulation, end up encoding equivalent information in the limit where the utility of each option is an independent sum of many identical light-tailed-distributed components and you’re predicting a marginal distribution of utilities for one of those components. In this limit you can safely convert everything to a statistical mechanics paradigm and back again.
Of course, the real criterion for a good formulation of optimization power is whether it helps people who use it in an argument about things that might be optimizers, or who hear it used in such an argument, to come to truthful conclusions.
In this respect, likelihood functions can have the problem that most people won’t want to use them: they’re hard to compute with, or communicate, unless they belong to a low-dimensional family. The likelihood functions I suggested won’t do that except under very simple conditions. I’m not sure what the best way would be to simplify them to something lower-dimensional. I guess you could just communicate a maximum-likelihood estimate and precision for the optimization power parameter. Or, if you chose a reference prior over optimization power, you could communicate its posterior mean and variance.
All of this presupposes that the problems with the unoptimized probability measure can be dealt with. Maybe it would work better to describe the optimization power of a system in terms of a series of levels of simpler systems leading up to that system, where each level’s new amount of optimization was only characterized approximately, and only relative to something like a distribution of outputs from the previous level. (This would at least patch the problem where, if thermodynamics is somehow involved in the results of an action, that action can count as very powerful relative to the uniform measure over the system’s microstates.) If optimizer B is sort of like choosing the best of N options generated by optimizer A, and optimizer C is sort of like choosing the best of M options generated by optimizer B, that might not have to mean that optimizer C is much like choosing the best of N*M options generated by optimizer A.
A concept I’ve played with, coming off of Eliezer’s initial take on the problem of formulating optimization power, is: Suppose something generated N options randomly and then chose the best. Given the observed choice, what is the likelihood function for N?
For continuously distributed utilities, this can be computed directly using beta distributions. Beta(N, 1) is the probability density for the highest of N uniformly distributed unit random numbers. This includes numbers which are cumulative probabilities for a continuous distribution at values drawn from that distribution, and therefore numbers which are cumulative probabilities at the goodness of an observed choice. (N doesn’t have to be an integer, because beta distributions are defined for non-integer parameters.)
(The second step of this construction, where you attach a beta distribution to another distribution’s CDF, I had to work out by myself; it’s not directly mentioned in any discussions of extreme value statistics that I could find. The Mathworld page on order statistics, one step of generalization away, uses the formula for a beta CDF transformed by another CDF, but it still doesn’t refer to beta distributions by name.)
If the utilities are discretely distributed, you have to integrate the beta density over the interval of cumulative probabilities that invert to the observed utility.
To handle choices of mixtures, I guess you could modify this slightly, and ask about the likelihood function for N given the observed outcome, marginalizing over (as User:Kindly also suggests) the (unobserved) choice of option. This requires a distribution over options and a conditional distribution over observations given options. This would also cover situations with composite options where you only observe one of the aspects of the chosen option.
Opposed optimization might be very crudely modeled by increasing the number on the opposite side of the beta distribution from N. Somewhat related to this is Warren Smith’s “Fixed Point for Negamaxing Probability Distributions on Regular Trees”, which examines the distributions of position values that result when two opponents take turns choosing the worst option for each other.
Alternatively, instead of a likelihood function for N, you could have a likelihood function for an exponential weighting λ on the expected utility of the option:
Pr(A was chosen)/Pr(B was chosen) ∝ exp(λ(U(A)-U(B))).
Higher values of λ would be hypotheses in which better options were more strongly randomly selected over worse ones. (This is something like a logit-response model, for which λ (or “β”, or “1/μ”) would be the “rationality” parameter. It might be more familiar as a Gibbs distribution.) But this would fail when the expected utility from the null distribution was heavy-tailed, because then for some λ≠0 the distribution of optimized expected utilities would be impossible to normalize. Better would be for the cumulative probability at the expected utility of the chosen option to be what was exponentially weighted by λ. In that case, in the limit λ = (N-1) >> 1, the two models give the same distribution.
All of these statistics, as well as Eliezer’s original formulation, end up encoding equivalent information in the limit where the utility of each option is an independent sum of many identical light-tailed-distributed components and you’re predicting a marginal distribution of utilities for one of those components. In this limit you can safely convert everything to a statistical mechanics paradigm and back again.
Of course, the real criterion for a good formulation of optimization power is whether it helps people who use it in an argument about things that might be optimizers, or who hear it used in such an argument, to come to truthful conclusions.
In this respect, likelihood functions can have the problem that most people won’t want to use them: they’re hard to compute with, or communicate, unless they belong to a low-dimensional family. The likelihood functions I suggested won’t do that except under very simple conditions. I’m not sure what the best way would be to simplify them to something lower-dimensional. I guess you could just communicate a maximum-likelihood estimate and precision for the optimization power parameter. Or, if you chose a reference prior over optimization power, you could communicate its posterior mean and variance.
All of this presupposes that the problems with the unoptimized probability measure can be dealt with. Maybe it would work better to describe the optimization power of a system in terms of a series of levels of simpler systems leading up to that system, where each level’s new amount of optimization was only characterized approximately, and only relative to something like a distribution of outputs from the previous level. (This would at least patch the problem where, if thermodynamics is somehow involved in the results of an action, that action can count as very powerful relative to the uniform measure over the system’s microstates.) If optimizer B is sort of like choosing the best of N options generated by optimizer A, and optimizer C is sort of like choosing the best of M options generated by optimizer B, that might not have to mean that optimizer C is much like choosing the best of N*M options generated by optimizer A.