Some Problems with the Ordinal Optimization Frame
A result of work done during AI Safety Camp Virtual 2024. Thanks[1] to Alex Altair, Alfred Harwood, Amaury Lorin, Jasmina Nasufi, Tyler Tracy, and Einar Urdshals for related conversations and/or feedback on the post.
TL;DR: The ordinal definition of optimization (i.e. pushing the state of the world up some preference ordering without a [utility/reward/value]-like measure on the state) appears to be quite limiting, casting some doubt on its epistemic value/usefulness.
1. Introduction and definitions
Given a set of world states W, a total ordering Θ over them, and an update function that maps a previous world state to the next one (update:W→W), optimization can be defined as the process of world states being reliably sent to world states that are higher in the ordering Θ. We then say that the world states are getting increasingly optimized, and we call Θ a “preference ordering”. If wt+1=update(wt), we say that wt+1 is “weakly better” than wt if wt+1 is weakly Θ-greater than wt (written wt≾Θwt+1). If Θ assigns the ordinal ω (e.g. an integer, or any other value of some totally ordered set) to w∈W, we say that ω is the Θ-degree of w.
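To make these definitions concrete, here is a minimal sketch (the function and variable names are mine, not notation from the post): the ordering Θ enters only through a boolean pairwise comparison, and “optimization” is checked as the trajectory moving weakly up that ordering.

```python
from typing import Callable, TypeVar

W = TypeVar("W")  # a world state


def is_weakly_improving(
    w0: W,
    update: Callable[[W], W],                # update : W -> W
    weakly_prefers: Callable[[W, W], bool],  # weakly_prefers(a, b) is True iff a ≾_Θ b
    steps: int,
) -> bool:
    """Check that every transition w_t -> w_{t+1} is weakly Θ-improving."""
    w = w0
    for _ in range(steps):
        w_next = update(w)
        if not weakly_prefers(w, w_next):
            return False
        w = w_next
    return True


# Toy check: states are integers, Θ orders them by value, update increments.
assert is_weakly_improving(0, lambda w: w + 1, lambda a, b: a <= b, steps=10)
```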
To talk about optimization in this sense, we don’t need to assume that the world is divided into the “agent”/”optimizer”/”policy” and the “environment”. For example, given a container with gas as our W, we can define Θ over W, according to the entropy of the gas, such that higher entropy states are “better” (Θ-greater). If the container starts in a very ordered state (e.g. molecules of one type taking up one half of the container and molecules of the other type taking up the other half), then the gas in the container will transition to a more Θ-optimal state, even though we can’t assign responsibility for optimization to any part of the system. Similarly, evolution can be viewed as an optimization process, even though there is (arguably) no distinguishable part of the biosphere or evolutionary environment doing the optimization.[2]
Nevertheless, we often do want to separate W into the “part doing the optimizing” (agent/optimizer/policy) and the “part being optimized” (environment). (Alex Flint calls this property of separability of an optimizing system into the optimizer and the optimized “duality”.) One reason to do this is that, given a fixed Θ over an environment E and a set of agents A to pair with E, we can talk about an agent that maximizes the state of the environment according to Θ[3] when run as a dynamical system, either for some pre-specified number of timesteps or until a state has been reached that belongs to some pre-specified set of final states (e.g. states that are at least as Θ-good as some w′∈W).
Moreover, we can fix some reference agent Aref to serve as the “measuring stick” of optimization. It can (but doesn’t have to) be a null/reference agent that always outputs what we can interpret as a “null action”, not interfering with the environment, letting it “run its default course”. In any case, we can measure the optimization power/potential of the other agents by assessing how much better or worse they perform at optimizing E than Aref.
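A hedged sketch of this “measuring stick” idea, under the assumption that the only thing Θ lets us do is compare environment states pairwise: we can ask whether an agent ends up in a Θ-better environment state than the reference agent, but not by how much. All names here are mine.

```python
from typing import Callable, Tuple, TypeVar

A = TypeVar("A")  # agent's internal state
E = TypeVar("E")  # environment state


def beats_reference(
    step_agent: Callable[[A, E], Tuple[A, E]],  # update for the candidate agent
    step_ref: Callable[[A, E], Tuple[A, E]],    # update for the reference agent A_ref
    a0: A, a0_ref: A, e0: E,
    weakly_prefers: Callable[[E, E], bool],     # weakly_prefers(e, e2) iff e ≾_Θ e2
    horizon: int,
) -> bool:
    """Ordinal comparison: does the candidate agent reach a weakly Θ-better
    final environment state than the reference agent does?"""
    a, e = a0, e0
    a_ref, e_ref = a0_ref, e0
    for _ in range(horizon):
        a, e = step_agent(a, e)
        a_ref, e_ref = step_ref(a_ref, e_ref)
    # The ordering only tells us *whether* the candidate did better, not by how much.
    return weakly_prefers(e_ref, e)
```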
2. Deterministic optimization
In a deterministic setting, the transitions between the states are determined by a deterministic function update:W→W. Given dualism, we can decompose it into the agent part and the environment part, W≃A×E, where A is the set of “internal” states of a specific agent and E is the set of possible environment states. Since we can pair the environment with different agents that differ in the policies they implement, this makes the type signature dependent on the agent A: updateA:(A×E)→(A×E). From the set of agents A, we can pick one that reliably leads to the maximization of Θ defined over E.
However, there is a catch: it is not clear what “maximization of a preference ordering” means. We can choose a specific time horizon tend and define the maximization of Θ as making the Θ-degree of the environment as high as possible at the timestep tend. The best optimizer from our set A is then defined as follows (allowing for some abuse of notation):
$$A^* := \arg\max_{A \in \mathcal{A}} \Theta\big(\mathrm{update}_A^{t_{\mathrm{end}}}(a_0, e_0)\big)$$

Here, a0∈A is the initial state of an agent A and e0∈E is the initial state of the environment.
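A sketch of this argmax for a finite set of candidate agents, again using only pairwise Θ-comparisons rather than numeric scores; the names are mine, and the horizon shows up explicitly as a parameter.

```python
from typing import Callable, Iterable, Tuple, TypeVar

A = TypeVar("A")  # agent's internal state
E = TypeVar("E")  # environment state
Agent = Callable[[A, E], Tuple[A, E]]  # update_A : (A x E) -> (A x E)


def best_agent_at_horizon(
    agents: Iterable[Tuple[Agent, A]],       # (update_A, a0) pairs
    e0: E,
    weakly_prefers: Callable[[E, E], bool],  # weakly_prefers(e, e2) iff e ≾_Θ e2
    t_end: int,
):
    """Pick A* = argmax over agents of Θ(environment state after t_end steps),
    using only pairwise Θ-comparisons."""
    best = None          # (update_A, a0) of the current best agent
    best_final_e = None  # its final environment state
    for update_A, a0 in agents:
        a, e = a0, e0
        for _ in range(t_end):
            a, e = update_A(a, e)
        if best is None or weakly_prefers(best_final_e, e):
            best, best_final_e = (update_A, a0), e
    # Note: t_end is a free parameter; a different horizon can pick a different agent.
    return best, best_final_e
```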
However, tend is a free parameter, a choice that needs to be made. Is there a natural way to choose tend, at least for some restricted class of dynamical systems that can be reasonably viewed as optimization problems?
If we have an episode with a finite number of timesteps T, then we can measure the degree of optimization at the final timestep tend=T. If the number of timesteps is infinite, we need to aggregate the ordinal preference values of the states across the trajectory[4] of updates of A paired with E, {Θ(update_A^t(a0,e0)) | t∈N}. If the trajectory is somewhat/approximately convergent, then we measure the degree of optimization at the point where the trajectory passes our “test of convergence” (whatever it is). However, this test of convergence is another free parameter that needs to be chosen.
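One possible “test of convergence”, sketched with an arbitrary choice of my own (stop once the state has not strictly Θ-improved for a while). The point is that `patience` and `max_steps` are exactly the kind of free parameters the paragraph above is worried about.

```python
from typing import Callable, TypeVar

W = TypeVar("W")


def degree_at_convergence(
    w0: W,
    update: Callable[[W], W],
    weakly_prefers: Callable[[W, W], bool],  # weakly_prefers(a, b) iff a ≾_Θ b
    patience: int,
    max_steps: int,
) -> W:
    """Run the trajectory until the state has not strictly Θ-improved for
    `patience` consecutive steps (an ad-hoc 'test of convergence'), then
    return the state reached. Both `patience` and `max_steps` are free parameters."""
    w = w0
    steps_without_improvement = 0
    for _ in range(max_steps):
        w_next = update(w)
        strictly_better = weakly_prefers(w, w_next) and not weakly_prefers(w_next, w)
        steps_without_improvement = 0 if strictly_better else steps_without_improvement + 1
        w = w_next
        if steps_without_improvement >= patience:
            break
    return w
```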
What other options are there? We can’t just calculate the average optimization degree of the trajectory because Θ doesn’t assign real/numeric values to the states. We could do something simple like map the Θ-degrees of the environment states to natural numbers in order of increasingly preferred states: the least preferred state gets 0, the second least preferred state gets 1, and so on, up to the number of possible states (if the number of states is finite) or countable[5] infinity (if there are countably infinitely many states). However, this feels like an unjustified hack, similar to choosing a discrete uniform distribution over a set to represent Knightian uncertainty over that set.
The only operations available to us for aggregating the Θ-degrees across the timesteps of a trajectory are those that choose some quantile of them, according to Θ, such as min, max, median, and so on. min is not very satisfying because then the best agent is the one that minimizes the “degradation” of the state, according to Θ. In any case, which quantile to use is yet another free parameter to set.[6]
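A sketch of quantile aggregation using only the ordering: we sort the visited states by pairwise Θ-comparison (via a comparator, since there are no numeric scores) and pick the q-th quantile; q is the free parameter. Names are mine.

```python
from functools import cmp_to_key
from typing import Callable, List, TypeVar

W = TypeVar("W")


def quantile_state(
    trajectory: List[W],
    weakly_prefers: Callable[[W, W], bool],  # weakly_prefers(a, b) iff a ≾_Θ b
    q: float,                                # q = 0.0 -> min, 0.5 -> median, 1.0 -> max
) -> W:
    """Aggregate a trajectory by picking a quantile of its states under Θ.
    Only pairwise Θ-comparisons are used; no numeric values are assigned."""
    def cmp(a: W, b: W) -> int:
        if weakly_prefers(a, b) and weakly_prefers(b, a):
            return 0
        return -1 if weakly_prefers(a, b) else 1

    ordered = sorted(trajectory, key=cmp_to_key(cmp))  # least preferred first
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


# min, median, and max of a toy trajectory of numeric states ordered by <=:
traj = [3, 1, 4, 1, 5]
leq = lambda a, b: a <= b
print(quantile_state(traj, leq, 0.0), quantile_state(traj, leq, 0.5), quantile_state(traj, leq, 1.0))
```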
One way out of this puzzle would be to use the measures of optimization proposed by Alex Altair[7] or something similar. The most basic one is absolute optimization, defined as “the number of bits needed to describe how far up the ordering the state is”. In general, it uses a probability distribution over states to find their optimal encoding (i.e. the one that minimizes the average/expected number of bits that need to be used to describe a state). In the deterministic setting, we don’t have probabilities,[8] but we can still use absolute optimization. Given n=|E| possible states of the environment, we use log2(n) bits to describe each state.[9]
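A sketch of absolute optimization in the deterministic, finite case. I am assuming (my reading, not spelled out above) that “no probabilities” amounts to using a uniform reference distribution over the n states, so that Ω_abs(e) = −log2(#{e′ : e ≾_Θ e′}/n) and the Θ-best state gets the full log2(n) bits.

```python
import math
from typing import Callable, Sequence, TypeVar

E = TypeVar("E")


def absolute_optimization_uniform(
    e: E,
    all_states: Sequence[E],
    weakly_prefers: Callable[[E, E], bool],  # weakly_prefers(a, b) iff a ≾_Θ b
) -> float:
    """Absolute optimization of state e in bits, assuming a *uniform*
    reference distribution over the n = |E| states."""
    n = len(all_states)
    at_least_as_good = sum(1 for e2 in all_states if weakly_prefers(e, e2))
    return -math.log2(at_least_as_good / n)


# With 8 equally likely states ordered 0 < 1 < ... < 7, the Θ-best state
# gets log2(8) = 3 bits and the Θ-worst state gets 0 bits:
states = list(range(8))
print(absolute_optimization_uniform(7, states, lambda a, b: a <= b))  # 3.0
print(absolute_optimization_uniform(0, states, lambda a, b: a <= b))  # 0.0
```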
Another intuitive solution might be to assess whether the state reliably keeps getting better over time. Again, it is not clear what “reliably keeps getting better” means. We could formalize this as “every next state is no worse than the previous one” (∀t, wt≾Θwt+1). The problem is that this would exclude agents that let the state “degrade a bit” but compensate by improving it over the baseline in the future, which arguably covers many situations that we would like to be able to view in terms of optimization. Alternatively, we could chunk the trajectory into non-overlapping episodes and focus on the degrees of those episodes. This leaves us with two problems/free parameters: how to chunk up the trajectory into episodes and how to aggregate the Θ-degrees of the states within an episode into the Θ′-degree of the episode, so we’re back to square one.
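A sketch of the episode-chunking idea; both free parameters show up explicitly as arguments (the episode length and the within-episode aggregator), which is the point being made above. Names are mine.

```python
from typing import Callable, List, TypeVar

W = TypeVar("W")


def episodes_weakly_improving(
    trajectory: List[W],
    weakly_prefers: Callable[[W, W], bool],  # weakly_prefers(a, b) iff a ≾_Θ b
    episode_length: int,                     # free parameter 1: how to chunk
    aggregate: Callable[[List[W], Callable[[W, W], bool]], W],  # free parameter 2: Θ'-degree of an episode
) -> bool:
    """Chunk the trajectory into non-overlapping episodes, aggregate each
    episode into a single representative state, and check that the episode
    representatives are weakly Θ-improving."""
    chunks = [trajectory[i:i + episode_length]
              for i in range(0, len(trajectory), episode_length)]
    reps = [aggregate(chunk, weakly_prefers) for chunk in chunks]
    return all(weakly_prefers(a, b) for a, b in zip(reps, reps[1:]))


# Example aggregator: the Θ-worst state of each episode (toy: states are numbers ordered by <=).
worst_of = lambda chunk, leq: min(chunk)
print(episodes_weakly_improving([3, 1, 4, 2, 6, 5], lambda a, b: a <= b, 2, worst_of))  # True
```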
In summary, it is not obvious how to compare agents tasked with optimizing a deterministic environment. There are many ways to measure their optimization performance, and it is not obvious that any one of them is the most preferred/natural.
3. Stochastic optimization
Things get (even) more problematic with stochasticity. Now the update function outputs a probability distribution that the next world state is sampled from, update:ΔW→ΔW (or, assuming dualism, updateA:Δ(A×E)→Δ(A×E)).[10] Instead of a specific initial state, we start with a distribution over the initial state (in the simplest case, the Dirac delta function δ that assigns 1 to one state and 0 to every other state). It may still be the case that there is an optimal A∗∈A that causes (perhaps approximately) deterministic updates that Θ-optimize the environment (leaving us with the problems outlined in the previous section). However, in full generality, we will need to be comparing agents like the following:
A1 achieves the most preferred state with a probability of 0.5 and the third most preferred state with a probability of 0.5.
A2 achieves the second most preferred state with a probability of 0.99 and the fourth most preferred state with a probability of 0.01.
There is no way to compare the expected optimization of the world by A1 versus A2, given just a preference ordering. We need a value function over the states that assigns real values to them.[11]
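A minimal numerical illustration (all numbers made up): two value functions that are consistent with the same preference ordering can disagree about whether A1 or A2 is the better optimizer in expectation, so the ordering alone cannot settle the comparison.

```python
# Outcome ranks: 1 = most preferred, 2 = second most preferred, and so on.
A1 = {1: 0.5, 3: 0.5}    # A1's distribution over outcome ranks
A2 = {2: 0.99, 4: 0.01}  # A2's distribution over outcome ranks


def expected_value(lottery, value):
    """Expected value of a lottery under a value function on ranks."""
    return sum(p * value[rank] for rank, p in lottery.items())


# Two value functions that agree with the ordering rank 1 > 2 > 3 > 4
# but disagree on the spacing between Θ-neighboring states:
v_even   = {1: 4.0, 2: 3.0, 3: 2.0, 4: 1.0}
v_uneven = {1: 4.0, 2: 3.9, 3: 0.1, 4: 0.0}

print(expected_value(A1, v_even),   expected_value(A2, v_even))    # 3.00 vs 2.98 -> A1 wins
print(expected_value(A1, v_uneven), expected_value(A2, v_uneven))  # 2.05 vs ~3.86 -> A2 wins
```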
We could convert Θ into a value function by using the methods proposed by Alex Altair. Given a reference probability distribution P over the states, we can measure the (absolute) optimization of an environment state e as
$$\Omega_{\mathrm{abs}}(e) = -\log p_e$$

(in bits) where p_e is the cumulative probability of all the states at least as preferred as e, i.e.
$$p_e = \sum_{e \preceq_\Theta e'} P(e')$$

Such a reference probability distribution could, for example, be obtained using the reference agent Aref, as outlined in Section 1. Pairing Aref with the environment gives us state transition probabilities, rather than probabilities over the states. Given a fixed distribution over the initial state, we can use these transition probabilities to compute the reference probability distribution over states for each timestep. This would give us a time-parametrized family of probability distributions, P_t ∈ ΔE for each t, which we could use to derive p^t_e for each timestep t and environment state e.
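A sketch of this construction with made-up numbers: a transition matrix standing in for the environment paired with Aref, the induced time-parametrized distributions P_t, and Ω_abs computed against them. I index states in increasing Θ-order purely for convenience.

```python
import math
import numpy as np

# Hypothetical 4-state environment, states indexed 0..3 in order of
# increasing Θ-preference (3 is the Θ-best state).
# Assumed reference dynamics: row-stochastic transition matrix of the
# environment when paired with the reference agent A_ref.
T_ref = np.array([
    [0.7, 0.3, 0.0, 0.0],
    [0.2, 0.6, 0.2, 0.0],
    [0.0, 0.3, 0.5, 0.2],
    [0.0, 0.0, 0.4, 0.6],
])


def reference_distributions(p0: np.ndarray, T: np.ndarray, horizon: int):
    """P_t for t = 0..horizon: distribution over states under A_ref."""
    dists = [p0]
    for _ in range(horizon):
        dists.append(dists[-1] @ T)
    return dists


def omega_abs(state: int, P: np.ndarray) -> float:
    """Absolute optimization in bits: -log2 of the cumulative probability
    of all states at least as Θ-preferred as `state`."""
    p_e = P[state:].sum()  # states are indexed in increasing Θ-order
    return -math.log2(p_e)


p0 = np.array([1.0, 0.0, 0.0, 0.0])  # start in the Θ-worst state
P_t = reference_distributions(p0, T_ref, horizon=10)
print(omega_abs(3, P_t[10]))  # bits of optimization for reaching the Θ-best state at t = 10
```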
Having the optimization measure Ωabs doesn’t eliminate the problem of measuring optimization, but it makes the problem better behaved. We can use some variant of discounting (e.g. exponential or hyperbolic) and/or choose a time horizon tend such that we sum the values obtained up to that tend (again, with some abuse of notation):
$$A^* := \arg\max_{A \in \mathcal{A}} \left[ \sum_{t=0}^{\infty} \gamma^t \cdot \left( \sum_{e \in E} P(e_t = e \mid e_0, A) \cdot \Omega_{\mathrm{abs}}(e) \right) \right]$$

Note, however, that this turns the optimization problem into a reward-maximization problem.[12] The best agent (from a given class of agents under consideration) is the one that maximizes the sum of expected rewards, either an infinite discounted sum or a non-discounted sum over some finite horizon.
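A sketch of the resulting reward-maximization problem with illustrative numbers, using a single fixed reference distribution for Ω_abs (the time-parametrized P_t from above would slot in the same way) and truncating the infinite discounted sum at a finite horizon.

```python
import math
import numpy as np


def omega_abs_from(P_ref: np.ndarray, state: int) -> float:
    """Absolute optimization in bits w.r.t. a reference distribution P_ref,
    with states indexed in increasing Θ-order."""
    return -math.log2(P_ref[state:].sum())


def discounted_optimization(T_agent: np.ndarray, p0: np.ndarray,
                            P_ref: np.ndarray, gamma: float, horizon: int) -> float:
    """Approximate sum_t gamma^t * E[Omega_abs(e_t)] under agent dynamics T_agent,
    truncating the infinite sum at `horizon` (a practical free parameter)."""
    total, p_t = 0.0, p0
    for t in range(horizon):
        total += gamma ** t * sum(p_t[e] * omega_abs_from(P_ref, e) for e in range(len(p0)))
        p_t = p_t @ T_agent
    return total


# Hypothetical setup: a fixed reference distribution and two candidate agents'
# environment transition matrices (all numbers are illustrative).
P_ref = np.array([0.4, 0.3, 0.2, 0.1])
p0 = np.array([1.0, 0.0, 0.0, 0.0])
T_A1 = np.array([[0.1, 0.9, 0.0, 0.0],
                 [0.0, 0.1, 0.9, 0.0],
                 [0.0, 0.0, 0.1, 0.9],
                 [0.0, 0.0, 0.0, 1.0]])   # pushes the environment up the ordering
T_A2 = np.eye(4)                          # leaves the environment in its initial state

scores = {name: discounted_optimization(T, p0, P_ref, gamma=0.9, horizon=50)
          for name, T in [("A1", T_A1), ("A2", T_A2)]}
print(max(scores, key=scores.get), scores)  # A* is the score-maximizing agent
```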
In his optimization toolbox draft, Alex suggests a different way to measure optimization power, using what he calls “counterfactual optimization”. However, this method is equivalent to fixing some specified future timestep and maximizing the expected value achieved at that timestep. (Here I’m using the expected-value version of counterfactual optimization and discarding the terms that don’t depend on the choice of agent A.)
$$A^* := \arg\max_{A \in \mathcal{A}} \mathbb{E}\big[\Omega_{\mathrm{cf}}(e_0, t \mid A)\big] = \arg\max_{A \in \mathcal{A}} \mathbb{E}\big[\Omega_{\mathrm{abs}}\big(\mathrm{update}_A^t(a_0, e_0)\big) - \Omega_{\mathrm{abs}}(e_0) - \Omega_{\mathrm{avg}}(E, t)\big] \propto \arg\max_{A \in \mathcal{A}} \mathbb{E}\big[\Omega_{\mathrm{abs}}\big(\mathrm{update}_A^t(a_0, e_0)\big)\big]$$

4. Conclusion and takeaways
If assessing agents’ optimization power forces us to use formalisms equivalent to reward or value maximization, this casts some doubt on the usefulness of the concept of ordinal optimization. Given that most interesting/real-world environments are (practically, relative to our epistemic access) stochastic, ordinal optimization will need to be cast into the realm of R anyway to measure optimization.
In hindsight, perhaps it is not that surprising. Intuitively, it “doesn’t feel natural” to say that what distinguishes the degree of optimization of two states, according to some “measure of optimization”, is “just” their placement along an ordering Θ without any meaningful difference in the “distances” between the Θ-neighboring states.[13]
A possible direction of research is to investigate the conditions in which a total ordering over states can be used to induce a(n approximately) total ordering over trajectories, avoiding the need to assign them real values (h/t Alex Altair).
Ordered alphabetically, by last name.
Unless you want to try to talk about (something like) “laws of nature” as doing the optimizing, but the ontological status of those is a whole other can of worms, so I’m glossing over that here.
Restricting Θ to the environment is not strictly necessary, but it makes things simpler by eliminating analogs of utility monsters, i.e. agents that change their internal states to “hijack” Θ even if they do not optimize the environment that much.
A comment by Alex: “FWIW the way I think about optimization, one never aggregates across the trajectory, because the ‘goodness’ of later states totally screens off the ‘goodness’ of earlier states. Although, I guess here you’re breaking off the agent part, which means that the internal state of the agent isn’t in the environment anymore. Some of my intuition for the above is because if an agent places value in knowing or remembering that something happened, that would be reflected in the state of its mind, which would then be reflected in the total state ordering.”
If we have uncountably many states, we have even more difficulties.
I haven’t thought about it much, but it seems that we get interesting results if we combine a quantile measure with a specific time window (or, more generally, a selection of timesteps). The higher the quantile of our score function of choice (with max being the extreme example), the more we will be selecting for an agent that is just trying to hit the peak once. Lower-quantile score functions (with min being the extreme example) will be “trying” to “ensure” the stability of the outcome that has been achieved, so that it “doesn’t degrade too much”. We choose a tradeoff between maximizing the top gains and limiting the losses. This seems like an interesting insight, but it is probably best illustrated with specific, graphical examples, which are beyond the scope of this post.
Although Alex himself said that (much of) the credit for this idea of measuring optimization (power) belongs to Eliezer Yudkowsky.
The closest thing we could get is the long-run frequencies of the environment when paired with the reference agent.
At least as long as we have finitely many states.
I’m using ΔX to denote the set of probability distributions over the set X. It could also be update:W→ΔW, in which case we would have to average out the probabilities whenever we apply update to a distribution over W, i.e. going $\Delta W \xrightarrow{\mathrm{update}} \Delta(\Delta W) \xrightarrow{\mathrm{reduce}} \Delta W$ with update being “mapped” into the probability distribution. I decided to use ΔW→ΔW because it felt more appropriate for the case of unrolling trajectories of varying probabilities. If we were interested in sampling trajectories, W→ΔW would be more appropriate.
I decided to use the term “value” rather than “utility” or “reward” because it has fewer problematic/potentially misleading connotations. “Utility” invokes vNM rationality, which is not assumed here. “Reward” invokes reinforcement learning, which is not assumed here either.
H/t to Alex Altair for bringing my attention to the distinction between reward and utility maximization.
Although maybe it’s just my intuition?