Not all non-maximisers become maximisers
Because of the problems with optimisers, there’s a search for other types of designs that can accomplish goals without the extreme optimisation pressure. Satisficers are one example of such designs, but satisficers aren’t reflectively stable (a satisficer can design a successor agent that isn’t a satisficer) and aren’t even reflectively consistent (there are situations where a satisficer will prefer to become a maximiser rather than staying a satisficer).
The general failure mode of an "other-ising" design is that it turns into a maximiser.
Here’s a design that doesn’t turn into a maximiser, at least not directly. It isn’t useful for anything in particular, but it does show some of the odd behaviours that certain agents can indulge in.
Doubly bounded satisficer
A doubly-bounded satisficer (DBS) is an agent with a utility function U and two bounds q < r. Its aim is to take actions that ensure the expected utility E(U) of U is bounded between q and r.
It’s clear that, in general, the DBS won’t make itself into a U-maximiser (since that might have an expectation above r) or into a U-minimiser (since that might have an expectation below q). But how exactly will it behave?
Assume for illustration purposes that 0≤U≤1, and that ϵ>0 is very small.
No stochasticity
If there is no stochasticity or uncertainty, the DBS knows that for every policy π there is a corresponding value U_π of U. Therefore it will select a policy such that U_π ∈ [q, r].
Alternatively, it can simply turn itself into a U′_{q,r} maximiser, where U′_{q,r} = 1 iff U ∈ [q, r], and U′_{q,r} = 0 otherwise.
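Here is a minimal sketch of the deterministic case (illustrative Python, with names not from the original post): each candidate policy has a known utility value, and picking any policy whose value lies in [q, r] amounts to maximising the indicator utility U′_{q,r}.

```python
def indicator_utility(u, q, r):
    """U'_{q,r}: 1 if u lies in [q, r], 0 otherwise."""
    return 1.0 if q <= u <= r else 0.0

def choose_policy(policy_utilities, q, r):
    """Return the index of a policy whose (deterministic) utility lies in [q, r]."""
    return max(range(len(policy_utilities)),
               key=lambda i: indicator_utility(policy_utilities[i], q, r))

# Four candidate policies with known utilities, and [q, r] = [0.4, 0.6]:
print(choose_policy([0.1, 0.55, 0.9, 0.3], q=0.4, r=0.6))  # -> 1
```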
Stochastic choice between outcomes
Let’s add some uncertainty to the setup. The DBS knows that two numbers x and y will be drawn uniformly at random from [0, 1], and it will then have the choice between actions a_x and a_y. Then U|a_x = x and U|a_y = y.
Set [q,r] equal to [1/2−ϵ,1/2+ϵ]
In this case, maximising U′_{q,r} (and breaking ties randomly) is still a good policy for the agent. But so is choosing actions at random! Both of these have an expected U of 1/2.
In contrast, maximising U′_{q,r} and breaking ties by choosing the highest action has an expected utility of around 2/3. So it’s not the U′_{q,r}-maximisation that’s doing the job here.
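A quick Monte Carlo sketch of the three policies just discussed (illustrative Python, not from the original post): with ϵ very small the indicator band is almost never hit, so the tie-breaking rule does nearly all the work.

```python
import random

EPS = 1e-3
Q, R = 0.5 - EPS, 0.5 + EPS

def in_band(u):
    return Q <= u <= R

def run(policy, n=200_000):
    return sum(policy(random.random(), random.random()) for _ in range(n)) / n

def pick_randomly(x, y):
    return random.choice((x, y))

def indicator_max_random_ties(x, y):
    # Prefer a value inside [q, r]; break ties (the usual case) at random.
    if in_band(x) != in_band(y):
        return x if in_band(x) else y
    return random.choice((x, y))

def indicator_max_highest_ties(x, y):
    # Prefer a value inside [q, r]; break ties by taking the highest value.
    if in_band(x) != in_band(y):
        return x if in_band(x) else y
    return max(x, y)

print(run(pick_randomly))               # ~0.50
print(run(indicator_max_random_ties))   # ~0.50
print(run(indicator_max_highest_ties))  # ~0.67
```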
Set [q,r] equal to [2/3−ϵ,2/3+ϵ]
In this case, maximising U straight up is a good policy (and all good policies are close to that one). But that’s an artefact of these particular values of q and r.
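(Maximising U straight up means always choosing max(x, y), whose expectation is 2/3: P(max(x, y) ≤ t) = t², so the maximum has density 2t on [0, 1] and E[max(x, y)] = ∫₀¹ t·2t dt = 2/3.)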
Set [q,r] equal to [25/48−ϵ,25/48+ϵ]
In this case, one good policy is:
1. If x and y are both above 1/2, choose the highest option.
2. Otherwise, choose randomly.
This policy has an expectation of 1/4(5/6) + 1/2(1/2) + 1/4(1/4) = 25/48 (the terms correspond to both values lying above 1/2, exactly one lying above, and both lying below, respectively), but there is no natural utility function maximisation that corresponds to this policy.
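A quick numerical check of this expectation (illustrative Python, not from the original post):

```python
import random

def piecewise_policy(x, y):
    # Take the higher value when both draws exceed 1/2; otherwise pick at random.
    if x > 0.5 and y > 0.5:
        return max(x, y)
    return random.choice((x, y))

n = 500_000
estimate = sum(piecewise_policy(random.random(), random.random()) for _ in range(n)) / n
print(estimate, 25 / 48)  # both around 0.5208
```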
Value of information (and ignorance)
If the agent can self-modify before receiving any extra information, then that information always has non-negative value (trivial proof: the agent can self-modify into an agent that ignores the extra information, if need be).
But if the agent cannot self-modify, then it might prefer not to know the information, depending on how it breaks ties and other considerations.
Generalisations
We can generalise this to a “satisficer” that wants expected utility to lie in a certain subset S of the range of U, not necessarily an interval [q, r] (if S has a hole in it, it becomes even clearer why an agent might want to avoid information: a mix of compatible policies need not be a compatible policy).
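For instance (an illustrative choice of S, not from the original post): take S = [0, 1/3] ∪ [2/3, 1]. A policy with expected utility 1/4 is compatible with S, and so is one with expected utility 3/4, but a 50/50 mix of the two has expected utility 1/2, which lands in the hole; if incoming information would effectively force the agent into such a mixture across information states, it can prefer to stay ignorant.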
Anyway, this agent is somewhat odd, but it’s an interesting agent that doesn’t immediately want to become a maximiser.