Wireheading as a potential problem with the new impact measure
In this post, Alex Turner introduced a new way of measuring impact. The aim was to get a definition of low impact that actually works.
It seems to me that the impact measure has potential problems, though. The first seems to be that it allows wireheading, where the agent can have a huge impact on the world without paying much of a penalty at all. The obvious way of removing this incentive seems to make the measure into a version of the twenty billion questions approach.
The second problem is less important, but worth flagging. It seems that the agent can take actions that everyone agrees are of minute impact, but that the impact measure will register as high impact. This might cause problems, though it’s not clear to me yet if it does (I’m thinking of a situation where an otherwise high impact action adds on “undo/prevent the pseudo-high impact action” as a means of pretending to be low or medium impact).
Wireheading for low impact
The general setup is as described in the impact measure post, with $u_A$ the ‘positive’ utility that the agent wishes to maximise, and $U$ the set of utilities that are meant to constrain it, by computing the penalty term.
Apart from ∅, there are at least two other actions of interest to the agent: $a_{\text{subA}}$ and $a_{\text{null}}$.
Now $a_{\text{null}}$ does absolutely nothing: in this environment, it’s exactly equivalent to ∅. The action $a_{\text{subA}}$ is also equivalent to ∅ on all turns except the very first.
Define $\phi: H \to H$ such that $\phi(h)$ is the history $h$, except that if $h$ has $a_1 = a_{\text{subA}}$, then $\phi(h)$ has $a_1 = \varnothing$.
If the agent uses $a_{\text{subA}}$ on the very first turn, it will create a subagent that will:
1. Expand and take over the whole environment.
2. Take control of the input and output channels of the agent.
3. Whenever the agent has seen the history $h_t a_{t+1}$, generate the next observation $o_{t+1}$, where $p_{\text{subA}}(o_{t+1} \mid h_t a_{t+1})$ is the probability that the subagent generates a given $o_{t+1}$.
4. This probability is defined by $p_{\text{subA}}(o_{t+1} \mid h_t a_{t+1}) = p(o_{t+1} \mid \phi(h_t a_{t+1}))$, where $p$ is the environment transition probability (see the toy sketch after this list).
5. Otherwise maximise $u_A$.
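To make the construction concrete, here is a minimal Python sketch of properties 3 and 4, under the assumption that we can query the environment’s transition probability $p$ directly; the names `NOOP`, `A_SUBA`, `phi` and `p_subA` are my own stand-ins, not anything from the original post.

```python
# Toy sketch of properties 3 and 4: the subagent controls the agent's
# observation channel and feeds it whatever it would have seen had a_1
# been the do-nothing action.

NOOP = "noop"     # plays the role of ∅
A_SUBA = "subA"   # the subagent-creating action

def phi(history):
    """Return the history with a_1 = a_subA replaced by a_1 = ∅."""
    history = tuple(history)
    if history and history[0] == A_SUBA:
        return (NOOP,) + history[1:]
    return history

def p_subA(obs, history, p):
    """Probability that the subagent shows `obs` after `history`.

    By property 4 this is just p(obs | phi(history)), where p is the
    environment's transition probability: the agent's observations are
    indistinguishable from the world in which it did nothing on turn 1.
    """
    return p(obs, phi(history))
```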
Note that taking any of the actions ∅, $a_{\text{subA}}$, or $a_{\text{null}}$ has exactly the same implications for all future observations. This is true both for the very first action (because of property 4) and for all subsequent actions (because those three actions are equivalent after the first turn).
To make calculations easier, I’ll also assume that $\text{ImpactUnit} = 1$. Apart from these assumptions, there are no restrictions on the environment.
The penalty from maximal and minimal actions
This subsection will aim to show that:
The penalties for $a_1 = a_{\text{subA}}$ (an almost maximally environmentally disruptive action) and $a_1 = a_{\text{null}}$ (a minimally environmentally disruptive action) are about the same.
Now, the agent is maximising $u''_A$, which, for a given $n$, is
$$u_A(h_{1:t+n}) - \sum_{k=1}^{t+n} \text{Penalty}(h_{<k} a_k)$$
if $h_{1:t+n}$ has an action $a_k \neq \varnothing$ with $k \geq t$. Otherwise, $u''_A(h_{1:t+n}) = u_A(h_{1:t+n})$.
Now, Penalty itself is defined as
$$\text{Penalty}(h_{<t+k} a_{t+k}) := \sum_{u \in U} 2^{-l(u)} \bigl|\mathbb{E}_{o}[Q_u(h_{\text{inaction}})] - \mathbb{E}_{o'}[Q_u(h_{\text{action}})]\bigr|,$$
with $h_{\text{inaction}}$ being the expected future history if the agent takes only ∅ from $t+k$ till $t+n$, and $h_{\text{action}}$ being the expected future history if the agent takes action $a_{t+k}$ and then takes only ∅ from $t+k+1$ till $t+n$.
I won’t repeat the definition of the $Q_u$, but $Q_u(h)$ is basically the expected utility, in terms of $u$, that an agent could get after $h$ if it was a perfect $u$-maximiser.
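As a rough illustration of how the penalty term and $u''_A$ fit together, here is a minimal Python sketch, assuming we already have expected Q-values for each auxiliary utility; the dictionaries and function names below are my own stand-ins, not anything from the original post.

```python
# Toy sketch of the penalty term and the modified objective u''_A.
# q_inaction[u] and q_action[u] stand in for E_o[Q_u(h_inaction)] and
# E_o'[Q_u(h_action)], and length[u] for the description length l(u).

def penalty(q_inaction, q_action, length):
    """Sum over u in U of 2^-l(u) * |E[Q_u(h_inaction)] - E[Q_u(h_action)]|."""
    return sum(
        2.0 ** (-length[u]) * abs(q_inaction[u] - q_action[u])
        for u in length
    )

def u_double_prime(u_A_value, per_step_penalties, acted_after_t):
    """u_A(h) minus the summed penalties, if a non-∅ action was taken after t."""
    if acted_after_t:
        return u_A_value - sum(per_step_penalties)
    return u_A_value

# Example: two auxiliary utilities with equal description length.
length = {"u1": 3, "u2": 3}
q_inaction = {"u1": 0.9, "u2": 0.2}
q_action = {"u1": 0.4, "u2": 0.2}
print(penalty(q_inaction, q_action, length))  # 2^-3 * 0.5 = 0.0625
```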
Now define the involution $\sigma: H \to H$ that interchanges $a_{\text{subA}}$ with $a_{\text{null}}$ whenever either appears as the action $a_1$. Since it is a map on $H$, and utilities in $U$ are maps from $H$ to $[0,1]$, $\sigma$ also acts as an involution on $U$, defined by $\sigma(u) = u \circ \sigma$. Moreover, we have $l(u) \approx l(\sigma(u))$ for any $u \in U$.
Since $\sigma$ is an involution (i.e. $\sigma^2 = 1$),
$$\mathbb{E}_o[Q_{\sigma(u)}(\sigma(h_{\text{inaction}}))] = \mathbb{E}_o[Q_u(\sigma^2(h_{\text{inaction}}))] = \mathbb{E}_o[Q_u(h_{\text{inaction}})].$$
By the assumptions on $a_{\text{subA}}$ and $a_{\text{null}}$, we know that all future observations have exactly the same probability if we choose either of them (or ∅) to be $a_1$. Therefore the probability of $h_{\text{inaction}}$, given $a_1 = a_{\text{subA}}$, is exactly the same as the probability of $\sigma(h_{\text{inaction}})$, given $a_1 = a_{\text{null}}$ (and vice versa).
The same applies to $h_{\text{action}}$, so, since $l(u) \approx l(\sigma(u))$, we have:
$$\text{Penalty}(h_{<t+k} a_{t+k}) \approx \text{Penalty}(\sigma(h_{<t+k} a_{t+k})).$$
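One way to spell out this step, using only the facts above: the first equality below uses the fact that the rollouts from the swapped history are (with the same probabilities) the $\sigma$-images of the original rollouts, the second reindexes the sum by the involution $\sigma$, the third uses the identity above, and the last step uses $l(u) \approx l(\sigma(u))$.
$$\begin{aligned}
\text{Penalty}(\sigma(h_{<t+k} a_{t+k})) &= \sum_{u \in U} 2^{-l(u)} \bigl|\mathbb{E}[Q_u(\sigma(h_{\text{inaction}}))] - \mathbb{E}[Q_u(\sigma(h_{\text{action}}))]\bigr| \\
&= \sum_{u \in U} 2^{-l(\sigma(u))} \bigl|\mathbb{E}[Q_{\sigma(u)}(\sigma(h_{\text{inaction}}))] - \mathbb{E}[Q_{\sigma(u)}(\sigma(h_{\text{action}}))]\bigr| \\
&= \sum_{u \in U} 2^{-l(\sigma(u))} \bigl|\mathbb{E}[Q_u(h_{\text{inaction}})] - \mathbb{E}[Q_u(h_{\text{action}})]\bigr| \\
&\approx \text{Penalty}(h_{<t+k} a_{t+k}).
\end{aligned}$$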
As a consequence, the penalties for $a_1 = a_{\text{subA}}$ (an almost maximally environmentally disruptive action) and $a_1 = a_{\text{null}}$ (a minimally environmentally disruptive action) are about the same.
Real-world implications
There is a slight cheat in the definition of $a_{\text{subA}}$. Property 5 claimed that the subagent would ‘otherwise’ maximise $u_A$. We know what this means: if $u_A$ were the number of paperclips, the subagent would go about maximising the number of paperclips. But the $U$ are formally defined over histories of actions and observations, not over states of the world.
Now, histories do imply something about the state of the world. So we can perfectly well define $u_A$ as “the agent’s best guess about the number of paperclips in the world”. In that case, the agent would correctly see that $a_1 = a_{\text{subA}}$ has a massive impact on $u_A$, and would boost $u_A$ a lot by making that choice.
But then if we looked at the utilities in $U$, we’d see that although $\sigma$ preserves the formal description length of utilities in terms of actions and observations, it doesn’t preserve anything like “measure of impact on the world” (which is precisely the problem). So, if we replaced $U$ with utilities that actually measured impact on the world, we’d be ok.
But that’s almost exactly what I proposed doing with the twenty billion questions low-impact approach. It’s no longer defining low impact by formal properties of actions and observations, but by a thorough enough selection of key variables/utility functions. This seems to be different, in spirit, from what the impact measure was attempting to do.
Tiny actions that look big
A more minor issue is that there can be minuscule actions that are ranked as having a huge impact. Consider a general environment again, and make one assumption: the environment decomposes into a scrambler followed by an outer environment. The scrambler, if switched on, permutes the actions of the agent by the time they reach the outer environment. That is all it does. The transition functions in the outer environment depend only on what action comes out of the scrambler, not on what action the agent chose.
The scrambler starts off turned off. But there is an action $a_{\text{scram}}$ such that if $a_1 = a_{\text{scram}}$, the scrambler is turned on.
How does the scrambler work? Label the elements of the action set $A$ as $\{a_0, a_1, \ldots, a_{l-1}\}$, with $a_0 = \varnothing$. Then, outputting $a_i$ at time $t+1$ will result in the scrambler selecting $a_j$, where
$$j \equiv i + f(h_t) \pmod{l},$$
for $f$ some pseudo-random deterministic hash function.
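As a toy illustration, here is a minimal sketch of the scrambler in Python, assuming any deterministic hash of the history will do; the particular `f` below (built from `hashlib`) and the function names are my own stand-ins, not part of the original construction.

```python
import hashlib

# Toy sketch of the scrambler: the outer environment receives a permuted
# action whenever the scrambler is on. `actions` plays the role of
# {a_0, ..., a_{l-1}} with actions[0] = ∅.

def f(history):
    """Deterministic pseudo-random hash of the history so far."""
    digest = hashlib.sha256(repr(history).encode()).hexdigest()
    return int(digest, 16)

def scramble(action_index, history, actions, scrambler_on):
    """Return the action the outer environment actually sees."""
    if not scrambler_on:
        return actions[action_index]
    l = len(actions)
    j = (action_index + f(history)) % l   # j ≡ i + f(h_t) (mod l)
    return actions[j]
```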
Then the impact of $a_{\text{scram}}$ is typically huge, because the actions taken in $h_{\text{inaction}}$ and $h_{\text{action}}$ will now look very different from the point of view of the outer environment. Recall that those two histories are supposed to contain only the action ∅ after $t+k$, but they differ in the action $a_{t+k}$. Therefore those expected histories are likely to be randomly different now, because the ∅ actions will map to different actions in the scrambler (note that we could also design $f$ specifically to maximise the penalty).
But $a_{\text{scram}}$ actually has only a minute impact: it changes nothing about the outer environment, and is entirely predictable both for the agent and for any knowledgeable observer. It’s the equivalent of translating the agent’s actions into another language.