This is really interesting. I thought I understood it and I wanted to verify that by trying to summarize it (and maybe help others too) but now I’m not so sure…
Edit: Just to save anybody the reading time, my reasoning below is flawed. After sleeping on it, I see my mistake. Nothing below the “False Thermodynamic Miracles” subheading made sense to me yesterday because I thought the purpose of the setup was to have an “off switch” on the simulated AI under the false belief (letting it see the result of the signal after some time period). I get it now though. Maximizing P(no miracle)·C + P(miracle)·u(A | miracle) is the same as maximizing u(A | miracle). So the AI will act as if the miracle happened, because there’s no cost associated with those actions if it didn’t happen, only benefits if it did.
As I understand it, the basic premise is that the AI has a utility function u(w,A) where “w” is whether or not the specific thermodynamic miracle (TM) occurred and A represents a regular/normal input to a utility function (state of the universe, action taken, or whatever).
u(w,A) = C (a low constant) if w is false (the TM did not happen), and u(w,A) is essentially “u(A) given w” when w is true (i.e. it’s a “normal” kind of utility function, dominated by things other than the TM, though it still has to incorporate the fact that the TM happened into its model of the world).
So, the signal is sent, and it calculates its expected utility for either scenario. If w = false, u(w,A) = a low constant, and if w = true, it determines an action A that maximizes the utility function u(A)|w. As long as 2^-100 · u(A) > C, it starts taking action A.
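In code, the structure I have in mind is something like this (just a sketch; C, u_given_miracle and the action names are placeholders I made up for illustration):

```python
# Sketch of the two-branch utility function described above (illustrative values only).
C = 5.0  # constant utility of every world where the miracle did NOT happen

def u_given_miracle(A):
    # Stand-in for the "normal" utility function, evaluated in a world
    # where the thermodynamic miracle is assumed to have occurred.
    return {"do_nothing": 0.0, "plan_a": 7.0}[A]

def u(w, A):
    # w: True iff the specific thermodynamic miracle (TM) occurred
    # A: the usual input (world state / action taken / whatever)
    return u_given_miracle(A) if w else C

print(u(False, "plan_a"), u(True, "plan_a"))  # 5.0 7.0
```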
Is this right so far?
Then, after “ON” has gone through, it must adapt itself and backtrack on its previous behaviour.
This mechanism seems underspecified to me (though maybe it’s obvious to others). Why would it backtrack (or what do you mean by backtrack)? Eventually, it observes that w = false (that “ON” went through unchanged) and that its actions are no longer beneficial, so it just stops doing anything, right? The process terminates or it goes to standby?
If we want the AI to actually backtrack, wouldn’t we need to modify the utility function so that it takes a configuration parameter as well? u(w,A,X) s.t. X = true iff the AI is in its initial configuration, and set it so that u(F,A,F) = 0 but u(F,A,T) = C? (This probably counts as just an implementation detail, but I figured I’d mention it anyway.)
The part that makes me feel like I didn’t actually understand it at all is:
However, the only possible world that it cares about, now, is one in which the “ON” signal was overwritten by a thermodynamic miracle… into exactly the same “ON” signal. As long as the possibility of this is non-zero, the AI will behave as if it’s certain.
Can you explain this a bit more? Do you mean that even after it observes “ON”, it doesn’t stop acting as if w = true? That P(TM that overwrites ON with ON) * u(A)|w > C ? If that’s the case, then it would never backtrack, right? So it’s essentially a full simulation of an AI under the assumption w, but with the knowledge that w is incredibly unlikely, and no built-in halting condition?
Why would it backtrack (or what do you mean by backtrack)? Eventually, it observes that w = false (that “ON” went through unchanged) and that its actions are no longer beneficial, so it just stops doing anything, right? The process terminates or it goes to standby?
I think the presumption is that the case where the “ON” signal goes through normally and the case where the “ON” signal is overwritten by a thermodynamic miracle… into exactly the same “ON” signal are equivalent. That is, after the “ON” signal has gone through, the AI would behave identically to an AI that was not indifferent to worlds where the thermodynamic miracle did not occur.
The reason for this is that although the chance that the “ON” signal was overwritten into exactly the same “ON” signal is tiny, it is the only remaining possible world that the AI cares about, so it will act as if that is what it believes.
C need not be a low constant, btw. The only requirement is that u(false, action a, A) = u(false, action b, A) for all actions a and b and all A; i.e., nothing the AI does affects the utility of worlds where w is false, so this does not constrain its actions.
Basically the AI observes the ON signal going through, and knows that either a) the signal went through normally, or b) the signal was overwritten by coincidence with exactly the same signal. Its actions have no consequences in the first case, so it ignores it, and acts “as if” it were certain a thermodynamic miracle had happened.
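To make this concrete, here is a toy calculation (the probability, payoff numbers, and action names are all made up; only the structure matters):

```python
from fractions import Fraction

# Toy expected-utility calculation for the two-branch utility function.
# Exact rational arithmetic so the 2^-100 term isn't lost to rounding.
P_MIRACLE = Fraction(1, 2**100)  # P("ON" was overwritten by a TM... into "ON")
C = Fraction(1000)               # utility of every no-miracle world (same for all actions)

# Hypothetical utilities of candidate actions, conditional on the miracle having occurred.
u_given_miracle = {"do_nothing": Fraction(0), "plan_a": Fraction(7), "plan_b": Fraction(3)}

def expected_utility(action):
    # P(no miracle) * C + P(miracle) * u(action | miracle)
    return (1 - P_MIRACLE) * C + P_MIRACLE * u_given_miracle[action]

best = max(u_given_miracle, key=expected_utility)
print(best)  # "plan_a": the C term is identical for every action, so the
             # ranking is decided entirely by u(action | miracle), no matter
             # how tiny P_MIRACLE is or how large C is.
```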
Thanks. I understand now. Just needed to sleep on it, and today, your explanation makes sense.
Basically, the AI’s actions don’t matter if the unlikely event doesn’t happen, so it will take whatever actions would maximize its utility if the event did happen. This maximizes expected utility.
Maximizing P(no TM)·C + P(TM)·u(A | TM) is the same as maximizing u(A) under the assumption TM.
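Written out (with P(TM) and C treated as fixed constants, just restating the above):

$$\arg\max_{A}\Big[\,P(\neg\mathrm{TM})\,C \;+\; P(\mathrm{TM})\,u(A \mid \mathrm{TM})\,\Big] \;=\; \arg\max_{A}\, u(A \mid \mathrm{TM}),$$

since the first term does not depend on A and P(TM) > 0.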
Yes, that’s a clear way of phrasing it.