I suspect that the concept of utility functions specified over your actions is fuzzy in a problematic way. Does it refer to utility functions defined over the physical representation of the computer (e.g. the configuration of atoms in certain RAM cells whose values represent the selected action)? If so, we’re talking about systems that ‘want to affect (some part of) the world’, and thus we should expect such systems to have convergent instrumental goals with respect to our world (e.g. taking control of as many resources in our world as possible).
No, it’s not a utility function defined over the physical representation of the computer!
The Markov decision process formalism used in reinforcement learning already has the action taken by the agent as one of the inputs which determine the agent’s reward. You would have to do a lot of extra work to make it so that when the agent simulates the act of modifying its internal circuitry, the Markov decision process delivers a different set of rewards after that point in the simulation. Pretty sure this point has been made multiple times; you can see my explanation here. Another way to think about it is that goal-content integrity is a convergent instrumental goal, which is why the agent is not keen to destroy the content of its goals by modifying its internal circuits. You wouldn’t take a pill that made you into a psychopath even if you thought it’d be really easy to maximize your utility function as a psychopath.
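To make that concrete, here is a minimal toy sketch (my own illustration, not anyone’s actual agent code, with hypothetical state and action names) of how the MDP formalism treats the action as an input to a fixed reward function R(s, a): a simulated “modify my own circuits” action is just another action scored by the same R, unless you add extra machinery to swap the reward function out mid-simulation.

```python
# Minimal sketch of an MDP reward that takes the action as an input.
# All names here (REWARDS, TRANSITIONS, "answer_question", etc.) are
# hypothetical, chosen only to illustrate the point.

from typing import Dict, Tuple

State = str
Action = str

# The reward function R(s, a) is fixed as part of the environment definition.
REWARDS: Dict[Tuple[State, Action], float] = {
    ("start", "answer_question"): 1.0,
    ("start", "modify_own_circuits"): 0.0,
}

TRANSITIONS: Dict[Tuple[State, Action], State] = {
    ("start", "answer_question"): "start",
    ("start", "modify_own_circuits"): "start",
}


def step(state: State, action: Action) -> Tuple[State, float]:
    """One MDP step: the reward is looked up from the same fixed R(s, a),
    regardless of whether the action is 'answering' or 'self-modifying'."""
    reward = REWARDS[(state, action)]
    next_state = TRANSITIONS[(state, action)]
    return next_state, reward


if __name__ == "__main__":
    # Even if the agent's plan includes modifying its own circuits, the
    # rewards it anticipates are still computed by the same fixed table.
    for action in ("answer_question", "modify_own_circuits"):
        _, r = step("start", action)
        print(action, "->", r)
```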
It’s fine to make pessimistic assumptions, but in some cases they may be wildly unrealistic. If your Oracle has the goal of escaping instead of the goal of answering questions accurately (or similar), it’s not an “Oracle”.
Anyway, what I’m interested in is concrete ways things could go wrong, not pessimistic bounds. Pessimistic bounds are a matter of opinion; I’m trying to gather facts. BTW, note that the paper you cite doesn’t even claim its assumptions are realistic, just that solving safety problems in this worst case will also address less pessimistic cases. (Personally I’m a bit skeptical; I think you ideally want to understand the problem before proposing solutions. This recent post of mine provides an illustration.)