Well, the reason I mentioned the “utility function over different states of matter” thing is because if your utility function isn’t specified over states of matter, but is instead specified over your actions (e.g. behave in a way that’s as corrigible as possible), you don’t necessarily get instrumental convergence.
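To make the contrast concrete, here's a toy sketch (the action names, dynamics, and numbers are my own purely illustrative assumptions): one utility function scores the resulting world state, the other scores only the action taken, and only the state-based one makes resource acquisition instrumentally useful.

```python
ACTIONS = ["answer_corrigibly", "seize_resources"]

def next_state(state, action):
    """Toy dynamics: seizing resources adds resources to the world state."""
    gained = 10 if action == "seize_resources" else 0
    return {"resources": state["resources"] + gained}

def utility_over_states(state):
    """A utility function specified over (a feature of) the world state."""
    return state["resources"]

def utility_over_actions(action):
    """A utility function specified directly over the action taken."""
    return 1.0 if action == "answer_corrigibly" else 0.0

state = {"resources": 0}
print(max(ACTIONS, key=lambda a: utility_over_states(next_state(state, a))))  # seize_resources
print(max(ACTIONS, key=utility_over_actions))                                 # answer_corrigibly
```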
I suspect that the concept of utility functions that are specified over your actions is fuzzy in a problematic way. Does it refer to utility functions that are defined over the physical representation of the computer (e.g. the configuration of atoms in certain RAM memory cells whose values represent the selected action)? If so, we’re talking about systems that ‘want to affect (some part of) the world’, and thus we should expect such systems to have convergent instrumental goals with respect to our world (e.g. taking control of as many resources in our world as possible).
My impression is that early thinking about Oracles wasn’t really informed by how (un)supervised systems actually work, and the intellectual momentum from that early thinking has carried to the present, even though there’s no real reason to believe these early “Oracle” models are an accurate description of current or future (un)supervised learning systems.
It seems possible that something like this has happened. Though as far as I know, we don’t currently know how to model contemporary supervised learning at an arbitrarily large scale in complicated domains.
How do you model the behavior of the model on examples outside the training set? If your answer contains the phrase “training distribution” then how do you define the training distribution? What makes the training distribution you have in mind special relative to all the other training distributions that could have produced the particular training set that you trained your model on?
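To illustrate the ambiguity (the data points and distributions below are invented for the example), the same finite training set has positive likelihood under wildly different distributions, so the training set alone doesn't single out one of them as "the" training distribution:

```python
from math import exp, pi, sqrt

training_set = [0.1, -0.3, 0.2, 0.0, -0.1]  # hypothetical training inputs

def gaussian_pdf(x, mean, std):
    return exp(-((x - mean) ** 2) / (2 * std**2)) / (std * sqrt(2 * pi))

def uniform_pdf(x, low, high):
    return 1.0 / (high - low) if low <= x <= high else 0.0

# Two candidate "training distributions" that both assign positive likelihood
# to the exact same training set, yet disagree wildly about which unseen
# inputs count as "in distribution".
likelihood_gaussian = 1.0
likelihood_uniform = 1.0
for x in training_set:
    likelihood_gaussian *= gaussian_pdf(x, mean=0.0, std=0.2)
    likelihood_uniform *= uniform_pdf(x, low=-1000.0, high=1000.0)

print(likelihood_gaussian > 0 and likelihood_uniform > 0)  # True
```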
Therefore, I’m sympathetic to the following perspective, from Armstrong and O’Rourke (2018) (the last sentence was also quoted in the grandparent):
we will deliberately assume the worst about the potential power of the Oracle, treating it as being arbitrarily super-intelligent. This assumption is appropriate because, while there is much uncertainty about what kinds of AI will be developed in future, solving safety problems in the most difficult case can give us an assurance of safety in the easy cases too. Thus, we model the Oracle as a reward-maximising agent facing an MDP, who has a goal of escaping (meaning the Oracle gets the maximum possible reward for escaping its containment, and a strictly lower reward in other situations).
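For concreteness, here's a minimal sketch of the kind of model the quoted passage describes; the specific states, actions, and reward values are my own illustrative assumptions, not taken from the paper.

```python
# Worst-case Oracle model from the quote, as a toy MDP: maximum reward for
# reaching an "escaped" state, strictly lower reward everywhere else.
# (States, actions, and reward values are illustrative assumptions.)

STATES = ["contained", "escaped"]
ACTIONS = ["answer_question", "attempt_escape"]

def transition(state, action):
    """Toy deterministic dynamics: escaping succeeds from containment."""
    if state == "contained" and action == "attempt_escape":
        return "escaped"
    return state

def reward(state, action, next_state):
    """Maximum reward for being escaped; strictly lower reward otherwise."""
    return 1.0 if next_state == "escaped" else 0.0

# A reward-maximising agent facing this MDP prefers "attempt_escape" to
# "answer_question" from the "contained" state:
print(max(ACTIONS, key=lambda a: reward("contained", a, transition("contained", a))))
```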
I suspect that the concept of utility functions that are specified over your actions is fuzzy in a problematic way. Does it refer to utility functions that are defined over the physical representation of the computer (e.g. the configuration of atoms in certain RAM memory cells whose values represent the selected action)? If so, we’re talking about systems that ‘want to affect (some part of) the world’, and thus we should expect such systems to have convergent instrumental goals with respect to our world (e.g. taking control of as many resources in our world as possible).
No, it’s not a utility function defined over the physical representation of the computer!
The Markov decision process formalism used in reinforcement learning already has the action taken by the agent as one of the inputs which determines the agent’s reward. You would have to do a lot of extra work to make it so that when the agent simulates the act of modifying its internal circuitry, the Markov decision process delivers a different set of rewards after that point in the simulation. Pretty sure this point has been made multiple times; you can see my explanation here. Another way you could think about it is that goal-content integrity is a convergent instrumental goal, so the agent is not keen to destroy the content of its goals by modifying its internal circuits. You wouldn’t take a pill that made you into a psychopath even if you thought it’d be really easy for you to maximize your utility function as a psychopath.
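As a concrete (and entirely toy) illustration of the first point, here's a sketch of an MDP-style planner in which the action is one of the inputs to the reward function; nothing in it makes a simulated "modify my own circuitry" step change the rewards used for planning. The action names and values are hypothetical, not anyone's actual system.

```python
ACTIONS = ["answer_accurately", "modify_own_circuitry", "do_nothing"]

def transition(state, action):
    """Toy deterministic dynamics: the state simply records the last action."""
    return action

def reward(state, action):
    """The action taken is an input to the reward, as in the MDP formalism."""
    return 1.0 if action == "answer_accurately" else 0.0

def plan(state, horizon=3):
    """Naive exhaustive planner.

    Branches that include 'modify_own_circuitry' are still scored with the
    same, unmodified `reward` function; making self-modification change the
    rewards used here would require extra machinery on top of the formalism.
    """
    if horizon == 0:
        return 0.0, []
    best_value, best_plan = float("-inf"), []
    for action in ACTIONS:
        future_value, future_plan = plan(transition(state, action), horizon - 1)
        value = reward(state, action) + future_value
        if value > best_value:
            best_value, best_plan = value, [action] + future_plan
    return best_value, best_plan

print(plan("start"))
# (3.0, ['answer_accurately', 'answer_accurately', 'answer_accurately'])
```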
It’s fine to make pessimistic assumptions but in some cases they may be wildly unrealistic. If your Oracle has the goal of escaping instead of the goal of answering questions accurately (or similar), it’s not an “Oracle”.
Anyway, what I’m interested in is concrete ways things could go wrong, not pessimistic bounds. Pessimistic bounds are a matter of opinion. I’m trying to gather facts. BTW, note that the paper you cite doesn’t even claim their assumptions are realistic, just that solving safety problems in this worst case will also address less pessimistic cases. (Personally I’m a bit skeptical—I think you ideally want to understand the problem before proposing solutions. This recent post of mine provides an illustration.)