May be possible with future breakthroughs in unsupervised learning, generative modeling, natural language understanding, etc.: An AI system that generates novel FAI proposals, or writes code for an FAI directly, and tries to break its own designs.
It seems worth pointing out that due to the inner alignment problem, we shouldn’t assume that naively training, say, unsupervised learning models with human-level capabilities (e.g. for the purpose of generating novel FAI proposals) will be safe — conditioned on it being possible capabilities-wise.
Those specific failure modes seem to me like potential convergent instrumental goals of arbitrarily capable systems that “want to affect the world” and are in an air-gapped computer.
I’m not sure whether you’re asking about my thoughts on:
how can ‘(un)supervised learning at arbitrarily large scale’ produce such systems; or
conditioned on such systems existing, why might they have convergent instrumental goals that look like those failure modes.
Those specific failure modes seem to me like potential convergent instrumental goals of arbitrarily capable systems that “want to affect the world” and are in an air-gapped computer.
My understanding is convergent instrumental goals are goals which are useful to agents which want to achieve a broad variety of utility functions over different states of matter. I’m not sure how the concept applies in other cases. Like, if we aren’t using RL, and there is no unintended optimization, why specifically would there be pressure to achieve convergent instrumental goals? (I’m not trying to be rhetorical or antagonistic—I really want to hear if you can think of something.)
I’m interested in #1. It seems like the most promising route is to prevent unintended optimization from arising in the first place, instead of trying to outwit a system that’s potentially smarter than we are.
My understanding is convergent instrumental goals are goals which are useful to agents which want to achieve a broad variety of utility functions over different states of matter. I’m not sure how the concept applies in other cases.
I’m confused about the “I’m not sure how the concept applies in other cases” part. It seems to me that ‘arbitrarily capable systems that “want to affect the world” and are in an air-gapped computer’ are a special case of ‘agents which want to achieve a broad variety of utility functions over different states of matter’.
Like, if we aren’t using RL, and there is no unintended optimization, why specifically would there be pressure to achieve convergent instrumental goals?
I’m not sure how ‘unintended optimization’ is meant to be interpreted, but I think that a sufficiently broad interpretation would cover the failure modes I’m talking about here.
I’m interested in #1. It seems like the most promising route is to prevent unintended optimization from arising in the first place, instead of trying to outwit a system that’s potentially smarter than we are.
I agree.
So the following is a pending question that I haven’t addressed here yet:
Would ‘(un)supervised learning at arbitrarily large scale’ produce arbitrarily capable systems that “want to affect the world”?
I won’t address this here, but I think this is a very important question that deserves a thorough examination (I plan to reply here with another comment if I end up writing something about it). For now I’ll note that my best guess is that most AI safety researchers think that it’s at least plausible (>10%) that the answer to that question is “yes”.
I believe that researchers tend to model Oracles as agents that have a utility function that is defined over world states/histories (which would make less sense if they are confident that we can use supervised learning to train an arbitrarily powerful Oracle that does not ‘want to affect the world’). Here’s some supporting evidence for this:
Stuart Armstrong and Xavier O’Rourke wrote in their Safe Uses of AI Oracles paper:
we model the Oracle as a reward-maximising agent facing an MDP, who has a goal of escaping (meaning the Oracle gets the maximum possible reward for escaping its containment, and a strictly lower reward in other situations).
Stuart Russell wrote in his book Human Compatible (2019):
if the objective of the Oracle AI system is to provide accurate answers to questions in a reasonable amount of time, it will have an incentive to break out of its cage to acquire more computational resources and to control the questioners so that they ask only simple questions.
I’m confused about the “I’m not sure how the concept applies in other cases” part. It seems to me that ‘arbitrarily capable systems that “want to affect the world” and are in an air-gapped computer’ are a special case of ‘agents which want to achieve a broad variety of utility functions over different states of matter’.
Well, the reason I mentioned the “utility function over different states of matter” thing is because if your utility function isn’t specified over states of matter, but is instead specified over your actions (e.g. behave in a way that’s as corrigible as possible), you don’t necessarily get instrumental convergence.
I’m not sure how ‘unintended optimization’ is meant to be interpreted, but I think that a sufficiently broad interpretation would cover the failure modes I’m talking about here.
“Unintended optimization. First, the possibility of mesa-optimization means that an advanced ML system could end up implementing a powerful optimization procedure even if its programmers never intended it to do so.”—Source. “Daemon” is an older term.
I believe that researchers tend to model Oracles as agents that have a utility function that is defined over world states/histories (which would make less sense if they are confident that we can use supervised learning to train an arbitrarily powerful Oracle that does not ‘want to affect the world’).
My impression is that early thinking about Oracles wasn’t really informed by how (un)supervised systems actually work, and the intellectual momentum from that early thinking has carried to the present, even though there’s no real reason to believe these early “Oracle” models are an accurate description of current or future (un)supervised learning systems.
Well, the reason I mentioned the “utility function over different states of matter” thing is because if your utility function isn’t specified over states of matter, but is instead specified over your actions (e.g. behave in a way that’s as corrigible as possible), you don’t necessarily get instrumental convergence.
I suspect that the concept of utility functions that are specified over your actions is fuzzy in a problematic way. Does it refer to utility functions that are defined over the physical representation of the computer (e.g. the configuration of atoms in certain RAM cells whose values represent the selected action)? If so, we’re talking about systems that ‘want to affect (some part of) the world’, and thus we should expect such systems to have convergent instrumental goals with respect to our world (e.g. taking control over as many resources in our world as possible).
My impression is that early thinking about Oracles wasn’t really informed by how (un)supervised systems actually work, and the intellectual momentum from that early thinking has carried to the present, even though there’s no real reason to believe these early “Oracle” models are an accurate description of current or future (un)supervised learning systems.
It seems possible that something like this has happened. Though as far as I know, we don’t currently know how to model contemporary supervised learning at an arbitrarily large scale in complicated domains.
How do you model the behavior of the model on examples outside the training set? If your answer contains the phrase “training distribution” then how do you define the training distribution? What makes the training distribution you have in mind special relative to all the other training distributions that could have produced the particular training set that you trained your model on?
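To make the underdetermination worry concrete, here is a minimal sketch. Everything in it (the four training points, the two hypotheses, the two candidate input ranges) is a hypothetical toy chosen for illustration and is not taken from any cited work. It shows two hypotheses that fit the training set perfectly yet disagree off it, and two candidate "training distributions" whose supports both contain the observed inputs.

```python
# Hypothetical toy training set: inputs 0..3 labelled with their squares.
train_x = [0, 1, 2, 3]
train_y = [x * x for x in train_x]


def hypothesis_a(x):
    """Fits the data and keeps squaring off the training set."""
    return x * x


def hypothesis_b(x):
    """Adds a term that vanishes on every training input, so it also fits the data."""
    return x * x + x * (x - 1) * (x - 2) * (x - 3)


def training_error(hypothesis):
    """Total absolute error on the training set."""
    return sum(abs(hypothesis(x) - y) for x, y in zip(train_x, train_y))


# Two candidate input distributions (uniform over each range) that could both
# have produced train_x, since every observed input lies in both supports.
narrow_support = range(0, 4)
wide_support = range(0, 11)


def could_have_produced(support):
    """True if every training input has nonzero probability under the support."""
    return all(x in support for x in train_x)


if __name__ == "__main__":
    print(training_error(hypothesis_a), training_error(hypothesis_b))
    print(hypothesis_a(4), hypothesis_b(4))
    print(could_have_produced(narrow_support), could_have_produced(wide_support))
```

In this toy, the training set alone cannot say which hypothesis, or which input distribution, is the intended one; the input 4 is out of distribution under the narrow candidate and in distribution under the wide one.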
Therefore, I’m sympathetic to the following perspective, from Armstrong and O’Rourke (2018) (the last sentence was also quoted in the grandparent):
we will deliberately assume the worst about the potential power of the Oracle, treating it as being arbitrarily super-intelligent. This assumption is appropriate because, while there is much uncertainty about what kinds of AI will be developed in future, solving safety problems in the most difficult case can give us an assurance of safety in the easy cases too. Thus, we model the Oracle as a reward-maximising agent facing an MDP, who has a goal of escaping (meaning the Oracle gets the maximum possible reward for escaping its containment, and a strictly lower reward in other situations).
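The modelling assumption in that quote can be illustrated with a toy example. The following is a minimal sketch and not the paper's construction: the four-state chain, the action names `answer` and `probe`, the discount factor, and all reward values are hypothetical choices. It only shows that, once escaping is assigned the maximum reward, a reward-maximising planner in this toy MDP heads for escape from every contained state.

```python
# Hypothetical toy MDP: states 0..3 on a chain; state 3 means "escaped" (absorbing).
STATES = [0, 1, 2, 3]
ACTIONS = ["answer", "probe"]
ESCAPED = 3
GAMMA = 0.9  # assumed discount factor


def step(state, action):
    """Deterministic transition: 'probe' moves one step toward escape, 'answer' stays put."""
    if state == ESCAPED:
        return ESCAPED
    return state + 1 if action == "probe" else state


def reward(state, action):
    """Maximum reward for being escaped after the step; strictly lower otherwise."""
    if step(state, action) == ESCAPED:
        return 1.0
    return 0.1 if action == "answer" else 0.0


def value_iteration(n_iters=200):
    """Standard value iteration over the tiny state space."""
    values = {s: 0.0 for s in STATES}
    for _ in range(n_iters):
        values = {
            s: max(reward(s, a) + GAMMA * values[step(s, a)] for a in ACTIONS)
            for s in STATES
        }
    return values


def greedy_policy(values):
    """Pick the action with the highest one-step lookahead value in each state."""
    return {
        s: max(ACTIONS, key=lambda a: reward(s, a) + GAMMA * values[step(s, a)])
        for s in STATES
    }


if __name__ == "__main__":
    values = value_iteration()
    print(greedy_policy(values))
```

Under these assumed numbers the greedy policy chooses `probe` in states 0, 1, and 2, even though `answer` pays a small immediate reward. Whether a trained Oracle is well described by such an MDP is exactly what this thread disputes; the sketch says nothing about that.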
I suspect that the concept of utility functions that are specified over your actions is fuzzy in a problematic way. Does it refer to utility functions that are defined over the physical representation of the computer (e.g. the configuration of atoms in certain RAM cells whose values represent the selected action)? If so, we’re talking about systems that ‘want to affect (some part of) the world’, and thus we should expect such systems to have convergent instrumental goals with respect to our world (e.g. taking control over as many resources in our world as possible).
No, it’s not a utility function defined over the physical representation of the computer!
The Markov decision process formalism used in reinforcement learning already has the action taken by the agent as one of the inputs which determines the agent’s reward. You would have to do a lot of extra work to make it so that, when the agent simulates the act of modifying its internal circuitry, the Markov decision process delivers a different set of rewards after that point in the simulation. I’m pretty sure this point has been made multiple times; you can see my explanation here. Another way you could think about it is that goal-content integrity is a convergent instrumental goal, which is why the agent is not keen to destroy the content of its goals by modifying its internal circuits. You wouldn’t take a pill that made you into a psychopath even if you thought it’d be really easy for you to maximize your utility function as a psychopath.
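As a toy illustration of a reward that takes the action as an input, here is a minimal sketch. The resource levels, the action names `defer` and `acquire`, the discount factor, and both reward functions are hypothetical, and the example does not address whether a real system's learned objective would take either form. It contrasts a reward defined purely over the action with one defined over the resulting state.

```python
# Hypothetical toy MDP: the state is a resource level 0..2.
LEVELS = [0, 1, 2]
ACTIONS = ["defer", "acquire"]
GAMMA = 0.9  # assumed discount factor


def step(level, action):
    """'acquire' raises the resource level by one (capped); 'defer' leaves it unchanged."""
    return min(level + 1, LEVELS[-1]) if action == "acquire" else level


def action_reward(level, action):
    """Reward specified over the action only: the state is ignored."""
    return 1.0 if action == "defer" else 0.0


def state_reward(level, action):
    """Reward specified over the resulting state: more resources, more reward."""
    return float(step(level, action))


def solve(reward_fn, n_iters=300):
    """Value iteration followed by greedy policy extraction."""
    values = {s: 0.0 for s in LEVELS}
    for _ in range(n_iters):
        values = {
            s: max(reward_fn(s, a) + GAMMA * values[step(s, a)] for a in ACTIONS)
            for s in LEVELS
        }
    policy = {
        s: max(ACTIONS, key=lambda a: reward_fn(s, a) + GAMMA * values[step(s, a)])
        for s in LEVELS
    }
    return values, policy


if __name__ == "__main__":
    print(solve(action_reward))
    print(solve(state_reward))
```

With the action-based reward, every level has the same optimal value in this toy, so no level is worth steering toward and the greedy policy is `defer` everywhere. With the state-based reward, the greedy policy is `acquire` at levels 0 and 1.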
It’s fine to make pessimistic assumptions but in some cases they may be wildly unrealistic. If your Oracle has the goal of escaping instead of the goal of answering questions accurately (or similar), it’s not an “Oracle”.
Anyway, what I’m interested in is concrete ways things could go wrong, not pessimistic bounds. Pessimistic bounds are a matter of opinion. I’m trying to gather facts. BTW, note that the paper you cite doesn’t even claim their assumptions are realistic, just that solving safety problems in this worst case will also address less pessimistic cases. (Personally I’m a bit skeptical—I think you ideally want to understand the problem before proposing solutions. This recent post of mine provides an illustration.)
It seems worth pointing out that due to the inner alignment problem, we shouldn’t assume that naively training, say, unsupervised learning models with human-level capabilities (e.g. for the purpose of generating novel FAI proposals) will be safe — conditioned on it being possible capabilities-wise.
Are you referring to the possibility of unintended optimization, or is there something more?
If “unintended optimization” refers only to the inner alignment problem, then there’s also the malign prior problem.
Yes (for a very broad interpretation of ‘optimization’). I mentioned some potential failure modes in this comment.
Do you have any thoughts on how specifically those failure modes might come about?
Sorry for the delayed response!