One broad argument for AI risk is the Misspecified Goal argument
Do you have a citation for this? Who are you arguing against, or whose argument are you trying to clarify?
I am suggesting as a possibility that the Misspecified Goal argument relies on us incorrectly equating superintelligence with “pursuing a goal” because we use “pursuing a goal” as a default model for anything that can do interesting things, even if that is not the best model to be using.
I tend to have a different version of the Misspecified Goal argument in mind which I think doesn’t have this problem:
At least some humans are goal-directed at least some of the time.
It’s likely possible to create artificial agents that are better at achieving goals than humans.
It will be very tempting for some humans to build such agents in order to help those humans achieve their goals (either instrumental goals or their understanding of their terminal goals).
The most obvious way for other humans to compete with or defend against such goal-oriented artificial agents is to build their own goal-oriented artificial agents.
If at that point we do not know how to correctly specify the goals for such artificial agents (and we haven’t figured out how to stop / compete with / defend against such agents some other way), the universe will end up being directed towards the wrong goals, which may be catastrophic depending on various contingencies, such as what the correct metaethics and normative ethics turn out to be, and whether the “incorrect” goals we build into such agents are good enough to capture most of our scalable values.
I briefly looked for and did not find a good citation for this.
Who are you arguing against, or whose argument are you trying to clarify?
I’m not sure. However, I have a lot of conversations where it seems to me that the other person believes the Misspecified Goal argument. Currently, if I were to meet a MIRI employee I hadn’t met before, I would be unsure whether the Misspecified Goal argument is their primary reason for worrying about AI risk. If I meet a rationalist who takes the MIRI perspective on AI risk but isn’t at MIRI themselves, by default I assume that their primary reason for caring about AI risk is the Misspecified Goal argument.
I do want to note that I am primarily trying to clarify here; I didn’t write this as an argument against the Misspecified Goal argument. In fact, conditional on the AI having goals, I do agree with the Misspecified Goal argument.
I tend to have a different version of the Misspecified Goal argument in mind which I think doesn’t have this problem
Yeah, I think this is a good argument, and I want to defer to my future post on the topic, which should come out on Wednesday. The TL;DR is that I agree with the argument but it implies a broader space of potential solutions than “figure out how to align a goal-directed AI”.
(Sorry that I didn’t adequately point to the different arguments and what I think about them; doing so would have made for a very long post, so it’s instead being split into several posts, and this particular argument happens to be in the post coming out on Wednesday.)
My guess is that agents that are not primarily goal-directed can be good at defending against goal-directed agents (especially with a first-mover advantage that prevents goal-directed agents from gaining power), and are potentially more tractable for alignment purposes, provided humans coexist with AGIs during their development and operation (rather than existing only as computational processes inside the AGI’s goal, a situation where a goal concept becomes necessary).
I think the assumption that useful agents must be goal-directed has misled a lot of discussion of AI risk in the past. Goal-directed agents are certainly a problem, but not necessarily the solution. They are probably good for fixing astronomical waste, but maybe not AI risk.
I think I disagree with this at least to some extent. Humans are not generally safe agents, and in order for not-primarily-goal-directed AIs to not exacerbate humans’ safety problems (for example by rapidly shifting their environments/inputs out of a range where they are known to be relatively safe), it seems that we have to solve many of the same metaethical/metaphilosophical problems that we’d need to solve to create a safe goal-directed agent. I guess in some sense the former has lower “AI risk” than the latter in that you can plausibly blame any bad outcomes on humans instead of AIs, but to me that’s actually a downside because it means that AI creators can more easily deny their responsibility to help solve those problems.
Learning how to design goal-directed agents seems like an almost inevitable milestone on the path to figuring out how to safely elicit human preference in an actionable form. But the steps involved in eliciting and enacting human preference don’t necessarily make use of a concept of preference or goal-directedness. An agent with a goal aligned with the world can’t derive its security from the abstraction of goal-directedness, because the world determines that goal, and so the goal is vulnerable to things in the world, including human error. Only self-contained artificial goals are safe from the world and may lead to safe goal-directed behavior. A goal built from human uploads that won’t be updated from the world in the future gives safety from other things in the world, but not from errors of the uploads.
When the issue is figuring out which influences of the world to follow, it’s not clear that goal-directedness remains salient. If there is a goal, then there is also a world-in-the-goal and listening to your own goal is not safe! Instead, you have to figure out which influences in your own goal to follow. You are also yourself part of the world and so there is an agent-in-the-goal that can decide aspects of preference.

This framing where a goal concept is prominent is not obviously superior to other designs that don’t pursue goals, and instead focus on pointing at the appropriate influences from the world. For example, a system may seek to make reliable uploads, or figure out which decisions of uploads are errors, or organize uploads to make sense of situations outside normal human environments, or be corrigible in a secure way, so as to follow directions of a sane external operator and not of an attacker.

Once we have enough of such details figured out (none of which is a goal-directed agent), it becomes possible to take actions in the world. At that point, we have a system of many carefully improved kluges that further many purposes in much the same way as human brains do, and it’s not clearly an improvement to restructure that system around a concept of goals, because that won’t move it closer to the influences of the world it’s designed to follow.
This framing where a goal concept is prominent is not obviously superior to other designs that don’t pursue goals, and instead focus on pointing at the appropriate influences from the world. For example, a system may seek to make reliable uploads, or figure out which decisions of uploads are errors, or organize uploads to make sense of situations outside normal human environments, or be corrigible in a secure way, so as to follow directions of a sane external operator and not of an attacker.
This makes me think I probably misunderstood what you meant earlier by “agents that are not primarily goal-directed”. Do you have a reference that you can point me to that describes what you have in mind in more detail?