I think I disagree with this at least to some extent. Humans are not generally safe agents, and in order for not-primarily-goal-directed AIs to not exacerbate humans’ safety problems (for example by rapidly shifting their environments/inputs out of a range where they are known to be relatively safe), it seems that we have to solve many of the same metaethical/metaphilosophical problems that we’d need to solve to create a safe goal-directed agent. I guess in some sense the former has lower “AI risk” than the latter in that you can plausibly blame any bad outcomes on humans instead of AIs, but to me that’s actually a downside because it means that AI creators can more easily deny their responsibility to help solve those problems.
Learning how to design goal-directed agents seems like an almost inevitable milestone on the path to figuring out how to safely elicit human preference in an actionable form. But the steps involved in eliciting and enacting human preference don’t necessarily make use of a concept of preference or goal-directedness. An agent with a goal aligned with the world can’t derive its security from the abstraction of goal-directedness, because the world determines that goal, and so the goal is vulnerable to things in the world, including human error. Only self-contained artificial goals are safe from the world and may lead to safety of goal-directed behavior. A goal built from human uploads that won’t be updated from the world in the future gives safety from other things in the world, but not from errors of the uploads.
When the issue is figuring out which influences of the world to follow, it’s not clear that goal-directedness remains salient. If there is a goal, then there is also a world-in-the-goal and listening to your own goal is not safe! Instead, you have to figure out which influences in your own goal to follow. You are also yourself part of the world and so there is an agent-in-the-goal that can decide aspects of preference. This framing where a goal concept is prominent is not obviously superior to other designs that don’t pursue goals, and instead focus on pointing at the appropriate influences from the world. For example, a system may seek to make reliable uploads, or figure out which decisions of uploads are errors, or organize uploads to make sense of situations outside normal human environments, or be corrigible in a secure way, so as to follow directions of a sane external operator and not of an attacker. Once we have enough of such details figured out (none of which is a goal-directed agent), it becomes possible to take actions in the world. At that point, we have a system of many carefully improved kluges that further many purposes in much the same way as human brains do, and it’s not clearly an improvement to restructure that system around a concept of goals, because that won’t move it closer to the influences of the world it’s designed to follow.
This makes me think I probably misunderstood what you meant earlier by “agents that are not primarily goal-directed”. Do you have a reference that you can point me to that describes what you have in mind in more detail?