My guess is that agents that are not primarily goal-directed can be good at defending against goal-directed agents (especially with first mover advantage, preventing goal-directed agents from gaining power), and are potentially more tractable for alignment purposes, if humans coexist with AGIs during their development and operation (rather than only exist as computational processes inside the AGI’s goal, a situation where a goal concept becomes necessary).
I think the assumption that useful agents must be goal-directed has misled a lot of discussion of AI risk in the past. Goal-directed agents are certainly a problem, but not necessarily the solution. They are probably good for fixing astronomical waste, but maybe not AI risk.
I think I disagree with this at least to some extent. Humans are not generally safe agents, and in order for not-primarily-goal-directed AIs to not exacerbate humans’ safety problems (for example by rapidly shifting their environments/inputs out of a range where they are known to be relatively safe), it seems that we have to solve many of the same metaethical/metaphilosophical problems that we’d need to solve to create a safe goal-directed agent. I guess in some sense the former has lower “AI risk” than the latter in that you can plausibly blame any bad outcomes on humans instead of AIs, but to me that’s actually a downside because it means that AI creators can more easily deny their responsibility to help solve those problems.
Learning how to design goal-directed agents seems like an almost inevitable milestone on the path to figuring out how to safely elicit human preference in an actionable form. But the steps involved in eliciting and enacting human preference don’t necessarily make use of a concept of preference or goal-directedness. An agent with a goal aligned with the world can’t derive its security from the abstraction of goal-directedness, because the world determines that goal, and so the goal is vulnerable to things in the world, including human error. Only self-contained artificial goals are safe from the world and may lead to safety of goal-directed behavior. A goal built from human uploads that won’t be updated from the world in the future gives safety from other things in the world, but not from errors of the uploads.
When the issue is figuring out which influences of the world to follow, it’s not clear that goal-directedness remains salient. If there is a goal, then there is also a world-in-the-goal and listening to your own goal is not safe! Instead, you have to figure out which influences in your own goal to follow. You are also yourself part of the world and so there is an agent-in-the-goal that can decide aspects of preference. This framing where a goal concept is prominent is not obviously superior to other designs that don’t pursue goals, and instead focus on pointing at the appropriate influences from the world. For example, a system may seek to make reliable uploads, or figure out which decisions of uploads are errors, or organize uploads to make sense of situations outside normal human environments, or be corrigible in a secure way, so as to follow directions of a sane external operator and not of an attacker. Once we have enough of such details figured out (none of which is a goal-directed agent), it becomes possible to take actions in the world. At that point, we have a system of many carefully improved kluges that further many purposes in much the same way as human brains do, and it’s not clearly an improvement to restructure that system around a concept of goals, because that won’t move it closer to the influences of the world it’s designed to follow.
This framing where a goal concept is prominent is not obviously superior to other designs that don’t pursue goals, and instead focus on pointing at the appropriate influences from the world. For example, a system may seek to make reliable uploads, or figure out which decisions of uploads are errors, or organize uploads to make sense of situations outside normal human environments, or be corrigible in a secure way, so as to follow directions of a sane external operator and not of an attacker.
This makes me think I probably misunderstood what you meant earlier by “agents that are not primarily goal-directed”. Do you have a reference that you can point me to that describes what you have in mind in more detail?
My guess is that agents that are not primarily goal-directed can be good at defending against goal-directed agents (especially with first mover advantage, preventing goal-directed agents from gaining power), and are potentially more tractable for alignment purposes, if humans coexist with AGIs during their development and operation (rather than only exist as computational processes inside the AGI’s goal, a situation where a goal concept becomes necessary).
I think the assumption that useful agents must be goal-directed has misled a lot of discussion of AI risk in the past. Goal-directed agents are certainly a problem, but not necessarily the solution. They are probably good for fixing astronomical waste, but maybe not AI risk.
I think I disagree with this at least to some extent. Humans are not generally safe agents, and in order for not-primarily-goal-directed AIs to not exacerbate humans’ safety problems (for example by rapidly shifting their environments/inputs out of a range where they are known to be relatively safe), it seems that we have to solve many of the same metaethical/metaphilosophical problems that we’d need to solve to create a safe goal-directed agent. I guess in some sense the former has lower “AI risk” than the latter in that you can plausibly blame any bad outcomes on humans instead of AIs, but to me that’s actually a downside because it means that AI creators can more easily deny their responsibility to help solve those problems.
Learning how to design goal-directed agents seems like an almost inevitable milestone on the path to figuring out how to safely elicit human preference in an actionable form. But the steps involved in eliciting and enacting human preference don’t necessarily make use of a concept of preference or goal-directedness. An agent with a goal aligned with the world can’t derive its security from the abstraction of goal-directedness, because the world determines that goal, and so the goal is vulnerable to things in the world, including human error. Only self-contained artificial goals are safe from the world and may lead to safety of goal-directed behavior. A goal built from human uploads that won’t be updated from the world in the future gives safety from other things in the world, but not from errors of the uploads.
When the issue is figuring out which influences of the world to follow, it’s not clear that goal-directedness remains salient. If there is a goal, then there is also a world-in-the-goal and listening to your own goal is not safe! Instead, you have to figure out which influences in your own goal to follow. You are also yourself part of the world and so there is an agent-in-the-goal that can decide aspects of preference. This framing where a goal concept is prominent is not obviously superior to other designs that don’t pursue goals, and instead focus on pointing at the appropriate influences from the world. For example, a system may seek to make reliable uploads, or figure out which decisions of uploads are errors, or organize uploads to make sense of situations outside normal human environments, or be corrigible in a secure way, so as to follow directions of a sane external operator and not of an attacker. Once we have enough of such details figured out (none of which is a goal-directed agent), it becomes possible to take actions in the world. At that point, we have a system of many carefully improved kluges that further many purposes in much the same way as human brains do, and it’s not clearly an improvement to restructure that system around a concept of goals, because that won’t move it closer to the influences of the world it’s designed to follow.
This makes me think I probably misunderstood what you meant earlier by “agents that are not primarily goal-directed”. Do you have a reference that you can point me to that describes what you have in mind in more detail?