Any time you have a search process (and, let’s be real, most of the things we think of as “smart” are search problems), you are setting a target but not specifying how to get there. I think the important sense of the word “agent” in this context is that it’s a process that searches for an output based on the modeled consequences of that output.
For example, if you want to colonize the upper atmosphere of Venus, one approach is to make an AI that evaluates outputs (e.g. text outputs of persuasive arguments and technical proposals) based on some combined metric of how much Venus gets colonized and how much it costs. Because it evaluates outputs based on their consequences, it’s going to act like an agent that wants to pursue its utility function at the expense of everything else.
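To make the “search on modeled consequences” criterion concrete, here is a minimal Python sketch (every name in it is hypothetical, not any real system’s API): an agent-like search scores each candidate output by the utility of the world it predicts will result from releasing that output, and returns the highest-scoring one.

```python
from typing import Callable, Iterable, TypeVar

Output = TypeVar("Output")
World = TypeVar("World")

def consequentialist_search(
    candidates: Iterable[Output],
    predict: Callable[[Output], World],   # world model: what happens if this output is released?
    utility: Callable[[World], float],    # e.g. "how colonized Venus gets, minus how much it cost"
) -> Output:
    # The agent-defining feature: each output is scored by the *predicted
    # consequences* of emitting it, not by any intrinsic property of the output.
    return max(candidates, key=lambda out: utility(predict(out)))
```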
Call the above output “the plan”—you can make a “tool AI” that still outputs the plan without being an agent!
Just make it so that the plan is merely part of the output—the rest is composed according to some subprogram that humans have designed for elucidating the reasons the AI chose that output (call this the “explanation”). The AI predicts the results as if its output were only the plan, but what humans see is both the plan and the explanation, so it’s no longer fulfilling the criterion for agency above.
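Here is the same sketch bent into the “tool AI” shape just described (again, all names are hypothetical): the search still selects the plan by modeling the consequences of the plan alone, and the explanation is composed afterward by a human-designed subprogram that the search never optimizes over.

```python
from typing import Callable, Iterable, Tuple, TypeVar

Plan = TypeVar("Plan")
World = TypeVar("World")

def tool_ai(
    candidates: Iterable[Plan],
    predict: Callable[[Plan], World],                 # consequences are modeled for the plan *alone*
    utility: Callable[[World], float],
    elucidate_reasons: Callable[[Plan, World], str],  # human-designed explanation subprogram
) -> Tuple[Plan, str]:
    # Same search as before: the plan is still chosen by its modeled consequences.
    plan = max(candidates, key=lambda p: utility(predict(p)))
    # The explanation is composed afterward and never enters the search,
    # so the combined output (plan + explanation) was not itself optimized.
    explanation = elucidate_reasons(plan, predict(plan))
    return plan, explanation
```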
In this example, the plan is a bad idea in both cases—the thing you programmed the AI to search for is probably something that’s bad for humanity when taken to an extreme. It’s just that in the “tool AI” case, you’ve added some extra non-search-optimized output that you hope undoes some of the work of the search process.
Making your search process into a tool by adding the reason-elucidator hopefully made it less disastrously bad, but it didn’t actually get you a good plan. The problems that you need to solve to get a superhumanly good plan are in fact the same problems you’d need to solve to make the agent safe.
(Sidenote: This can be worked around by giving your tool AI a simplified model of the world and then relying on humans to un-simplify the resulting plan, much like Google Maps makes a plan in an extremely simplified model of the world and then you follow something that sort of looks like that plan. This workaround fails when the task of un-simplifying the plan becomes superhumanly difficult, i.e. right around when things get really interesting, which is why imagining a Google-Maps-like list of safe abstract instructions might be building a false intuition.)
In short, to actually find out the superintelligently awesome plan to solve a problem, you have to have a search process that’s looking for the plan you want. Since this sounds a lot like an agent, and an unfriendly agent is one of the cases we’re most concerned about, it’s easy and common to frame this in terms of an agent.