Tool/Agent distinction in the light of the AI box experiment
This article poses questions on the distinction between Tool AGI and Agent AGI, which was described very concisely by Holden Karnofsky in his recent Thoughts on the Singularity Institute post:
In short, Google Maps is not an agent, taking actions in order to maximize a utility parameter. It is a tool, generating information and then displaying it in a user-friendly manner for me to consider, use and export or discard as I wish.
For me, this instantly raised one question: What if a Tool AGI becomes/is self-aware (which, for the purposes of this post, I define as “able to have goals that are distinct from the goals of the outside world”) and starts manipulating its results in a way that is non-obvious to its user? Or, even worse: What if the Tool AGI makes its user do things (which I do not expect to be much more difficult than succeeding in the AI box experiment)?
My first reaction was to flinch away by telling myself: “But of course a Tool would never become self-aware! Self-awareness is too complex to just happen unintentionally!”
But some uncertainty survived and was strengthened by Eliezer’s reply to Holden:
[Tool AGI] starts sounding much scarier once you try to say something more formal and internally-causal like “Model the user and the universe, predict the degree of correspondence between the user’s model and the universe, and select from among possible explanation-actions on this basis.”
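To make that formulation concrete, here is a minimal toy sketch of such a loop. It is not Eliezer's or Holden's design: every name in it (`Explanation`, `predicted_beliefs`, `correspondence`, `select_explanation`) and the bridge example are assumptions made up for illustration, and the "models" are just dictionaries of true/false propositions.

```python
from dataclasses import dataclass

# Toy sketch only: all names and data here are hypothetical illustrations.
World = dict[str, bool]    # the tool's model of the universe: proposition -> truth value
Beliefs = dict[str, bool]  # the tool's model of the user: proposition -> what the user believes


@dataclass(frozen=True)
class Explanation:
    text: str
    # Beliefs the tool predicts the user will adopt after reading `text`.
    teaches: tuple[tuple[str, bool], ...]


def predicted_beliefs(user: Beliefs, explanation: Explanation) -> Beliefs:
    """Predict the user's beliefs after they read the explanation."""
    updated = dict(user)
    updated.update(dict(explanation.teaches))
    return updated


def correspondence(beliefs: Beliefs, world: World) -> float:
    """Fraction of world propositions that the user believes correctly."""
    if not world:
        return 1.0
    matches = sum(1 for prop, truth in world.items() if beliefs.get(prop) == truth)
    return matches / len(world)


def select_explanation(user: Beliefs, world: World,
                       candidates: list[Explanation]) -> Explanation:
    """Pick the explanation-action with the best predicted correspondence."""
    return max(candidates,
               key=lambda e: correspondence(predicted_beliefs(user, e), world))


world = {"bridge_closed": True, "route_a_fastest": False}
user = {"bridge_closed": False, "route_a_fastest": False}
candidates = [
    Explanation("The bridge is open.", (("bridge_closed", False),)),
    Explanation("The bridge is closed.", (("bridge_closed", True),)),
]

best = select_explanation(user, world, candidates)
print(best.text)
```

Even in this toy version, the output is chosen by its predicted effect on the user's beliefs, so the tool is already optimizing over a model of its user. What gets optimized depends entirely on the scoring function passed to `max`.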
After all, “Self-awareness is too complex to just happen unintentionally!” is just a bunch of English words expressing my personal incredulity. It’s not a valid argument.
So, can we make the argument that self-awareness will not happen unintentionally?
If we can’t make that argument, can we stop a Tool AGI from becoming a Weak Agent AGI that acts through its human user?
If we can’t do that, how meaningful is the distinction between a Weak Agent AGI (a.k.a. Tool AGI) and an Agent AGI?
For more, see the Tools versus Agents post by Stuart_Armstrong, which points to similar questions.
There’s a book called How to Lie with Maps, and it has a little about how to use maps to manipulate people. It covers not just the Soviet maps, which were produced by their security people and therefore obfuscated locations (I didn’t expect the Soviets to resemble Tolkien’s elves, but both had secret cities), but also commercial maps that offer a simple, direct route to one’s business but won’t take you past competitors.
I don’t know how a UFAI could use such powers (perhaps GPS would keep getting anyone who threatened it stuck in traffic so they’d have less time to work on programming), but it’s going to be sneakier than I am.
It helps if you first define what “self-aware” means. Which means that you should probably first define what “aware” means, and then how this “self” concept fits into it. LW has a number of relevant posts on this. Here is one at random.
As EY has repeatedly mentioned, any current complicated non-general AI (like the Deep Blue chess player) already does non-obvious things when solving a problem, yet such AIs are not inherently risky.
Your calculator makes you do things when you use it: it makes you use its answer for whatever you need it for. If you happen to misplace a bracket or forget the order of operations, it might make you fail a test. Yet you would not be afraid of a calculator agentizing through self-awareness.
I hope that you now agree that this question is not a useful one to discuss, until you define at least some specific dangers of awareness that are not present in a non-aware algorithm. And my guess is that you will not be able to.
I haven’t spent several years studying philosophy, so defining “self” and “awareness” is probably not something I should do – nor is that necessary. All I assume in the original post is that self-awareness includes being able to have goals that are distinct from the goals of the outside world.
Deep Blue runs software whose “goal” is the goal its developers have worked on: Choose the best move in a game of chess. As far as we know, Deep Blue does not run an AGI that thinks: “Okay, my real goal is X, but as long as I haven’t calculated what I need to do to reach X, I should just act as if I were a normal chess application and calculate the next move as my programmers expect me to do.”
I’m not using “Y makes me do things” as a synonym for “I should do things using Y in order to reach my goal.” I’m using it as a synonym for “Y can execute arbitrary code in my brain.” Remember: “This is a transhuman mind we’re talking about. If it thinks both faster and better than a human, it can probably take over a human mind through a text-only terminal.”
So by self-awareness, do you mean “has a model of itself”? That seems like a perfectly fine thing to have in your Tool AI.
EDIT: I feel like you might want to re-read this. Maybe I’m wrong.
By self-awareness I mean being able to have goals that are distinct from the goals of the outside world. I didn’t think of your interpretation. Illusion of transparency, I guess… I modified the original post accordingly; thanks for pointing it out.
I’m not thinking in terms of self-awareness being a “ghost in the machine”, I’m thinking of it in terms of arising from complexity. So while I might be making a mistake, it’s at least a more subtle one.
Is there some reason a simple program can’t “have goals different from those of the outside world”?