Perhaps we should be clearer on the relationship between agents, search processes and goals. Bostrom’s argument in this section seemed to be that a strong enough search process is more or less a goal, and so (assuming goals are the main interesting feature of agents) ‘tools’ will be much like agents, in the problematic ways. Are search processes and goals basically the same?
Perhaps one could say that an agent in the sense that matters for this discussion is something with a personal identity, a notion of self (in a very loose sense).
Intuitively, it seems that tool AIs are safer because they are much more transparent. When I run a modern general purpose constraint-solver tool, I’m pretty sure that no AI agent will emerge during the search process. When I pause the tool somewhere in the middle of the search and examine its state, I can predict exactly what the next steps are going to be—even though I can hardly predict the ultimate result of the search!
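To make the transparency point concrete, here is a toy backtracking solver of my own (a sketch, not any real solver): pause it anywhere, look at the partial assignment, and you can say exactly what it will try next, even though the eventual answer isn’t obvious up front.

```python
# Toy sketch of a transparent search: plain depth-first backtracking over a
# tiny constraint problem. Everything below is illustrative, not a real solver.

def backtracking_search(variables, domains, constraints, assignment=None, trace=None):
    """Depth-first backtracking. `constraints` are functions that return False
    only when a partial assignment definitely violates them."""
    if assignment is None:
        assignment = {}
    if trace is not None:
        # Pausing here and reading `assignment` tells you exactly which
        # variable and value will be tried next; nothing hidden accumulates.
        trace.append(dict(assignment))
    if len(assignment) == len(variables):
        return assignment
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        if all(c(assignment) for c in constraints):
            result = backtracking_search(variables, domains, constraints, assignment, trace)
            if result is not None:
                return result
        del assignment[var]
    return None

# Toy problem: colour a path of three regions so that adjacent regions differ.
variables = ["A", "B", "C"]
domains = {v: ["red", "green", "blue"] for v in variables}
adjacent = [("A", "B"), ("B", "C")]
constraints = [
    (lambda a, x=x, y=y: not (x in a and y in a and a[x] == a[y]))
    for x, y in adjacent
]
trace = []
print(backtracking_search(variables, domains, constraints, trace=trace))
print("visited", len(trace), "partial assignments")
```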
In contrast, the actions of an agent are influenced by its long-term state (its “personality”), so its algorithm is not straightforward to predict.
I feel that the only search processes capable of internally generating agents (the thing Bostrom is worried about) are the insufficiently transparent ones (e.g. those using neural nets).
All this seems to be more or less well explained under Optimization process and Really powerful optimization process, but I’ll give my take on it, heavily borrowed from those and related readings.
I went around in circles on ‘goals’ until I decided to be rigorous in thinking naturalistically rather than anthropomorphically, or mentalistically, for want of a better term. It seems to me that a goal ought to correspond to a set of world states, and then, naturalistically, the ‘goal’ of a process might be a set of world states that the process tends to modify the world towards: a random walk would have no goal, or alternatively, its goal would be any possible world. My goals involve world states where my body is comfortable, I am happy, etc.
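A toy way of cashing that out (my own sketch, not anything from the book): treat a set of world states as a process’s ‘goal’ if running the process tends to move the world into that set. A random walk shows no such tendency toward any small set; even a dumb hill-climber does.

```python
# Sketch: 'goal-directedness' as a measured tendency to end up in a target set
# of world states. The world here is just an integer, purely for illustration.
import random

GOAL = set(range(95, 101))            # the candidate 'goal': states 95..100

def random_walk(state):               # no tendency toward any particular set
    return state + random.choice([-1, 1])

def hill_climber(state):              # tends toward GOAL: always steps up to 100
    return state + 1 if state < 100 else state

def tendency_toward_goal(step, trials=500, horizon=200):
    hits = 0
    for _ in range(trials):
        state = random.randint(0, 100)
        for _ in range(horizon):
            state = step(state)
        hits += state in GOAL
    return hits / trials

print("random walk :", tendency_toward_goal(random_walk))    # low
print("hill climber:", tendency_toward_goal(hill_climber))   # 1.0
```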
It depends on what Bostrom means by a search process, but, taking a stab, in this context it would not really be distinct from a goal provided it had an objective. In this framework, Google Maps can be described as having a goal, but it’s pretty prosaic: manipulate the pixels on the user’s screen in a way that represents the shortest route given the inputs. It’s hugely ‘indifferent’ between world states that do not involve changes to those pixels.
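For instance, a toy shortest-route search (nothing like Google’s actual system, obviously) has its ‘goal’ built in as the quantity it minimises, and it is indifferent to everything about the world except the distances it is given and the answer it returns:

```python
# Toy route search: the objective (minimise total distance) just is the 'goal'.
import heapq

def shortest_route(graph, start, end):
    """graph: {node: [(neighbour, distance), ...]}"""
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == end:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, d in graph.get(node, []):
            if neighbour not in visited:
                heapq.heappush(queue, (dist + d, neighbour, path + [neighbour]))
    return None

roads = {
    "home": [("junction", 2), ("ring road", 5)],
    "junction": [("office", 3)],
    "ring road": [("office", 1)],
}
print(shortest_route(roads, "home", "office"))  # (5, ['home', 'junction', 'office'])
```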
I’m not too keen on the distinction between agents and tools made by Holden because, as he says, any process can be described as having a goal—a nuclear explosion can probably be described this way—but in this context a Tool AI would possibly be described as one that is similarly hugely ‘indifferent’ between world states in that it has no tendency to optimise towards them (I’m not that confident that others would be happy with that description).
([Almost] unrelated pedant note: I don’t think utility functions are expressive enough to capture all potentially relevant behaviours and would suggest it’s better to talk more generally of goals: it’s more naturalistic, and makes fewer assumptions about consistency and rationality.)
You are tacitly assuming that having goals is sufficient for agency. But what the proponents of the tool argument seem to mean by an agent, as opposed to a tool, is something that will start pursuing goals on boot up.
A nuclear bomb could be described as having goals so long as you are not bothered about adaptive planning towards those goals....which is a good reason for caring about adaptive planning.
1) I must admit that I’m a little sad that this came across as tacit: that was in part the point I was trying to make! I don’t feel totally comfortable with the distinction between tools and agents because I think it mostly, and physically, vanishes when you press the start button on the tool, which is much the same as booting the agent. In practice, I can see that something that always pauses and waits for the next input might be understood as not an agent; is that something you might agree with?
My understanding of [at least one variant of] the tool argument is more that a) software tools can be designed that do not exhibit goal-based behaviour, which b) would be good because the instrumental values argument for deadliness would no longer apply. But since anything can be described as having goals (they are just a model of behaviour) the task of evaluating the friendliness of those ‘goals’ would remain. Reducing this just to ‘programs that always pause before the next input’ or somesuch doesn’t seem to match the tool arguments I’ve read. Note: I would be very pleased to have my understanding of this challenged.
Mostly, I am trying to pin down my own confusion about what it means for a physical phenomenon to ‘have a goal’, firstly, because goal-directedness is so central to the argument that superintelligence is dangerous, and secondly, because the tool AI objection was the first that came to mind for me.
2) Hmm, this might be splitting hairs, but I think I would prefer to say that a nuclear bomb’s ‘goals’ are limited to a relatively small subset of the world state, which is why it’s much less dangerous than an AI at the existential level. The lack of adaptive planning of a nuclear bomb seems less relevant than its blast radius in evaluating the danger it poses!
EDIT: reading some of your other comments here, I can see that you have given a definition for an agent roughly matching what I said—sorry I missed that! I would still be interested in your response if you have one :)
I’m not TheAncientGeek, but I’m also a proponent of tool / oracle AI, so maybe I can speak to that. The proposals I’ve seen basically break down into two categories:
(1) Assuming the problem of steadfast goals has been solved—what MIRI refers to as highly reliable agents—you build an agent which provides (partial) answers to questions while obeying fixed constraints. The easiest to analyze example would be “Give me a solution to problem X, in the process consuming no more than Y megajoules of energy, then halt.” In this case the AI simply doesn’t have the energy budget to figure out how to trick us into achieving evil goal Z.
(This energy-constrained agent is not the typical example given in arguments for tool AI. More often the constraints are whack-a-mole things like “don’t make irreversible changes to your environment” or “don’t try to increase your hardware capacity”. IMHO this all too often clouds the issue, because it just generates a response of “what about situation Z”, or “what if the AI does blah”.)
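For concreteness, here is a minimal sketch of the budgeted version of (1), with a step counter standing in for the joule meter (every name here is made up for illustration): the oracle gets a hard resource budget and halts when it is spent, whether or not it has an answer.

```python
# Hypothetical sketch of a resource-stunted oracle: a hard budget, drawn down
# by every unit of work, with an unconditional halt when it runs out.

class BudgetExhausted(Exception):
    pass

class BudgetedOracle:
    def __init__(self, budget_steps):
        self.remaining = budget_steps

    def spend(self, steps=1):
        self.remaining -= steps
        if self.remaining < 0:
            raise BudgetExhausted

    def solve(self, target):
        """Toy 'problem X': find the smallest n with n*n >= target, brute force."""
        n = 0
        try:
            while n * n < target:
                self.spend()          # every unit of work draws down the budget
                n += 1
        except BudgetExhausted:
            return None               # out of resources: no answer, just halt
        return n

print(BudgetedOracle(budget_steps=10_000).solve(target=1_000_000))  # 1000
print(BudgetedOracle(budget_steps=10).solve(target=1_000_000))      # None
```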
(2) Build an agent inside of a box, and watch that box very carefully. E.g. this could be the situation in (1), but with an ammeter attached to a circuit breaker to enforce the energy constraint (among many, many other devices used to secure and observe the AI). This approach sidesteps the issue of friendliness / alignment entirely. The AI may be unfriendly, but impotent.
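And a minimal sketch of the ‘watch the box’ side of (2): an external monitor polls a power reading and trips the breaker when the allowance is exceeded. The read_power_draw_watts and trip_breaker hooks are hypothetical stand-ins for whatever instrumentation the box actually has.

```python
# Hypothetical watchdog for a boxed AI: poll an external power sensor and cut
# power (a physical stop, outside the AI's software) if limits are exceeded.
import time

POWER_LIMIT_WATTS = 500.0
ENERGY_BUDGET_JOULES = 3.6e6          # e.g. 1 kWh for the whole run
POLL_INTERVAL_S = 0.1

def watchdog(read_power_draw_watts, trip_breaker):
    spent_joules = 0.0
    while True:
        watts = read_power_draw_watts()
        spent_joules += watts * POLL_INTERVAL_S
        if watts > POWER_LIMIT_WATTS or spent_joules > ENERGY_BUDGET_JOULES:
            trip_breaker()            # hard stop; not negotiable in software
            return spent_joules
        time.sleep(POLL_INTERVAL_S)

# Demo with a simulated sensor whose readings climb until the breaker trips.
readings = iter([100.0, 200.0, 400.0, 800.0])
print(watchdog(lambda: next(readings), lambda: print("breaker tripped")))
```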
As far as I can tell, the issue of “has a goal” vs “does not have a goal” does not enter into the above proposals at all. The only people I’ve seen making that distinction are arguing against tool AI but missing the point. Of course agents have goals, and of course oracle AI has goals—just a more specific range of answer-the-question goals. The point is that oracle / tool AI has imposed constraints which limit its capability to do harm. It’s safe not because it provably works in our interests, but because it is too unempowered to do us harm.
Okay, that’s fair enough.
In the context of Superintelligence, though, in Table 11 a Tool AI is defined thusly: ‘Tool: A system not designed to exhibit goal-directed behaviour.’ I am responding directly to that. But it sounds as though you would object to Bostrom’s characterisation of tool proposals.
In Bostrom’s parlance, I think your proposals for Tool AI would be described as (1) Oracle AI + stunting and (2) Oracle AI + boxing—the energy thing is interesting. I’m hopeful they would be safe, but I’m not convinced it would take much energy to pose an existential threat.
I haven’t read Superintelligence (so maybe I shouldn’t be posting here?), but I have read much of Bostrom’s earlier work, and yes, I object to Bostrom’s prior characterisations of tool proposals, which basically amount to strawman setups. Either that or I misunderstood the original tool AI posts and they really were that bad :\ In my mind at least, “tool AI” is simply a proposal for using oracle AI in a specific way, as a sort of intelligence augmentation for its human operators.
Regarding energy constraints, that’s just one of an entire spectrum of possibilities. Approaching the problem from the other direction leads to suggestions such as cryptographically secure computational traces: you basically demand that the AI provide a subset of its call graph history as an audit log, but in a way that the AI doesn’t know ahead of time which execution pathways will be demanded of it. You then have some sort of audit-log checker or manual inspection of the trace and the instrumental goals it reveals. Faking the log would require a detectable amount of work not accounted for in the trace.
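Here is a toy commit-and-challenge version of what I have in mind (a sketch of the idea, not a real protocol): the AI commits to its whole trace before it learns which steps will be inspected, so it cannot tailor a fake log to the challenge.

```python
# Toy audit-log scheme: commit to every step of the trace, then reveal a
# randomly chosen subset on demand. Illustrative only, not a real protocol.
import hashlib, json, random

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def commit_to_trace(trace):
    """trace: list of step records (e.g. dicts describing calls made)."""
    leaf_hashes = [h(json.dumps(step, sort_keys=True).encode()) for step in trace]
    return leaf_hashes, h(b"".join(leaf_hashes))      # (leaves, commitment)

def respond_to_challenge(trace, leaf_hashes, challenged):
    # A real scheme would use Merkle proofs instead of shipping every leaf hash.
    return leaf_hashes, {i: trace[i] for i in challenged}

def audit(commitment, leaf_hashes, revealed_steps):
    if h(b"".join(leaf_hashes)) != commitment:
        return False                                  # leaves don't match commitment
    return all(
        h(json.dumps(step, sort_keys=True).encode()) == leaf_hashes[i]
        for i, step in revealed_steps.items()
    )

# The AI runs, logging steps, and publishes only `commitment` up front.
trace = [{"step": i, "call": f"subroutine_{i % 3}"} for i in range(1000)]
leaf_hashes, commitment = commit_to_trace(trace)

# Only after the commitment does the auditor choose which steps to inspect.
challenge = random.sample(range(len(trace)), k=10)
print(audit(commitment, *respond_to_challenge(trace, leaf_hashes, challenge)))  # True
```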
It is possible to construct much better boxes than critics of oracle AI would have you believe.
How do energy constraints defeat a strategy of this form:
While energy constraints are low, answer easy questions, but fake failure on some hard questions claiming lack of resources
Once given enough excess energy beyond that needed for a task, use the excess computing power to figure out a way to do unfriendly things
Do cryptographic traces place a limit on the complexity of tasks that you can solve? It seems like as task complexity increases, it will become harder to spot deception amongst legitimate computations (or the amount of human attention you need to achieve a fixed probability of detection scales according to problem complexity).
You can’t fake failure. Whether it responds in time or not, a trace is provided. That trace, if it validates, provides a record of everything the program did (or rather a randomly selected subset, but one not knowable in advance), even if the run didn’t lead to a usable answer. If the trace doesn’t validate, you terminate the AI and never run it again.
To your second paragraph, these are open questions. But that’s precisely my point—I think MIRI should be engaged in researching these sorts of things, and not be biased towards areas relating to their own preconceived notions of what an AGI experiment should look like.
Yes, I take a tool to be something that always waits, that defaults to doing nothing.
If everything has a goal, and goals are dangerous, everything is dangerous. Which is a false conclusion. So there must be a false assumption leading to it. Such as all systems having goals.
The kinds of danger MIRI is worrying about come from the way goals are achieved, e.g. from instrumental convergence, so MIRI shouldn’t be worrying about goals absent adaptive strategies for achieving them, and in fact it is hard to see what is gained from talking in those terms.
I also disagree with that false conclusion, but I would probably say that ‘goals are dangerous’ is the false premise. Goals are dangerous when, well, they actually are dangerous (to my life or yours), and when they are attached to sufficient optimising power, as you get at in your last paragraph.
I think the line of argumentation Bostrom is taking here is that superintelligence by definition has a huge amount of optimisation power, so whether it is dangerous to us is reduced to whether its goals are dangerous to us.
(Happy New Year!)
MIRI’s argument, which I agree with for once, is that a safe goal can have dangerous subgoals.
The tool AI proponents’ argument, as I understand it, is that a system that defaults to doing nothing is safer.
I think MIRI types are persistently mishearing that, because they have an entirely different set of presuppositions....that safety is all-or-nothing, not a series of mitigations. That safety is not a matter of engineering, but mathematical proof....even though you can’t prove anything beyond the point where the uncertainty within the system is less than the uncertainty about the system.
You’d have to clarify what you mean by “a huge amount of optimization power.” I can imagine plenty of better-than-human intelligences which nevertheless would not have the capability to pose a significant threat to humanity.
Assuming that acting without being prompted is the interesting feature of agents, tools won’t be very like agents.
What are you searching for? That’s the goal.