I’m not TheAncientGeek, but I’m also a proponent of tool / oracle AI, so maybe I can speak to that. The proposals I’ve seen basically break down into two categories:
(1) Assuming the problem of steadfast goals has been solved—what MIRI refers to as highly reliable agents—you build an agent which provides (partial) answers to questions while obeying fixed constraints. The easiest example to analyze would be “Give me a solution to problem X, in the process consuming no more than Y megajoules of energy, then halt.” In this case the AI simply doesn’t have the energy budget to figure out how to trick us into achieving evil goal Z.
(This energy-constrained agent is not the typical example given in arguments for tool AI. More often the constraints are whack-a-mole things like “don’t make irreversible changes to your environment” or “don’t try to increase your hardware capacity”. IMHO this all too often clouds the issue, because it just invites responses of “what about situation Z” or “what if the AI does blah”.) A rough sketch of the energy-budget version follows after (2).
(2) Build an agent inside a box, and watch that box very carefully. E.g. this could be the situation in (1), but with an ammeter attached to a circuit breaker to enforce the energy constraint (among many, many other devices used to secure and observe the AI). This approach sidesteps the issue of friendliness / alignment entirely. The AI may be unfriendly, but impotent.
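To make (1) concrete, here's a minimal sketch of the energy-budget idea, purely as illustration. It assumes CPU-seconds as a crude stand-in for a real megajoule budget, and `constrained_oracle` / `solve_step` are made-up names, not anything from MIRI or Bostrom:

```python
import time

def constrained_oracle(solve_step, budget_seconds, state):
    """Illustrative sketch only: run an anytime solver under a hard resource
    budget (CPU seconds standing in for the megajoule budget in the example),
    then halt, returning whatever partial answer exists at that point."""
    start = time.process_time()
    while time.process_time() - start < budget_seconds:
        state, done = solve_step(state)  # one bounded unit of work
        if done:
            return state                 # full answer found within budget
    return state                         # budget exhausted: return partial answer and halt
```

The only point of the toy loop is that the halting condition is an externally fixed resource bound, not something the agent negotiates.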
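And a similarly rough sketch of the boxing version in (2): an external watchdog that integrates measured power draw and trips the breaker once the budget is spent. `read_power_watts` and `trip_breaker` are placeholders for whatever instrumentation the box actually has, not real APIs:

```python
import time

POWER_BUDGET_JOULES = 5e6   # the "Y megajoules" from the example, chosen arbitrarily
POLL_INTERVAL_S = 0.1

def watchdog(read_power_watts, trip_breaker):
    """Hypothetical external monitor: integrate power draw over time and
    cut power to the box once the cumulative energy budget is exhausted."""
    consumed_joules = 0.0
    last = time.monotonic()
    while consumed_joules < POWER_BUDGET_JOULES:
        time.sleep(POLL_INTERVAL_S)
        now = time.monotonic()
        consumed_joules += read_power_watts() * (now - last)  # watts x seconds
        last = now
    trip_breaker()  # the circuit breaker, not the AI, decides when things stop
```

The enforcement lives entirely outside the AI, which is why this approach doesn't depend on the AI being friendly.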
As far as I can tell, the issue of “has a goal” vs “does not have a goal” does not enter into the above proposals at all. The only people I’ve seen making that distinction are arguing against tool AI but missing the point. Of course agents have goals, and of course oracle AI has goals—just a more specific range of answer-the-question goals. The point is that oracle / tool AI has imposed constraints which limit their capability to do harm. They’re safe not because they provably work in our interests, but because they are too powerless to do us harm.
Okay, that’s fair enough. In the context of Superintelligence, though, in Table 11 a Tool AI is defined as follows: ‘Tool: A system not designed to exhibit goal-directed behaviour.’ I am responding directly to that. But it sounds as though you would object to Bostrom’s characterisation of tool proposals.
In Bostrom’s parlance, I think your proposals for Tool AI would be described as (1) Oracle AI + stunting and (2) Oracle AI + boxing—the energy thing is interesting. I’m hopeful they would be safe, but I’m not convinced it would take much energy to pose an existential threat.
I haven’t read Superintelligence (so maybe I shouldn’t be posting here?), but I have read much of Bostrom’s earlier work, and yes, I object to Bostrom’s prior characterisations of tool proposals, which basically amount to strawman setups. Either that or I misunderstood the original tool AI posts and they really were that bad :\ In my mind at least, “tool AI” is simply a proposal for using oracle AI in a specific way, as a sort of intelligence augmentation for its human operators.
Regarding energy constraints, that’s just one of an entire spectrum of possibilities. Approaching the problem from the other direction leads to suggestions such as cryptographically secure computational traces: you basically demand that the AI provide a subset of its call graph history as an audit log, but in such a way that the AI doesn’t know ahead of time which execution pathways will be demanded of it. You then have some sort of audit log checker, or manual inspection of the trace and the instrumental goals it reveals. Faking the log would require a detectable amount of work not accounted for in the trace.
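Here's roughly the shape of the idea in the simplest possible form (a hash chain with random spot checks, rather than anything fancier like Merkle proofs or verifiable computation); `commit_trace` and `audit` are names made up for illustration:

```python
import hashlib
import random

GENESIS = b"\x00" * 32

def hash_step(prev_digest, step_record):
    # Each step's digest commits to the step record and to everything before it.
    return hashlib.sha256(prev_digest + step_record.encode()).digest()

def commit_trace(steps):
    """Prover side (the AI): publish a hash chain over the full execution trace
    before learning which steps will be audited."""
    digests, d = [], GENESIS
    for record in steps:
        d = hash_step(d, record)
        digests.append(d)
    return digests

def audit(steps, digests, sample_size):
    """Verifier side: only after the commitment is fixed, pick random steps and
    check the revealed records against the chain, then inspect them by hand."""
    indices = random.sample(range(len(steps)), min(sample_size, len(steps)))
    for i in indices:
        prev = digests[i - 1] if i > 0 else GENESIS
        if hash_step(prev, steps[i]) != digests[i]:
            return False  # trace does not validate: terminate and never run again
        # a checker or human would also inspect steps[i] for hidden instrumental goals
    return True
```

A real construction would need to bind the trace to the actual computation rather than trusting the reported records, but it shows why the AI can't know in advance which pathways it will have to account for.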
It is possible to construct much better boxes than critics of oracle AI would have you believe.
How do energy constraints defeat a strategy of this form:
(1) While energy constraints are low, answer easy questions, but fake failure on some hard questions, claiming lack of resources.
(2) Once given enough excess energy beyond what is needed for a task, use the excess computing power to figure out a way to do unfriendly things.
Do cryptographic traces place a limit on the complexity of tasks that you can solve? It seems like, as task complexity increases, it will become harder to spot deception amongst legitimate computations (or the amount of human attention needed to achieve a fixed probability of detection scales with problem complexity).
You can’t fake failure. Whether it responds in time or not, a trace is provided. That trace, if it validates, provides a record of everything the program did (or rather a randomly selected subset, but that subset is not knowable in advance), even if it didn’t lead to a usable answer. If the trace doesn’t validate, you terminate the AI and never run it again.
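As a back-of-the-envelope illustration of how random auditing scales (the numbers are invented purely for the example):

```python
from math import comb

def detection_probability(n_steps, n_bad, n_audited):
    """Illustrative only: chance that a uniformly random audit of n_audited steps
    out of an n_steps trace hits at least one of n_bad deceptive steps."""
    return 1 - comb(n_steps - n_bad, n_audited) / comb(n_steps, n_audited)

# e.g. auditing 1% of a 100,000-step trace that hides 100 bad steps
print(detection_probability(100_000, 100, 1_000))  # roughly 0.63
```

On these toy numbers, the audit effort needed for a fixed detection probability depends mainly on how many steps the deception has to touch, which is exactly where the complexity question bites.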
To your second paragraph, these are open questions. But that’s precisely my point—I think MIRI should be engaged in researching these sorts of things, and not be biased towards areas relating to their own preconceived notions of what an AGI experiment should look like.