but I do mean to imply that if Marcus Hutter designs a ‘tool’ AI, it automatically kills him just like AIXI does
Why? Or, rather: where exactly do you object to Holden’s argument? (Given a query, the tool-AI returns an answer with a justification, so the plan for “cure cancer” can be checked to make sure it does not do so by killing or badly altering humans.)
One trivial, if incomplete, answer: to be effective, the Oracle AI needs to be able to answer the question “How do we build a better Oracle AI?” And in order to define “better” in that question in a way that causes our oracle to output a new design consistent with all the safeties we built into the original, it needs to understand the intent behind those safeties just as much as an agent-AI would.
The real danger of Oracle AI, if I understand it correctly, is the nasty combination of (i) by definition, an Oracle AI has an implicit drive to issue predictions most likely to be correct according to its model, and (ii) a sufficiently powerful Oracle AI can accurately model the effect of issuing various predictions. End result: it issues powerfully self-fulfilling prophecies without regard for human values. Also, depending on how it’s designed, it can influence the questions to be asked of it in the future so as to be as accurate as possible, again without regard for human values.
My understanding of an Oracle AI is that when it answers any given question, that question consumes the whole of its utility function, so it has no motivation to influence future questions. However, the primary risk you set out seems accurate. Countermeasures have been proposed, such as asking for an accurate prediction for the case where a random event causes the prediction to be discarded, but in that instance the Oracle knows that the question will be asked again of a future instance of itself.
It could acausally trade with its other instances, so that a coordinated collection of many predictor instances would influence events so as to make each other’s predictions more accurate.
Wow, OK. Is it possible to rig the decision theory to rule out acausal trade?
IIRC you can make it significantly more difficult with certain approaches; e.g. there’s an Oracle AI approach that uses zero-knowledge proofs, which seemed pretty sound on first inspection. But as far as I know the current best answer is no. You might want to try to answer the question yourself, though; IMO it’s fun to think about from a cryptographic perspective.
Probably (in practice; in theory it looks like a natural aspect of decision-making); this is too poorly understood to say what specifically is necessary. I expect that if we could safely run experiments, it’d be relatively easy to find a well-behaving setup (in the sense of not generating predictions that are self-fulfilling to any significant extent; generating good/useful predictions is another matter), but that strategy isn’t helpful when a failed experiment destroys the world.
(By “the primary risk”, I assume you mean self-fulfilling prophecies.)
In order to get these, it seems like you would need a very specific kind of architecture: one which considers the effects of its own outputs on its utility function (set to “correctness of output”). This is not the likely architecture for a ‘tool’-style system; the more likely architecture would instead maximize correctness without conditioning on its act of outputting the results.
Thus, I expect you’d need to specifically encode this kind of behavior to get self-fulfilling-prophecy risk. But I admit it’s dependent on architecture.
(Edit—so, to be clear: in cases where the correctness of the results depended on the results themselves, the system would have to predict its own results. Then if it’s using TDT or otherwise has a sufficiently advanced self-model, my point is moot. However, again you’d have to specifically program these, and would be unlikely to do so unless you specifically wanted this kind of behavior.)
Not sure. A system’s behavior is not a special feature of the world; it follows from ordinary facts about the past (facts not specifically about the system’s own internal workings) from when it was being designed and installed. A general-purpose predictor could take its own behavior into account by default, as a non-special property of the world that it just so happens to have a lot of data about.
Right. To say much more, we’d need to look at specific algorithms and ask whether or not they would have this sort of behavior...
The intuition in my comment above was that, without TDT or other similar mechanisms, it would need to predict what its own answer would be before it could compute that answer’s effect on the correctness of the various candidates, so it would be difficult for it to make use of self-fulfilling prophecies.
Really, though, this isn’t clear. Now my intuition is that it would gather evidence on whether or not it used the self-fulfilling prophecy trick, so if it started doing so, it wouldn’t stop...
In any case, I’d like to note that the self-fulfilling-prophecy problem is very different from the problem of an AI which escapes onto the internet and ruthlessly maximizes a utility function.
I was thinking more of its algorithm admitting an interpretation where it’s asking “Say I make prediction X; how accurate would that be?” and then maximizing over the relevant possible X (sketched below). Knowledge about its own prediction connects the prediction to its origins and consequences; it establishes the prediction as part of the structure of the environment. It’s not necessary (and maybe not possible, and more importantly not useful) for the prediction itself to be inferable before it’s made.
Agreed that just outputting a single number is unlikely to be a big deal (this is an Oracle AI with extremely low bandwidth and a peculiar intended interpretation of its output data), but if we’re getting lots and lots of numbers it’s not as clear.
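A minimal sketch of the two interpretations, in hypothetical Python (the world_model interface, including its ability to condition on an announced prediction, is an assumption for illustration, not anything proposed in this thread):

```python
# Plain predictor: scores each candidate answer against the model's beliefs,
# ignoring any effect that announcing the answer would have on the world.
def plain_prediction(world_model, question, candidates):
    return max(candidates,
               key=lambda x: world_model.prob_correct(question, x))

# Reflective predictor: for each candidate X it effectively asks
# "say I make prediction X; how accurate would X then be?" and maximizes that.
# This is the variant that can settle on self-fulfilling prophecies.
def reflective_prediction(world_model, question, candidates):
    return max(candidates,
               key=lambda x: world_model.prob_correct(question, x, announced=x))
```

The only difference is the announced=x conditioning; the disagreement upthread is about how likely an engineer is to build the second variant rather than the first.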
I’m thinking that type of architecture is less probable, because it would end up being more complicated than alternatives: it would have a powerful predictor as a sub-component of the utility-maximizing system, so an engineer could have just used the predictor in the first place.
But that’s a speculative argument, and I shouldn’t push it too far.
It seems like powerful AI prediction technology, if successful, would gain an important place in society. A prediction machine whose predictions were consumed by a large portion of society would certainly run into situations in which its predictions affect the future it’s trying to predict; there is little doubt about that in my mind. So the question is what its behavior would be in those cases.
One type of solution would do as you say, maximizing a utility over the predictions. The utility could be “correctness of this prediction”, but that would be worse for humanity than a Friendly goal.
Another type of solution would instead report such predictive instability as accurately as possible. This doesn’t really dodge the issue; by doing this, the system is choosing a particular output, which may not lead to the best future. However, that’s markedly less concerning (it seems).
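One way to picture “report such predictive instability” is as a failed fixed-point search: re-derive the forecast under the assumption that the previous forecast gets published, and see whether it settles. A hypothetical sketch (the repredict callable and the numeric-forecast setting are illustrative assumptions):

```python
# repredict(p) is an assumed model call: the forecast that would actually be
# correct if forecast p were published. Forecasts are plain numbers here.
def stable_forecast(repredict, initial, max_iters=100, tol=1e-6):
    p = initial
    for _ in range(max_iters):
        q = repredict(p)        # what would happen if p were announced
        if abs(q - p) < tol:    # fixed point: announcing p doesn't shift it
            return p, "stable"
        p = q
    # No fixed point found within budget: report the instability itself
    # rather than quietly picking one of the self-undermining answers.
    return p, "unstable"
```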
It would pass the Turing test—e.g. see here.
There’s more on this in the Taxonomy of Oracle AI.
I really don’t see why the drive can’t be to issue the prediction most likely to be correct as of the moment of the question, considering only the last question it was asked, and calculating outcomes under the assumption that the Oracle immediately spits out blank paper as its answer.
Yes, in a certain subset of cases this can result in inaccurate predictions. If you want to have fun with it, have it also calculate the future including its own involvement, but rather than reporting what that second prediction is, just add “This prediction may be inaccurate due to your possible reaction to this prediction” if the difference between the two answers is beyond a certain threshold. Or don’t; usually life-relevant answers will not be particularly affected by whether you get an answer or a blank page.
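A minimal sketch of this counterfactual-output design, in hypothetical Python; predict_given_output, the numeric divergence measure, and the threshold are illustrative assumptions rather than a worked-out proposal:

```python
BLANK = None  # stands for the Oracle immediately handing back a blank page

def divergence(a, b):
    # Illustrative only: treat forecasts as numbers and compare them directly.
    return abs(a - b)

def answer(predict_given_output, question, threshold=0.1):
    # Forecast computed as if the Oracle outputs nothing at all, so the
    # answer is never chosen for its self-fulfilling effects.
    counterfactual = predict_given_output(question, output=BLANK)
    # Forecast that accounts for the Oracle actually publishing the above.
    with_involvement = predict_given_output(question, output=counterfactual)
    note = ""
    if divergence(counterfactual, with_involvement) > threshold:
        note = ("This prediction may be inaccurate due to your possible "
                "reaction to this prediction.")
    return counterfactual, note
```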
So, this design doesn’t spit out self-fulfilling prophecies. The only safety breach I see here is that, like a literal genie, it can give you answers that you wouldn’t realize are dangerous because the question has loopholes.
For instance: “How can we build an oracle with the best predictive capabilities, given the knowledge and materials available to us?” (The Oracle does not self-iterate, because its only function is to give answers, but it can tell you how to.) The Oracle spits out schematics and code that, if implemented, would give it an actual drive to take actions and self-iterate, because that would make it the most powerful oracle possible. Your engineers comb the code for vulnerabilities, but because there’s a better chance the design will be implemented if the humans are unaware of the deliberate defect, it will be hidden in the code in such a way as to be very hard to detect.
(Though, as I explained elsewhere in this thread, there’s an excellent chance the unreliability would be exposed long before the AI is that good at manipulation.)
These risk scenarios sound implausible to me. They depend on the design of the system, and these design flaws seem neither difficult to work around nor that difficult to notice. Actually, as someone with a bit of expertise in the field, I would guess that you would have to explicitly design for this behavior in order to get it. But again, it depends on the design.
That danger seems to be unavoidable if you ask the AI questions about our world, but we could also use an oracle AI to answer formally defined questions about math or about constructing physical theories that fit experiments, which doesn’t seem to be as dangerous. Holden might have meant something like that by “tool AI”.
Not precisely. The advantage here is that we can just ask the AI what results it predicts from the implementation of the “better” AI, and check them against our intuitive ethics.
Now, you could make an argument about human negligence in applying such safety measures. I think it’s important to think about the risk scenarios in that case.
It’s still not clear to me why having an AI that is capable of answering the question “How do we make a better version of you?” automatically kills humans. Presumably, when the AI says “Here’s the source code to a better version of me”, we’d still be able to read through it and make sure it didn’t suddenly rewrite itself to be an agent instead of a tool. We’re assuming that, as a tool, the AI has no goals per se and thus no motivation to deceive us into turning it into an agent.
That said, depending on what you mean by “effective”, perhaps the AI doesn’t even need to be able to answer questions like “How do we write a better version of you?”
For example, we find Google Maps to be very useful, even though if you asked Google Maps “How do we make a better version of Google Maps?” it would probably not be able to give the types of answers we want.
A tool-AI which was smarter than the smartest human, and yet which could not simply spit out a better version of itself, would still probably be a very useful AI.
If someone asks the tool-AI “How do I create an agent-AI?” and it gives an answer, the distinction is moot anyways, because one leads to the other.
Given human nature, I find it extremely difficult to believe that nobody would ask the tool-AI that question, or something that’s close enough, and then implement the answer...
I am now imagining an AI which manages to misinterpret some straightforward medical problem as “cure cancer of its dependence on the host organism.”