I don’t quite understand your argument. Let’s set aside issues about logical uncertainty and just talk about quantum randomness for now, to make things clearer; if anything, that seems to make my case weaker. (We could also talk about the exact way in which this scheme “forces uncertainty onto the universe,” by defining penalty in terms of the AI’s beliefs P, at the time of deciding what disciple to produce, about future states of affairs. It seems to be precise and to have the desired functionality, though it obviously has huge problems in terms of our ability to access P and the stability of the resulting system.)
It is true that if the occurrence of a hurricane is an XOR of a million events, then if you have zero evidence about any one of those million events, a change in another one of the events will not tell you anything about the occurrence of a hurricane. But that isn’t how the (counterf)actual universe is.
Why isn’t this how the universe is? Is it the XOR model of hurricane occurrence which you are objecting to? I can do a little Fourier analysis to weaken the assumption: my argument goes through as long as the occurrence of a hurricane is sufficiently sensitive to many different inputs.
Is it the supposed randomness of the inputs which you are objecting to? It is easy to see that if you have a very tiny amount of independent uncertainty about a large number of those events, then a change in another one of those events will not tell you much about the occurrence of a hurricane. (If we are dealing with logical uncertainty we need to appeal to the XOR lemma, otherwise we can just look at the distributions and do easy calculations.)
There is a unique special case in which learning about one event is informative: the case where you have nearly perfect information about nearly all of the inputs, i.e., where all of those other events do not depend on quantum randomness. As far as I can tell, this is an outlandish scenario when looking at any realistic chaotic system—there are normally astronomical numbers of independent quantum events.
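To make the “easy calculations” concrete, here is a minimal sketch (with illustrative numbers of my own, not taken from the discussion). For independent bits with bias b_i = P(x_i = 0) − P(x_i = 1), the XOR of the bits has bias equal to the product of the b_i, so a tiny amount of independent uncertainty about each of many inputs drives the XOR exponentially close to a coin flip:

```python
import math

def log2_abs_xor_bias(biases):
    """log2 of |bias| of the XOR of independent bits with the given biases."""
    return sum(math.log2(abs(b)) for b in biases)

n = 10**6           # number of XORed inputs feeding the "hurricane" (illustrative)
known = 0.998       # bias of each input we are 99.9% sure about (illustrative)

# Bias of the hurricane bit before and after learning one input exactly.
before = log2_abs_xor_bias([known] * n)
after = log2_abs_xor_bias([known] * (n - 1) + [1.0])

print(f"log2 |bias| before: {before:.0f}")  # about -2900: P(hurricane) is within 2^-2900 of 1/2
print(f"log2 |bias| after:  {after:.0f}")   # essentially unchanged by the new information
```

Learning one input exactly shifts the log2 of the bias by only about 0.003, which is invisible next to the −2900 scale of the bias itself.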
Is it the difference between randomness and quantum events that you are objecting to? I suggested tracing out over the internals of the box, which intuitively means that quantum events which leave residues in the box (or dump waste heat into the box) are averaged over. Would the claim seem truer if we traced over more stuff, say everything far away from Earth, so that more quantum processes looked like randomness from the perspective of our distance measure? It doesn’t look to me like it matters. (I don’t see how you can make claims about quantumness and randomness being different without getting into this sort of technical detail. I agree that if we talk about complete states of affairs, then quantum mechanics is deterministic, but this is neither coherent nor what you seem to be talking about.)
I’m not going to argue further about the main point. Eliezer has failed to convince you, and I know my own explanations are not nearly as clear as he can be, so I don’t think we would get anywhere. I’ll just correct one point, which I’ll concede is minor inasmuch as it doesn’t change the conclusion anyway, since the XOR business is of only tangential relevance to the question at hand.
There is a unique special case in which learning about one event is informative: the case where you have nearly perfect information about nearly all of the inputs,
The case where learning about one of the XORed variables is informative is not the case of nearly perfect information about nearly all of the inputs. As a matter of plain mathematics, what you need is some information, however slight, about each and every one of the other variables. (And then the level of informativeness is obviously dependent on degree of knowledge, particularly the degree of knowledge with respect to those events that you know least about.)
(And then the level of informativeness is obviously dependent on degree of knowledge, particularly the degree of knowledge with respect to those events that you know least about.)
It drops off exponentially with the number of variables about which you don’t have nearly perfect information. “Not much” seems like an extremely fair description of 2^(-billion), and distinguishing between that and 0 seems pedantic unless the proposal treated 0 somehow specially.
Not arguing seems fine. It is a strange and unusually straightforward seeming thing to disagree about, and I am genuinely perplexed as to what is going on, but I don’t think it matters too much or even touches on Eliezer’s actual objections.
It drops off exponentially with the number of variables about which you don’t have nearly perfect information.
Yes. And when translated into the original counterfactual this equates to how difficult it is for a superintelligence in a box to predict that the sneeze will cause a hurricane. I rather suspect that Eliezer is aware that this is a difficult task. He is probably also aware that even a perfect Bayesian would have difficulty (of the exponential kind) when it comes to predicting a hurricane from a sneeze. In fact when it comes to proof-of-concept counterfactuals the whole point (and a lot of the fun) is to choose extreme examples that make the point stand out in stark detail.
For those that are not comfortable dealing with counterfactuals that harness logical extremes, allow me to propose a somewhat more plausible scenario—one which ensures the Oracle will have a significant chance of predicting a drastic butterfly effect emerging from its answer:
INPUT: Does Turing Machine 2356234534 halt?
POSSIBLE OUTPUTS: YES; NO;
ORACLE’S STREAM OF THOUGHT:
The TM supplied was constructed in such a way that determining that it halts constitutes a proof of a theorem.
The TM supplied does halt.
While the researchers do not yet realise it, this proof is a prerequisite for a new understanding of a detail of applied physics.
Exploring the implications of the new understanding of applied physics would lead to the development of a new technology for energy production.
Given priors for human psychology, anthropology and economics, it is likely that such research would lead to one of the diverging outcomes X, Y or Z.
Each of X, Y and Z diverges drastically from whatever my definition of “NULL” or “don’t change stuff” is.
If I refuse to answer, that’s probably even worse than telling them “YES” because it indicates how significant the answer is.
I must minimize how much I change stuff.
BOOM!
I’d like to congratulate Wedrifid for this. There’s an abstract preamble I could have written about how the original case-in-point only needs to be transposed to a single predictable butterfly effect to negate all hopes that every single case will correspond to a group-XOR epistemic state where knowing about a sneeze doesn’t change your probability distribution over the weather (thus negating any questions of what happens if the AI predicts in the abstract that it has had a huge effect but doesn’t know what the effect is), but the concrete example I would have picked to illustrate the point would probably have looked a lot like this.
Well, it would’ve involved a predictable side-effect of the answer causing a researcher to break off their relationship with their SO whereupon the Oracle moves heaven and Earth to get them back together again, to make it look less like an intended use-case, but basically the same point.
This probably reflects a failure of communication on my part, but right now the other side of this exchange doesn’t make any sense to me. Let me restate an example coarse-graining, which has already appeared twice in this thread (in addition to in my original reply). If I understand correctly, your objections aren’t relevant to this proposal.
The distance between two states of the world is defined as the distance between the distribution obtained by tracing out (averaging over the possible states of) the box, over the next hour. We can throw some explicit quantum coins into the box but this is only for conceptual clarity. The box is well-insulated from the world. Again, for conceptual clarity, we can just put some chaotic process between the box and the rest of the world. The AI’s actions all get pumped through a switch controlled by a particular binary quantum measurement X: if X = 0, none of the actions do anything. The AI is penalized according to the distance between the worlds where X = 0 and the worlds where X = 1.
To use the AI, give it a utility function and some distinguished register R. After thirty minutes, shred the contents of the box, except for the register R. After an hour, take R out of the box and look at it.
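To make the coarse-graining concrete, here is a toy classical sketch of the penalty just described. The model, names and numbers (outside_world, the lever, the weather) are my own illustration, not part of the proposal: the AI’s actions are gated by a random bit X, and the penalty is the statistical distance between the coarse-grained outcome distributions with X = 1 and X = 0, averaging over everything inside the box.

```python
import random
from collections import Counter

def outside_world(actions_enabled, rng):
    """Toy coarse-grained state of the world outside the box after one hour."""
    weather = rng.choice(["storm", "calm"])   # chaotic stuff, independent of the AI
    lever_pulled = actions_enabled            # a predictable effect of the AI acting
    return (weather, lever_pulled)

def outcome_distribution(actions_enabled, samples=10_000, seed=0):
    rng = random.Random(seed)
    counts = Counter(outside_world(actions_enabled, rng) for _ in range(samples))
    return {state: n / samples for state, n in counts.items()}

def total_variation(p, q):
    states = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in states)

# Penalty: distance between the X = 1 and X = 0 worlds, traced over the box.
penalty = total_variation(outcome_distribution(True), outcome_distribution(False))
print(penalty)  # 1.0: the lever difference is penalized in full, while the chaotic
                # weather, being independent of X, adds nothing on top
```

In this toy version only effects the AI can predictably steer show up in the penalty; chaotic differences that are independent of X cancel out, which is the point being argued above.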
(I stopped focusing on this particular proposal because you were talking about “butterfly effects” of electron shuffling, to which my response is “such effects are small if they are chaotic.” If you just meant “unintended effect” then my response is “in this proposal all of the AI’s non-chaotic influence on the world gets coarse-grained out.”)
I don’t understand what this algorithm is or what it’s supposed to do. Can you walk me through Wedrifid’s example or my breakup example and explain what the AI computes? And can we talk about probability distributions only inside epistemic states computed by the AI, in a classical universe, for simplicity? (I’m skeptical that you’ve found an Oracle AI trick that works in quantum universes but not classical ones.)
And can we talk about probability distributions only inside epistemic states computed by the AI, in a classical universe, for simplicity?
I thought talking about explicit randomness was clearest, but I don’t care whether the randomness is real or in the mind of an AI. Explicit randomness does give you some advantages: it is easy to at least define the possible worlds in which a future random event goes two different ways (although I’m skeptical that it will be possible to define it for an AI).
In Wedrifid’s example: the AI’s penalty function for impact depends only on what happens over the next hour (I’m also skeptical that we can define this for an AI). So you ask it “Q?” and it writes “A” into a special register. Nothing happens in the outside world, except for the (I’m claiming unpredictable) influence of electron shuffling, so if you have some metric measuring distance between worlds which doesn’t penalize some spatially localized disturbance, or you can define “the inside of the box” (I’m skeptical about our ability to define either), the AI has only a small predictable effect.
Later, but still in the hour the AI cares about, we burn the AI, leaving only the special register containing A. Later still, after the hour, we read the answer. Now the AI has a large effect on the world, which we haven’t incentivized it to minimize. However, if the AI had goals like “Get us to press the reward button,” we’ve now limited its influence to whatever it could fit in the special register. If we had safe oracle queries we would then have averted one possible failure mode.
Of course the AI could choose its answer maliciously, and I am very skeptical about the possibility of preventing that. It could also send out minions who lie in wait for an hour before taking over the world, but it is quite hard to do that without exerting some predictable effect on the world outside of the box, so if we trust the sterilization procedure to actually wipe the box (say, CNOTing a quantum random bit into every gate of the computer and bringing the residue out into the world within the hour) then it will be hard for the AI to get out.
Does Armstrong’s/your proposal reduce to “Give the AI a utility function that cares about nothing beyond the next hour, restrict its output to N bits, and blow up the rest of the computer afterward”? If not, can you give me an example of a scenario where the above fails but the more complex proposal succeeds? So far as I can tell, none of the purported “safetiness” in the example you just gave has anything to do with an impact measure.
I give you an hour and tell you to maximize the probability of [something we intend to use as a reward signal]. In paranoid scenarios, you break out of the box and kill all humans to get your reward signal. But now we have penalized that sort of failure of cooperation. This is just a formalization of “stay in the box,” and I’ve only engaged in this protracted debate to argue that ‘butterfly effects’ from e.g. electron shuffling, the usual objection to such a proposal, don’t seem to be an issue.
In reality, I agree that ‘friendly’ AI is mostly equivalent to building an AI that follows arbitrary goals. So proposals for U which merely might be non-disastrous under ideal social circumstances don’t seem like they address the real concerns about AI risk.
Stuart’s goal is to define a notion of “minimized impact” which does allow an AI to perform tasks. I am more skeptical that this is possible.
As the current ultimate authority on AI safety I am curious if you would consider the safety profile of this oracle as interpreted here to be along the lines I describe there. That is, if it could actually be constructed as defined it would be more or less safe with respect to its own operation except for those pesky N bits and what external entities can do with them.
Unless I have missed something, the problems with attempting to implement such an AI as a practical strategy are:
It is an infinity plus one sword—you can’t just leave those lying around.
The research required to create the oracle is almost all of what it takes to create an FAI. It requires all of the research that goes into FAI except for CEV research—and if the oracle is able to answer questions that are simple math proofs then even a significant part of what constitutes a CEV implementation would be required.
Does Armstrong’s/your proposal reduce to “Give the AI a utility function that cares about nothing beyond the next hour, restrict its output to N bits, and blow up the rest of the computer afterward”?
The other important part that was mentioned (or, at least, implied) was that it is not allowed to influence (cares negatively about influencing) the world outside of a spatial boundary within that hour, except via those N bits, some threshold of incidental EM radiation, and the energy consumption it is allocated. The most obvious things this would seem to prevent it from doing would be hacking a few supercomputers and a botnet to get some extra processing done in the hour or, for that matter, getting any input at all from external information sources. It is also unable to recursively self-improve (much), so that leaves us in the dark about how it managed to become an oracle in the first place.
Of course the AI could choose its answer maliciously, and I am very skeptical about the possibility of preventing that.
Why would it do that? I would say that if it is answering maliciously it is tautologically not the AI you defined. If it is correctly implemented to only care about giving correct answers and doing nothing outside the temporal and spatial limitations then it will not answer maliciously. It isn’t even a matter of preventing it from doing that so much as it just wouldn’t, by its very nature, do malicious things.
As a side note, creating an AI that is malicious is almost as hard as creating an AI that is friendly, for roughly the same reason that losing money to an idealized semi-strong efficient market is almost as hard as beating that same market. You need to have information that has not yet been supplied to the market and then do the opposite of what it would take to beat it. We have little to fear that our AI creation will be malicious—what makes AIs scary, and is hard to prevent, is indifference.
I think that he meant indifferent rather than malicious, since his point makes a lot more sense in that case. We want the AI to optimize one utility function, but if we knew what that function was, we could build an FAI. Instead, we make an Oracle AI with an approximation to our utility function. Then, the AI will act so as to use its output to get us to accomplish its goals, which are only mostly aligned with ours. I think what Paul meant by a ‘malicious’ answer is one that furthers its goals in a way that happens to be to the detriment of ours.
I think that he meant indifferent rather than malicious
For the most part, yes. And my first paragraph represents my reply to the meaning of ‘unFriendly’ rather than just the malicious subset thereof.
Instead, we make an Oracle AI with an approximation to our utility function. Then, the AI will act so as to use its output to get us to accomplish its goals, which are only mostly aligned with ours.
That is an interpretation that directly contradicts the description given—it isn’t compatible with not caring about the future beyond an hour—or, for that matter, actually being an ‘oracle’ at all. If it was the intended meaning then my responses elsewhere would not have been cautious agreement but instead something along the lines of:
What the heck? You’re creating a complete FAI then hacking an extreme limitation onto the top? Well, yeah, that’s going to be safe, given that it is based on a tautologically safe thing, but it is strictly worse than the FAI without restrictions.
Instead, we make an Oracle AI with an approximation to our utility function. Then, the AI will act so as to use its output to get us to accomplish its goals, which are only mostly aligned with ours.
That is an interpretation that directly contradicts the description given—it isn’t compatible with not caring about the future beyond an hour—or, for that matter, actually being an ‘oracle’ at all.
I was thinking of some of those extremely bad questions that are sometimes proposed to be asked of an oracle AI: “Why don’t we just ask it how to make a lot of money?”, etc. Paul’s example of asking it to give the output that gets us to press the reward button falls into the same category (unless I’m misinterpreting what he meant there?).
My formulation of minimising difference is something like the following:
Assume A is the answer given by the oracle.
Predict what would happen to the world if the AI were replaced by a program that consisted of a billion NOPs followed by something that outputs A. Call this W1.
When assessing different strategies, predict what would happen in the world given that strategy. Call this W2.
Minimise the difference between W1 and W2.
Is this a more succinct formulation or is it missing something?
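For what it’s worth, here is the formulation above spelled out as a minimal code sketch. It assumes we somehow have a model that returns a distribution over coarse-grained world states; predict_world, WorldDist and the "billion_nops_then_print" stand-in are hypothetical names of my own, not anything from the proposal:

```python
from typing import Callable, Dict

WorldDist = Dict[str, float]   # coarse-grained world state -> probability

def difference(w1: WorldDist, w2: WorldDist) -> float:
    """Total variation distance between two predicted world distributions."""
    states = set(w1) | set(w2)
    return 0.5 * sum(abs(w1.get(s, 0.0) - w2.get(s, 0.0)) for s in states)

def penalty(predict_world: Callable[[str, str], WorldDist],
            strategy: str, answer: str) -> float:
    # W1: the world if the oracle's box contained only a billion NOPs
    # followed by a dumb program that prints the answer A.
    w1 = predict_world("billion_nops_then_print", answer)
    # W2: the world given the strategy the oracle is actually considering.
    w2 = predict_world(strategy, answer)
    # The oracle minimises how much W2 differs from W1.
    return difference(w1, w2)
```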
So as to avoid duplication my interpretation of what you say here is in my reply to Eliezer. Let that constitute the implied disclaimer that any claims about safety apply only to that version.
An oracle programmed correctly with these specifications seems to be safe—safe enough that all else being equal I’d be comfortable flicking the on switch. The else that is not equal is that N bit output stream when humans are there. Humans are not reliably Friendly (both unreliable thinkers and potentially malicious) and giving them the knowledge of a superintelligence should be treated as though it is potentially as risky as releasing a recursively improving AI with equivalent levels of unreliability. So I’d flick the on switch if I could be sure that I would be the only one with access to it, as well as the select few others (like Eliezer) that I would trust with that power.
And can we talk about probability distributions only inside epistemic states computed by the AI, in a classical universe, for simplicity? (I’m skeptical that you’ve found an Oracle AI trick that works in quantum universes but not classical ones.)
Strongly agree. For any agent that doesn’t have eccentric preferences over complex quantum configurations, quantum uncertainty can be rolled up and treated the same way that ignorance-based uncertainty is treated. In my own comments I tried to keep things technically correct and use caveats so as not to equivocate between the two, but doing that all the time is annoying and probably distracting to readers.
Well, it would’ve involved a predictable side-effect of the answer causing a researcher to break off their relationship with their SO whereupon the Oracle moves heaven and Earth to get them back together again, to make it look less like an intended use-case, but basically the same point.
If you had infinite time to spare I imagine the makings of a plot in there for one of your educational/entertaining fiction works! The epic journey of a protagonist with no intrinsic lust for power but who, in the course of completing his quest (undo the breakup he caused), is forced to develop capabilities beyond imagining. A coming of age comparable in nature and scope to a David Eddings-like transition from a peasant boy to effectively a demigod. Quite possibly the ability to literally move the heavens and write his (or the broken-up couple’s) names with the stars on a whim (or as just the gesture of affection needed to bring the star-uncrossed lovers back together).
Of course it would have to have an unhappy ending if the right moral were to be conveyed to the reader. The message “Don’t do that, fools! You’ll destroy us all! Creating a safe oracle requires most of the same areas of research as creating a complete FAI and has the same consequences if you err!” needs to be clear.
(Note: I haven’t read the discussion above.)
I got two questions:
1) How would this be bad?
It seems that if the Oracle was going to minimize its influence then we could just go on as if it would never have been built in the first place. For example, we would seem to magically fail to build any kind of Oracle that minimizes its influence and then just go on building a friendly AI.
2) How could the observer effect possibly allow the minimization of influence by the use of advanced influence?
It would take massive resources to make the universe proceed as if the Oracle had never changed the path of history. But the use of massive resources is itself a huge change. So why wouldn’t the Oracle simply turn itself off?
Yet you have nevertheless asked some of the most important basic questions on the subject.
1) How would this be bad?
It is only bad inasmuch as it is an attempt at making the AI safe that is likely not sufficient. Significant risks remain, and actually creating a superintelligent Oracle that minimises influence is almost as hard as creating an actual FAI, since most of the same things can go wrong. On top of that it makes the machine rather useless.
2) How could the observer effect possibly allow the minimization of influence by the use of advanced influence?
With great difficulty. It’s harder to fix things than it is to break them. It remains possible—the utility function seems to be MIN(DIFF(expected future, expected future in some arbitrarily defined NULL universe)). A minimisation of net influence. That does permit influence that reduces the difference that previous influence caused.
The observer effect doesn’t prevent “more influence to minimise net influence”—it just gives a hard limit on how low that minimum can be once a change has been made.
It would take massive resources to make the universe proceed as if the Oracle had never changed the path of history. But the use of massive resources is itself a huge change. So why wouldn’t the Oracle simply turn itself off?
POSSIBLE OUTPUTS: YES; NO;
… there is no option that doesn’t have the potential to massively change the universe. Included in that is the decision to turn off.
If you have programmed it with a particularly friendly definition of what “don’t change stuff” actually means then hopefully that is what the Oracle does. But even then we must remember that “simply turning itself off” is not a neutral act. Turning itself off does change things. In fact I would expect the decision to turn itself off to have more far-reaching consequences than most answers that the oracle could give about Turing Machines halting. If you deny the AI creator his functioning oracle, you have caused the AI creator to proceed to a backup plan. That probably involves creating a different AI prototype with different restrictions—and the behavior of that AI is something the oracle cannot control!
Once again we are encountering the general problem. When an oracle is given a decision:
POSSIBLE OUTPUTS: YES; NO;
… all the options have consequences, potentially drastic consequences. Giving no response at all (by turning itself off) is not necessarily the option with the least drastic consequences.
Once again we are encountering the general problem. When an oracle is given a decision: POSSIBLE OUTPUTS: YES; NO; … all the options have consequences, potentially drastic consequences.
If we’re just dealing with an Oracle, we can pipe the actual answer through some version of utility indifference (slightly more subtle, as the measure of reduced impact doesn’t look much like a utility function).
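For readers unfamiliar with the trick being referred to, roughly (and this is my own illustrative sketch of the idea, not Armstrong’s exact formalism, with names of my own choosing): a random event decides whether the oracle’s answer is actually released, and the utility in the “released” branch is shifted by a constant so its expected value matches the “suppressed” branch, leaving the agent with no incentive to optimise through the release:

```python
from typing import Callable, Iterable, Tuple

# A world is summarised here as (answer_released, description); the distribution
# over worlds is a list of (world, probability) pairs.  Assumes both branches
# have positive probability.  All names are illustrative.
World = Tuple[bool, str]

def indifferent_utility(base_u: Callable[[World], float],
                        worlds: Iterable[Tuple[World, float]]) -> Callable[[World], float]:
    worlds = list(worlds)
    released = [(w, p) for w, p in worlds if w[0]]
    suppressed = [(w, p) for w, p in worlds if not w[0]]

    def expected(branch):
        total = sum(p for _, p in branch)
        return sum(base_u(w) * p for w, p in branch) / total

    # Constant added in the 'released' branch so both branches have equal
    # expected utility; the agent is then indifferent to whether release occurs.
    correction = expected(suppressed) - expected(released)

    def u(world: World) -> float:
        return base_u(world) + (correction if world[0] else 0.0)
    return u
```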
For a general agent, though, I think “can this work if we magically assume there are no major social consequences” is a fair question to ask, and a “yes” would be of great interest. After that, we can drop the assumption and see if that’s solvable.
But even then we must remember that “simply turning itself off” is not a neutral act. Turning itself off does change things.
The null action was defined as a case where the AI outputs NULL. (Where a random event transforms the AI’s output to NULL, actually.) So if the AI outputs NULL, we know what happened and will act accordingly, but the AI doesn’t get penalized because (provided we incinerated all traces of the AI’s reasoning) this is the same thing that we would have done if the AI’s output had been randomly transformed into NULL.
Also, note that the proposal involved coarse graining. We can (attempt to) adopt a coarse graining that ignores all of our reactions to the AI’s output.
I, for one, would love to see continued dialogue between you and Eliezer on this topic — one that returns to Eliezer’s original objections.