This probably reflects a failure of communication on my part, but right now the other side of this exchange doesn’t make any sense to me. Let me restate an example coarse-graining, which has already appeared twice in this thread (in addition to in my original reply). If I understand correctly, your objections aren’t relevant to this proposal.
The distance between two states of the world is defined as the distance between the distributions obtained by tracing out (averaging over the possible states of) the box, over the next hour. We can throw some explicit quantum coins into the box, but this is only for conceptual clarity. The box is well-insulated from the world. Again, for conceptual clarity, we can just put some chaotic process between the box and the rest of the world. The AI’s actions all get pumped through a switch controlled by a particular binary quantum measurement X: if X = 0, none of the actions do anything. The AI is penalized according to the distance between the worlds where X = 0 and the worlds where X = 1.
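(To make this concrete, here is a toy classical sketch of that penalty, representing a state of the world as a distribution over (box, outside) pairs; the representation and all function names are my own illustration, not part of the proposal.)

```python
from collections import defaultdict

def trace_out_box(joint):
    """Average over the possible states of the box, keeping only the
    distribution over the outside world."""
    outside = defaultdict(float)
    for (box_state, outside_state), prob in joint.items():
        outside[outside_state] += prob
    return dict(outside)

def total_variation(p, q):
    """One simple choice of distance between two distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def penalty(world_given_x0, world_given_x1):
    """Penalize the AI by the distance between the coarse-grained worlds
    where X = 0 and where X = 1, after the box has been traced out."""
    return total_variation(trace_out_box(world_given_x0),
                           trace_out_box(world_given_x1))
```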
To use the AI, give it a utility function and some distinguished register R. After thirty minutes, shred the contents of the box, except for the register R. After an hour, take R out of the box and look at it.
(I stopped focusing on this particular proposal because you were talking about “butterfly effects” of electron shuffling, to which my response is “such effects are small if they are chaotic.” If you just meant “unintended effect” then my response is “in this proposal all of the AI’s non-chaotic influence on the world gets coarse-grained out.”)
I don’t understand what this algorithm is or what it’s supposed to do. Can you walk me through Wedrifid’s example or my breakup example and explain what the AI computes? And can we talk about probability distributions only inside epistemic states computed by the AI, in a classical universe, for simplicity? (I’m skeptical that you’ve found an Oracle AI trick that works in quantum universes but not classical ones.)
And can we talk about probability distributions only inside epistemic states computed by the AI, in a classical universe, for simplicity?
I thought talking about explicit randomness was clearest, but I don’t care whether the randomness is real or in the mind of an AI. Explicit randomness does give you some advantages: it is easy to at least define the possible worlds in which a future random event goes two different ways (although I’m skeptical that it will be possible to define it for an AI).
In Wedrifid’s example: the AI’s penalty function for impact depends only on what happens over the next hour (I’m also skeptical that we can define this for an AI). So you ask it “Q?” and it writes “A” into a special register. Nothing happens in the outside world, except for the (I’m claiming unpredictable) influence of electron shuffling, so if you have some metric measuring distance between worlds which doesn’t penalize some spatially localized disturbance, or you can define “the inside of the box” (I’m skeptical about our ability to define either), the AI has only a small predictable effect.
Later, but still in the hour the AI cares about, we burn the AI, leaving only the special register containing A. Later still, after the hour, we read the answer. Now the AI has a large effect on the world, which we haven’t incentivized it to minimize. However, if the AI had goals like “Get us to press the reward button,” we’ve now limited its influence to whatever it could fit in the special register. If we had safe oracle queries we would then have averted one possible failure mode.
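(A hypothetical operator-side timeline for this walkthrough; the object `ai`, its methods, and the wait helper are stand-ins for illustration, not a real procedure.)

```python
def run_oracle_session(ai, question, wait_until_hour_elapsed):
    """Sketch of the ordering that matters in the example above."""
    register = ai.answer(question)   # within the hour: AI writes A into the special register
    ai.burn()                        # still inside the hour the AI cares about
    wait_until_hour_elapsed()        # the impact penalty only covers this hour
    return register                  # only after the hour do we read the answer
```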
Of course the AI could choose its answer maliciously, and I am very skeptical about the possibility of preventing that. It could also send out minions who lie in wait for an hour before taking over the world, but it is quite hard to do that without exerting some predictable effect on the world outside of the box, so if we trust the sterilization procedure to actually wipe the box (say, CNOTing a quantum random bit into every gate of the computer and bringing the residue out into the world within the hour) then it will be hard for the AI to get out.
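(A crude classical analogue of that sterilization step, assuming the box’s memory can be modelled as a list of bits; in the quantum version the CNOT with a fresh random bit plays the role of the XOR here.)

```python
import secrets

def sterilize(memory_bits):
    """XOR a fresh random bit into every cell.  The residue carried out of the
    box is then uniformly random, independent of whatever the AI computed, so
    the sterilized contents reveal nothing about its deliberations."""
    return [bit ^ secrets.randbits(1) for bit in memory_bits]
```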
Does Armstrong’s/your proposal reduce to “Give the AI a utility function that cares about nothing beyond the next hour, restrict its output to N bits, and blow up the rest of the computer afterward”? If not, can you give me an example of a scenario where the above fails but the more complex proposal succeeds? So far as I can tell, none of the purported “safetiness” in the example you just gave has anything to do with an impact measure.
I give you an hour and tell you to maximize the probability of [something we intend to use as a reward signal]. In paranoid scenarios, you break out of the box and kill all humans to get your reward signal. But now we have penalized that sort of failure of cooperation. This is just a formalization of “stay in the box,” and I’ve only engaged in this protracted debate to argue that ‘butterfly effects’ from e.g. electron shuffling, the usual objection to such a proposal, don’t seem to be an issue.
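(In toy form, the incentive structure is just a reward term minus a scaled impact term. The predictive `model`, its method names, and the weight `LAMBDA` below are my assumptions for illustration; `impact_penalty` could be the penalty sketched earlier.)

```python
LAMBDA = 100.0  # how heavily predictable influence outside the box is punished

def score(action, model, impact_penalty):
    """Toy objective: expected reward over the hour minus the scaled impact."""
    expected_reward = model.expected_reward(action)         # e.g. P(reward button gets pressed)
    impact = impact_penalty(model.world_given_x0(action),   # worlds where the switch X = 0
                            model.world_given_x1(action))   # versus worlds where X = 1
    return expected_reward - LAMBDA * impact
```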
In reality, I agree that ‘friendly’ AI is mostly equivalent to building an AI that follows arbitrary goals. So proposals for U which merely might be non-disastrous under ideal social circumstances don’t seem like they address the real concerns about AI risk.
Stuart’s goal is to define a notion of “minimized impact” which does allow an AI to perform tasks. I am more skeptical that this is possible.
As the current ultimate authority on AI safety, I am curious whether you would consider the safety profile of this oracle, as interpreted here, to be along the lines I describe there. That is, if it could actually be constructed as defined, it would be more or less safe with respect to its own operation, except for those pesky N bits and what external entities can do with them.
Unless I have missed something, the problems with attempting to implement such an AI as a practical strategy are:
It is an infinity plus one sword—you can’t just leave those lying around.
The research required to create the oracle is almost all of what it takes to create an FAI. It requires all of the research that goes into FAI except for CEV research—and if the oracle is able to answer questions that are simple math proofs then even a significant part of what constitutes a CEV implementation would be required.
Does Armstrong’s/your proposal reduce to “Give the AI a utility function that cares about nothing beyond the next hour, restrict its output to N bits, and blow up the rest of the computer afterward”?
The other important part that was mentioned (or, at least, implied) was that it is not allowed to (cares negatively about) influence the world outside of a spatial boundary within that hour, except via those N bits or via some threshold of incidental EM radiation and the energy consumption it is allocated. The most obvious things this would seem to prevent it from doing would be hacking a few supercomputers and a botnet to get some extra processing done in the hour or, for that matter, getting any input at all from external information sources. It is also unable to recursively self-improve (much), so that leaves us in the dark about how it managed to become an oracle in the first place.
Of course the AI could choose its answer maliciously, and I am very skeptical about the possibility of preventing that.
Why would it do that? I would say that if it is answering maliciously it is tautologically not the AI you defined. If it is correctly implemented to only care about giving correct answers and doing nothing outside the temporal and spatial limitations then it will not answer maliciously. It isn’t even a matter of preventing it from doing that so much as it just wouldn’t, by its very nature, do malicious things.
As a side note, creating an AI that is malicious is almost as hard as creating an AI that is friendly, for roughly the same reason that losing money to an idealized semi-strong efficient market is almost as hard as beating that same market: you need to have information that has not yet been supplied to the market and do the opposite of what it would take to beat it. We have little to fear that our AI creation will be malicious—what makes AIs scary, and is hard to prevent, is indifference.
I think that he meant indifferent rather than malicious, since his point makes a lot more sense in that case. We want the AI to optimize one utility function, but if we knew what that function was, we could build an FAI. Instead, we make an Oracle AI with an approximation to our utility function. Then, the AI will act so as to use its output to get us to accomplish its goals, which are only mostly aligned with ours. I think what Paul meant by a ‘malicious’ answer is one that furthers its goals in a way that happens to be to the detriment of ours.
I think that he meant indifferent rather than malicious
For the most part, yes. And my first paragraph reply represents my reply to the meaning of ‘unFriendly’ rather than just the malicious subset thereof.
Instead, we make an Oracle AI with an approximation to our utility function. Then, the AI will act so as to use its output to get us to accomplish its goals, which are only mostly aligned with ours.
That is an interpretation that directly contradicts the description given—it isn’t compatible with not caring about the future beyond an hour—or, for that matter, actually being an ‘oracle’ at all. If it was the intended meaning then my responses elsewhere would not have been cautious agreement but instead something along the lines of:
What the heck? You’re creating a complete FAI then hacking an extreme limitation onto the top? Well, yeah, that’s going to be safe—given that it is based on a tautologically safe thing, but it is strictly worse than the FAI without restrictions.
Instead, we make an Oracle AI with an approximation to our utility function. Then, the AI will act so as to use its output to get us to accomplish its goals, which are only mostly aligned with ours.
That is an interpretation that directly contradicts the description given—it isn’t compatible with not caring about the future beyond an hour—or, for that matter, actually being an ‘oracle’ at all.
I was thinking of some of those extremely bad questions that are sometimes proposed to be asked of an oracle AI: “Why don’t we just ask it how to make a lot of money?”, etc. Paul’s example of asking it to give the output that gets us to press the reward button falls into the same category (unless I’m misinterpreting what he meant there?).
My formulation of minimising difference is something like the following:
Assume A is the answer given by the oracle.
Predict what would happen to the world if the AI was replaced by a program that consisted of a billion NOPs and then something that output A. Call this W1.
When assessing different strategies, predict what would happen in the world given that strategy; call this W2.
Minimise the difference between W1 and W2.
Is this a more succinct formulation or is it missing something?
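(A rough sketch of that W1/W2 comparison in the same toy style; `predict_world` and `distance` stand in for the genuinely hard parts, and every name here is mine rather than part of the formulation.)

```python
def nop_program(answer):
    """The counterfactual replacement: a program that does nothing for the
    whole run and then simply emits the given answer A."""
    def run(_observations):
        return answer
    return run

def impact(strategy, answer, predict_world, distance):
    """The W1/W2 comparison: score the oracle on how different the world looks
    under its actual strategy versus the inert program that only outputs A."""
    w1 = predict_world(nop_program(answer))  # world if only the answer appears
    w2 = predict_world(strategy)             # world given the oracle's actual strategy
    return distance(w1, w2)                  # the quantity to be minimised
```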
So as to avoid duplication, my interpretation of what you say here is in my reply to Eliezer. Let that constitute the implied disclaimer that any claims about safety apply only to that version.
An oracle programmed correctly with these specifications seems to be safe—safe enough that, all else being equal, I’d be comfortable flicking the on switch. The else that is not equal is that N-bit output stream when humans are there. Humans are not reliably Friendly (both unreliable thinkers and potentially malicious), and giving them the knowledge of a superintelligence should be treated as though it is potentially as risky as releasing a recursively improving AI with equivalent levels of unreliability. So I’d flick the on switch if I could be sure that I would be the only one with access to it, as well as the select few others (like Eliezer) that I would trust with that power.
And can we talk about probability distributions only inside epistemic states computed by the AI, in a classical universe, for simplicity? (I’m skeptical that you’ve found an Oracle AI trick that works in quantum universes but not classical ones.)
Strongly agree. For any agent that doesn’t have eccentric preferences over complex quantum configurations, quantum uncertainty can be rolled up and treated the same way that ignorance-based uncertainty is treated. In my own comments I tried to keep things technically correct and use caveats so as not to equivocate between the two, but doing that all the time is annoying and probably distracting to readers.