Suppose the UFAI figures out a fake theorem that it would like us to believe, because believing it would lead us down a path of creating an AI it likes. If we were to ask it to prove this fake theorem, it would give back something that looks very much like a proof, constructed so that we would miss the point where it goes wrong. Maybe we require a machine-verifiable proof, but it takes advantage of a flaw in our automatic verifiers, or in the way we interpret the results. So how does it get us to ask about this fake theorem in the first place? It might manipulate its proof of a theorem we do ask about to inspire us to ask that question. It might respond to our request for a proof with, “Well, that is not quite right, but if you make these modifications...”. Keep in mind, this is a hostile intelligence that is way beyond us. It will take any opportunity to subtly manipulate us that it gets. And these are only the ideas that I, a mere human, could come up with.
I am not sure what sort of template you mean, but I suspect that it would have the same problem. Basically, any room for the UFAI to use its superior intelligence to help us is also room for betrayal.
Step back and check what you are arguing. The discussion is whether a scenario where the UFAI is helpful is at all plausible. Of course all sorts of things can go wrong. Of course it isn’t a good idea. Arguments of the form “but this could go wrong too” don’t advance the discussion a bit, as that is already understood.
I am not talking about some slim chance of something happening to go wrong. I am talking about a hostile superintelligence systematically arranging for us to make dangerous, critical mistakes. I don’t know exactly what the UFAI will do, any more than I can predict which move Kasparov will make in a chess game. But I can predict, with greater confidence than I could predict Kasparov winning a chess game against a mere grandmaster, that if we build and use a UFAI, we will be the bug and the UFAI will be the windshield. Splat!
If your standard here is whether something is “at all plausible”, then the discussion is trivial; nothing has probability 0. However, none of the proposals for using UFAI discussed here are plans from which I expect positive utility. (Which is putting it mildly: I expect them all to result in the annihilation of humanity with high probability, and to produce good results only through the conjunction of many slim chances of things happening to go right.)
You should speak for yourself about what is already understood and what you think is not a good idea. Warrigal seems to think that using UFAI can be a good idea.
When you figure out how to arrange a use of UFAI that is a good idea, the whole contraption becomes a kind of FAI.
Good, that is a critical insight. I will explore some of the implications of considering what kind of FAI it is.
By “FAI”, we refer to an AI that we can predict with high confidence will be friendly, based on our deep understanding of how it works. (Deep understanding is not actually part of the definition, but without it we are not likely to achieve high, well-calibrated confidence.)
So a kind of FAI that is built out of a UFAI subject to some constraints (or other means of making it friendly) would require us to understand the constraints and the UFAI well enough to predict that the UFAI would not be able to break the constraints, which it would of course try to do. (One might imagine constraints strong enough that no level of intelligence could break them, but I find this unlikely, since a major vulnerability is the possibility of the UFAI feeding us misinformation about issues we don’t understand.)
Which is easier: to understand the UFAI (which we built without understanding it) and its constraints well enough to predict that the system will be “friendly” (and to actually produce a system, including the UFAI, for which that prediction is correct), or to directly build an FAI that is designed from the ground up to be predictably friendly? And which system is more likely to wipe us out if we are wrong, rather than simply fail to do anything interesting?
I prefer the solution where friendliness is not a patch, but a critical part of the intelligence itself.