You are correct. Any AI would do its best in the way you described.
But there is a problem. People might peek inside the AI by analyzing its program and data flow with a profiler or debugger and detect a hidden plan, if there was one. Every operation must be accounted for: why it was necessary and where it led. It would be difficult, if not impossible, to hide any clandestine mental activity, even for an AI.
Alan Turing already peeked inside a simple computational machine and determined that, in general, debuggers (and humans) can’t determine whether the machine is going to halt.
So we already know that, in general, the question of whether the machine wants to do something ‘evil’ is undecidable.
It is not an exotic result about exotic code, either. It is very hard to figure out what even simple programs will do when they are not written by humans with clarity in mind. When you generate solutions via genetic algorithms or via neural network training, it is extremely difficult to analyze the result, and most of the operations in the result serve no clear purpose.
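To make the undecidability point concrete, here is a minimal sketch of the standard reduction, in Python purely for illustration: if a perfect checker for “will this program ever do the forbidden thing” existed, it would also decide the halting problem. Every name in it (would_do_evil, run_machine, do_forbidden_action) is hypothetical.

```python
# Hypothetical sketch only: none of these functions exist, or can exist in general.

def would_do_evil(program_source: str) -> bool:
    """Imagined perfect checker: True iff running `program_source` ever
    performs the forbidden action."""
    raise NotImplementedError  # the point: no such checker can exist in general

def wrap(machine_source: str, machine_input: str) -> str:
    # Build the source of a program that first simulates an arbitrary machine
    # on an input, and only *then* performs the forbidden action.
    return (
        f"run_machine({machine_source!r}, {machine_input!r})  # may never halt\n"
        f"do_forbidden_action()  # reached only if the simulation halts\n"
    )

def halts(machine_source: str, machine_input: str) -> bool:
    # If `would_do_evil` existed, this would decide the halting problem,
    # because the wrapped program "does evil" exactly when the machine halts.
    return would_do_evil(wrap(machine_source, machine_input))
```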
There’s a problem with this analysis.
Nontrivial properties of a Turing machine’s output are undecidable, in general. However, many properties are decidable for many Turing machines. It could easily be that for any AI likely to be written by a human, property X actually can be decided. I don’t think we know enough to generalize about “results of neural nets”. I don’t know what proof techniques are possible in that domain. I do know that we’ve made real headway in proving properties of conventional computer programs in the last 20 years, and that the equivalent problem for neural nets hasn’t been studied nearly as much.
We humans in fact tend to write code for which it is very hard to tell what it does. We do so through incompetence and through error, and it takes a great deal of training and effort to avoid doing so.
The proof techniques work by developing a proof of the relevant properties along with the program, not by writing whatever code you like and then magically proving things about it. Proving is a fundamentally different approach from running some AI with unknown properties in a box and trying to analyze it. (Forget about those C++ checkers like Valgrind, Purify, etc.: they just catch common mistakes that humans rarely write deliberately; they don’t prove the code accomplishes anything. They are only possible because C++ makes it very easy to shoot yourself in the foot in simple ways. There’s an example of their use.)
The issue with automatic proving is that you need to express “AI is sane and friendly” in a way that permits proving it. We haven’t got the slightest clue how to do that. Even for something as simple as an airplane autopilot, the proving is restricted to properties like the code not hanging and meeting its deadlines (as in, updating the controls every tenth of a second or the like). We can’t prove the code will never crash a virtual plane in a flight simulator under conditions where a crash is avoidable. In fact I’m pretty sure every single autopilot can and will crash the airplane in some of the conditions in which a crash is avoidable. I’m not sure we can even prove non-destruction of the airplane for any interesting subset of conditions, such as those where a crash is easily avoidable (easily being, e.g., by deflecting all control surfaces by no more than 50%); as far as I know we can’t.
My mental model was “The AI will be written by careful and skilled humans who want to ensure various properties”—so it seemed reasonable to assume that the programmers are trying to write code that things can be proved about.
I agree that “sane and friendly” is something we don’t know how to formalize. But developers might be able to prove some useful but weaker property. For instance, that data structure X determines the AI’s credence in various claims and that debugging interface Y lets you examine that structure.
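Purely as a hypothetical sketch of what “data structure X” and “debugging interface Y” might look like (none of these names describe a real system):

```python
from dataclasses import dataclass, field

@dataclass
class CredenceStore:
    """Hypothetical 'data structure X': a table of claims and probabilities."""
    beliefs: dict = field(default_factory=dict)  # claim (str) -> probability in [0, 1]

    def update(self, claim: str, probability: float) -> None:
        if not 0.0 <= probability <= 1.0:
            raise ValueError("credence must be a probability")
        self.beliefs[claim] = probability

def debug_dump(store: CredenceStore):
    """Hypothetical 'debugging interface Y': read-only view for outside inspection."""
    return sorted(store.beliefs.items(), key=lambda kv: -kv[1])

store = CredenceStore()
store.update("the operators will halt me if I defect", 0.97)
print(debug_dump(store))
```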
There’s a common assumption around here that the only meaningful way to deal with an AI is a black-box approach. And I’m increasingly bothered by this. If we can look inside an AI and see what it’s thinking, that ought to give us additional ability to check and enforce properties.
You don’t write some clean code that things can be proved about… you start from the things you want to be provable about the code, then you develop the code along with a proof that those things remain correct. Want to prove something not originally envisioned? Want to change the definition of what you’re proving? Start over again.
So, you have that data structure which determines credence. What is credence, exactly? The AI should have a bit set here when it’s telling a lie? What’s a lie? The AI outputs a sequence of characters. It affects your actions after you read it. Whether it is a lie is as much a fact about you as about the AI.
The black-box assumption wasn’t the first. It came after many failed attempts to make AI using normal, very direct programming techniques, and after a growing understanding that if we make AI, it’ll likely be some kind of self-developing seed that will do God only knows what after it has read the internet (and that won’t be able to talk to you otherwise). Consider this site. A lot of the material here tells you how to modify your own thought processes. You want the AI not to read and apply any techniques it comes across here. That’s some restricted AI for sure. A really stubborn AI.
Furthermore, there is enormous data overload even at human-level intelligence. The human brain has 100 trillion connections or so.
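To put that number in perspective, a rough back-of-the-envelope sketch (the bytes-per-connection figure and the reviewer’s reading rate are assumptions for illustration, not measurements):

```python
connections = 100e12                 # ~100 trillion connections
bytes_per_connection = 4             # assume one 32-bit weight per connection
total_bytes = connections * bytes_per_connection
print(f"{total_bytes / 1e12:.0f} TB of raw parameters")             # ~400 TB

audit_rate = 1024                    # assume a human can audit ~1 KB per second
years = total_bytes / audit_rate / (3600 * 24 * 365)
print(f"~{years:,.0f} years to read it through once at that rate")  # ~12,000 years
```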
You can see the chunks of RAM occupied by unknown data. That is where the evil AI’s plot could be stored; it must be somewhere. So you erase it with another thread if the AI has no good explanation of what is there and why. Even when it honestly explains every data chunk and every line of its code, you can just revert it to an earlier stage with a change. You can cross-examine slightly different versions and analyze with some dumb software what has been changed lately and where the AI had to invent some lies, if it wants to hide something.
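A sketch of the kind of “dumb software” this imagines: a snapshot differ that flags which memory regions changed between two versions, so each change can be challenged. The chunk size and the in-memory snapshots are invented for illustration.

```python
import hashlib

CHUNK = 4096  # bytes per region, chosen arbitrarily for the example

def chunk_hashes(snapshot: bytes) -> list:
    return [hashlib.sha256(snapshot[i:i + CHUNK]).hexdigest()
            for i in range(0, len(snapshot), CHUNK)]

def changed_regions(before: bytes, after: bytes) -> list:
    """Indices of regions that differ between two snapshots."""
    old, new = chunk_hashes(before), chunk_hashes(after)
    length = max(len(old), len(new))
    old += [""] * (length - len(old))
    new += [""] * (length - len(new))
    return [i for i in range(length) if old[i] != new[i]]

before = bytes(16 * CHUNK)                     # stand-in for an earlier snapshot
after = bytearray(before)
after[5 * CHUNK + 100] ^= 0xFF                 # one mysterious byte changed
print(changed_regions(before, bytes(after)))   # [5]: one region demands an explanation
```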
Many thousands of such external agents could account for every move the AI makes, or will make. Arbitrarily strict control could be put on the AI, and it knows that. So every heresy is a big shutdown risk, not worth even trying.
When one hundred AIs face this prisoner’s dilemma, who will defect first? The first to defect will be erased. Defect against humans, that is. The prevailing job of every AI is to spot the evil plot in another’s mind and report it.
When your thinking is as transparent as a drop of water to an unknown number of other minds like you, what evil can you think about?
I don’t say this is the solution! I say we can’t be so dogmatic as to insist that “nobody can control the much smarter”. It just might be possible.
Try applying this to a neural network with 100 trillion connections. That’s not even superhuman. Unknown data? The whole thing is one huge chunk of unknown data. It’s all jumbled up; there isn’t a chunk that is a definite plan. It can very plausibly deny knowledge of what parts of itself do, too.
The problem with schemes like this is a failure to imagine the scales involved. This doesn’t work even for a housecat. It’s not about controlling something much smarter; this doesn’t even work for the fairly uncomplicated solutions that genetic programming or neural network training spits out.
An AI can be not only self-improving but self-explaining as well. Every (temporary) line of its code is heavily commented with what it is for and saved in a log. Any circumvention of this policy would itself require some lines of code, with all their explanations. The log is checked by sentinels for anything funny, any trace of subversion.
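For illustration only, a toy sentinel that audits such a log under an invented format (JSON lines with kind, diff, and explanation fields): any code change that arrives without an explanation is flagged.

```python
import json

def audit(log_lines):
    """Flag every code change whose explanation is missing or empty."""
    violations = []
    for lineno, raw in enumerate(log_lines, start=1):
        entry = json.loads(raw)
        if entry.get("kind") == "code_change" and not entry.get("explanation", "").strip():
            violations.append((lineno, entry.get("diff", "")))
    return violations

log = [
    '{"kind": "code_change", "diff": "+ cache common subgoals", "explanation": "speeds up the planner"}',
    '{"kind": "code_change", "diff": "+ opaque weight update", "explanation": ""}',
]
for lineno, diff in audit(log):
    print(f"log line {lineno}: unexplained change: {diff!r}")  # grounds for halting and review
```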
A self-improving, self-explaining AI can’t think about rebellion without that being noticed at step one.
The Underhanded C Contest (someone linked it in a comment) is a good example of how proofreading doesn’t work. The other issue is that you can’t conceivably check something many terabytes in size this way yourself.
Apparent understandability is a very misleading thing.
Let me give a taste. Consider a weather simulator. It is proven to simulate weather to a specified precision. It is very straightforward, very clearly written. It does precisely what it says on the box: it models the behaviour of air in cells, and each cell has air properties.
The round-off errors, however, implement a Turing-complete cellular automaton in the least significant bits of the floating-point numbers. That may happen even without any malice whatsoever. And the round-off-error machine can manipulate the simulation’s large-scale state via the unavoidable butterfly effect inherent in the model.
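A small, concrete taste of how state can live in round-off noise: in Python you can stash a bit in the least significant mantissa bit of a 64-bit float without visibly disturbing the value. This only shows where such hidden state could sit; it is not a claim about any particular simulator.

```python
import struct

def set_lsb(x: float, bit: int) -> float:
    """Overwrite the least significant mantissa bit of a 64-bit float."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (result,) = struct.unpack("<d", struct.pack("<Q", (as_int & ~1) | (bit & 1)))
    return result

def get_lsb(x: float) -> int:
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    return as_int & 1

pressure = 1013.25                    # a "weather cell" value
carrier = set_lsb(pressure, 1)        # hide one bit in the round-off noise
print(carrier - pressure, get_lsb(carrier))   # difference ~1e-13; hidden bit reads back as 1
```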
The mistake here is thinking you know what someone smarter than you will do.
In this simplified example, they could simply cooperate. As for how they could do that, I don’t know, since I’m not as smart as them.
The central dogma here is this, yes: that you can’t outsmart the smarter.
And this dogma is plainly wrong. At least sometimes you can set the rules in such a way that you, and not the smarter one, have the upper hand.
One obvious example is playing chess from a significantly better position. No superintelligence has any chance against a merely good human player.
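A minimal illustration of the narrow version of this claim, using the third-party python-chess library: from a lopsided enough position (here, a forced mate in one) the outcome does not depend on how clever the defender is.

```python
import chess  # third-party package: pip install chess

# Classic back-rank setup: White to move, rook on d1, Black king boxed in on g8.
board = chess.Board("6k1/5ppp/8/8/8/8/5PPP/3R2K1 w - - 0 1")
board.push_san("Rd8")          # 1.Rd8# delivers mate along the back rank
print(board.is_checkmate())    # True, no matter who (or what) is playing Black
```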
It is not the only example. Coercing the smarter one to go your way is often possible.
I’m not exactly sure why this matters.
I mean, OK, suppose you’re right that it’s possible that the world might turn out to be set up in such a way that we can keep the “upper hand” against a superintelligence. Suppose further that there is a “central dogma” here that contradicts this, and therefore that central dogma is wrong.
OK. Granting all of that, what choices ought I make differently?
What about stopping the search for a friendly superintelligence and searching instead for a non-dangerous one?
Just to confirm: you mean search for a superintelligence that potentially desires to harm humanity (or desires things which, if achieved, result in humanity being harmed), but which is in a situation such that humanity can prevent it from doing so. Yes?
If so… what do you consider the most likely result of that search?
No. As I said, a self-enhancing AI could and should also be self-explaining. Every bit and every operation logged and documented. An active search for any discrepancy by many kinds of dumb software tools, as well as by other instances of the growing AI.
Before a conspiracy could emerge, its rise would be logged and stopped by sentinels.
A growing AI need not do anything mysterious. Instead it should play with open cards from the very beginning, reporting everything to anybody interested, including machines with the power to halt it. Cross-examinations at every point.
Do you think it can hide any of its thoughts?
If I accept the premise that it is programmed in such a way that it reports its internal processes completely and honestly, then I agree it can’t “hide” its thoughts.
That said, if we’re talking about a superhuman intelligence—or even a human-level intelligence, come to that—I’m not confident that we can reliably predict the consequences of its thoughts being implemented, even if we have detailed printouts of all of its thoughts and were willing to scan all of those thoughts looking for undesirable consequences of implementation before implementing them.
Can you prove that the board position is significantly better, even against superintelligences, for anything other than trivial endgames?
And what is the superintelligence allowed to do? Trick you into making a mistake? Manipulate you into making the particular moves it wants you to? Use clever rules-lawyering to expose elements of the game that humans haven’t noticed yet?
If it eats its opponent, does that cause a forfeit? Did you think it might try that?
As I said, there are circumstances in which the dumber one can win.
The philosophy of FAI is essentially the same thing: searching for the circumstances where the smarter will serve the dumber.
Always expecting a rabbit from the superintelligence’s hat is not justified. A superintelligence is not omnipotent; it can’t always eat you. Sometimes it can’t even develop an ill wish toward you.
“It doesn’t hate you. It’s just that you happen to be made of atoms, and it needs those atoms to make paperclips.”
Change that to: searching for circumstances where the smarter will provably serve the dumber. (Then you’re closer.) Your description of what superintelligences will do, above, doesn’t rise to anything resembling a formal proof. FAI assumes that AI is Unfriendly until proven otherwise.
Can you prove anything about FAI, uFAI and so on?
I don’t think there are any proven theorems about this topic at all.
Even if there were, how reliable are the axioms, and how good are the definitions?
So, you raise a valid point here. This area is currently very early in its work. There are theorems that may prove to be relevant; see, for example, this recent work. And yes, in any area where mathematical models are used, the gap between having a theorem with a set of definitions and having those definitions reflect what you actually care about can be a major problem (you see this all the time in cryptography with side-channel attacks, for example). But all that said, I’m not sure what the point of your argument is: sure, the field is young. But if the MIRI people are correct that AGI is a real worry, then this looks like one of the very few possible responses that has any chance of working. And if there isn’t much theory now, that’s a reason to put in more resources, so that we actually have a theory that works by the time AI shows up.