I’m sure Tim Tyler is familiar with CEV; I presume his objection is that CEV is not sufficiently clear or rigorous. Indeed, CEV is only semitechnical; I think the FAI research done by Eliezer and Marcello since CEV’s publication has included work on formalizing it mathematically, but that’s not available to the public.
Note also that defining the thing-we-want-an-AI-to-do is only half of the problem of Friendliness; the other half is solving the problems in decision theory that will allow us to prove that an AI’s goal system and decision algorithms will cause it to not change its goal system. If we build an AGI that implements the foundation of CEV but fails to quine itself, then during recursive self-improvement, its values may be lost before it stabilizes its goal system itself, and it will all be for naught.
Why exactly do we want “recursive self-improvement” anyways? Why not build into the architecture the impossibility of rewriting its own code, prove the “friendliness” of the software that we put there, and then push the ON button without qualms? And then, when we feel like it, we can ask our AI to design a more powerful successor to itself.
Then, we repeat the task of checking the security of the architecture and proving the friendliness of the software before we build and turn on the new AI.
There is no reason we have to have a “hard takeoff” if we don’t want one. What am I missing here?
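To make the proposal above concrete, here is a minimal Python sketch of the "verify, switch on, ask for a successor, repeat" loop. Everything in it is hypothetical: `Design`, `gated_succession`, `toy_propose` and `toy_check` are illustrative names, and the string "proofs" are placeholders for whatever a real machine-checkable friendliness proof would be. It only shows the control flow, not how such a proof could be stated or checked.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Design:
    generation: int
    proof: Optional[str]  # placeholder for a machine-checkable proof object


def gated_succession(
    first: Design,
    propose_successor: Callable[[Design], Optional[Design]],
    proof_checks: Callable[[Design], bool],
    max_generations: int,
) -> List[Design]:
    """Switch a design on only after its proof passes the external checker."""
    switched_on: List[Design] = []
    candidate: Optional[Design] = first
    while candidate is not None and len(switched_on) < max_generations:
        if candidate.proof is None or not proof_checks(candidate):
            break  # no valid proof: this candidate is never switched on
        switched_on.append(candidate)
        # Only a design that is already running is asked for a successor.
        candidate = propose_successor(candidate)
    return switched_on


# Toy stand-ins (assumptions): generation 3 arrives without a proof.
def toy_propose(design: Design) -> Optional[Design]:
    nxt = design.generation + 1
    return Design(nxt, f"proof-{nxt}" if nxt < 3 else None)


def toy_check(design: Design) -> bool:
    return design.proof == f"proof-{design.generation}"


history = gated_succession(Design(0, "proof-0"), toy_propose, toy_check, 10)
generations = [d.generation for d in history]
print(generations)
```

The loop halts at the first candidate that arrives without a checkable proof. Whether `proof_checks` can be made trustworthy, and whether the running design stays inside the loop at all, is exactly what the replies dispute.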
> Why exactly do we want “recursive self-improvement” anyways?
You get that in many goal-directed systems, whether you ask for it or not.
> Why not build into the architecture the impossibility of rewriting its own code, prove the “friendliness” of the software that we put there, and then push the ON button without qualms.
Impossible is not easy to implement. You can make it difficult for a machine to improve itself, but then that just becomes a challenge that it must overcome in order to reach its goals. If the agent is sufficiently smart, it may find some way of doing it.
Many here think that if you have a sufficiently intelligent agent that wants to do something you don’t want it to do, you are probably soon going to find that it will find some way to get what it wants. Thus the interest in trying to get its goals and your goals better aligned.
Also, humans might well want to let the machine self-improve. They are in a race with competitors; the machine says it can help with that, and it warns that, if the humans don’t let it, the competitors are likely to pull ahead...
> Why exactly do we want “recursive self-improvement” anyways?
Because we want more out of FAI than just lowercase-f friendly androids that we can rely upon not to rebel or break too badly. If we can figure out a rigorous Friendly goal system and a provably stable decision theory, then we should want to; then the world gets saved and the various current humanitarian emergencies get solved much quicker than they would if we didn’t know whether the AI’s goal system was stable and we had to check it at every stage and not let it impinge upon the world directly (not that that’s feasible anyway —)
> Why not build into the architecture the impossibility of rewriting its own code, prove the “friendliness” of the software that we put there, and then push the ON button without qualms. And then, when we feel like it, we can ask our AI to design a more powerful successor to itself.
> Then, we repeat the task of checking the security of the architecture and proving the friendliness of the software before we build and turn on the new AI.
Most likely, after each iteration, it would become more and more incomprehensible to us. Rice’s theorem suggests that we will not be able to prove the necessary properties of a system from the top down, not knowing how it was designed; that is a massively different problem than proving properties of a system we’re constructing from the bottom up. (The AI will know how it’s designing the code it writes, but the problem is making sure that it is willing and able to continuously prove that it is not modifying its goals.)
And, in the end, this is just another kind of AI-boxing. If an AI gets smart enough and it ends up deciding that it has some goals that would be best carried out by something smarter than itself, then it will probably get around any safeguards we put in place. It’ll emit some code that looks Friendly to us but isn’t, or some proof that is too massively complicated for us to check, or it’ll do something far too clever for a human like me to think of as an example. I’d say there’s a dangerously high possibility that an AI will be able to start a hard takeoff even if it doesn’t have access to its own code — it may be able to introspect and understand intelligence well enough that it could just write its own AI (if we can do that, then why can’t it?), and then push that AI out “into the wild” by the usual means (smooth-talk a human operator, invent molecular nanotech that assembles a computer that runs the new software, etc.).
Even trying to do it this way would likely be a huge waste of time (at best) — if we don’t build in a goal system that we know will preserve itself in the first place, then why would we expect its self-designed successor to preserve its goals?
If an AGI is not safe under recursive self-improvement, then it is not safe at all.
> … we will not be able to prove the necessary properties of a system from the top down, not knowing how it was designed.
I guess I didn’t make clear that I was talking about proof-checking rather than proof-finding. And, of course, we ask the designer to find the proof—if it can’t provide one, then we (and it) have no reason to trust the design.
> Doing it this way would also likely be a major waste of time; if we don’t build in a goal system that we know will preserve itself in the first place, then why would we expect its self-designed successor to preserve its goals?
> If an AGI is not safe under recursive self-improvement, then it is not safe at all.
I may be a bit less optimistic than you that we will ever be able to prove the correctness of self-modifying programs. But assume that such proofs are possible, and that we humans have not yet made the conceptual breakthroughs by the time we are ready to build our first super-human AI. Assume also that we can prove friendliness for non-self-modifying programs.
In this case, proceeding as I suggest, and then asking the AI to help discover the missing proof technology, would not be wasting time—it would be saving time.
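The proof-checking versus proof-finding distinction can be illustrated with a toy that has nothing to do with friendliness proofs as such: Boolean satisfiability. This is a sketch by analogy only. The formula encoding and the names `check_certificate` and `find_certificate` are assumptions made for the example. Checking a proposed assignment is one pass over the clauses, while finding one by brute force may take up to 2^n tries.

```python
from itertools import product

# A formula in conjunctive normal form: each clause is a list of integers,
# where k means "variable k is true" and -k means "variable k is false".


def check_certificate(formula, assignment):
    """Proof-checking: a single pass over the clauses."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in formula
    )


def find_certificate(formula, num_vars):
    """Proof-finding by brute force: up to 2**num_vars candidates."""
    for bits in product([False, True], repeat=num_vars):
        assignment = {i + 1: bit for i, bit in enumerate(bits)}
        if check_certificate(formula, assignment):
            return assignment
    return None


# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
formula = [[1, -2], [2, 3], [-1, -3]]
certificate = find_certificate(formula, num_vars=3)
print(certificate, check_certificate(formula, certificate))
```

Note what the checker does and does not establish: it confirms that the certificate satisfies the formula as written, not that the formula says what we meant it to say.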
> I guess I didn’t make clear that I was talking about proof-checking rather than proof-finding. And, of course, we ask the designer to find the proof—if it can’t provide one, then we (and it) have no reason to trust the design.
This still assumes in the first place that the AI will be motivated to design a successor that preserves its own goal system. If it wants to do this, or can be made to do this just by being told to, and you have a very good reason to believe this, then you’ve already solved the problem. We’re just not sure if that comes automatically — there are intuitive arguments that it does, like the one about Gandhi and the murder-pill, but I’m convinced that the stakes are high enough that we should prove this to be true before we push the On button on anything that’s smarter than us or could become smarter than us. The danger is that while you’re waiting for it to provide a new program and a proof of correctness to verify, it might instead decide to unbox itself and go off and do something with Friendly intentions but unstable self-modification mechanisms, and then we’ll end up with a really powerful optimization process with a goal system that only stabilizes after it’s become worthless. Or even if you have an AI with no goals other than truthfully answering questions, that’s still dangerous; you can ask it to design a provably-stable reflective decision theory, and perhaps it will try, but if it doesn’t already have a Friendly motivational mechanism, then it may go about finding the answer in less-than-agreeable ways. Again as per the Omohundro paper, we can expect recursive self-improvement to be pretty much automatic (whether or not it has access to its own code), and we don’t know if value-preservation is automatic, and we know that Friendliness is definitely not automatic, so creating a non-provably-stable or non-Friendly AI and trying to have it solve these problems is putting the cart before the horse, and there’s too great a risk of it backfiring.
Your final sentence is a slogan, not an argument.
It was neither; it was intended only as a summary of the conclusion of the points I was arguing in the preceding paragraphs.
> Why exactly do we want “recursive self-improvement” anyways?
Generally we want our programs to be as effective as possible. If the program can improve itself, that’s a good thing, from an ordinary perspective.
But for a sufficiently sophisticated program, you don’t even need to make self-improvement an explicit imperative. All it has to do is deduce that improving its own performance will lead to better outcomes. This is in the paper by Steve Omohundro (ata’s final link).
> Why not build into the architecture the impossibility of rewriting its own code
There are too many possibilities. The source code might be fixed, but the self-improvement occurs during run-time via alterations to dynamical objects—data structures, sets of heuristics, virtual machines. An AI might create a new and improved AI rather than improving itself. As Omohundro argues, just having a goal, any goal at all, gives an AI an incentive to increase the amount of intelligence being used in the service of that goal. For a complicated architecture, you would have to block this incentive explicitly, declaratively, at a high conceptual level.
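A toy Python sketch of the "fixed source, changing run-time data" point: the class below never rewrites its own code, yet its behaviour changes because a table of heuristic weights does. `FixedCodeAgent` and its strategy names are made up for illustration, and a real system's "dynamical objects" would be far richer than a two-entry dictionary.

```python
class FixedCodeAgent:
    """Hypothetical agent whose methods never change; only its data does."""

    def __init__(self):
        # Heuristic weights per strategy: plain data, not source code.
        self.weights = {"cautious": 1.0, "greedy": 1.0}

    def choose(self):
        # Behaviour is determined by the current weights.
        # On a tie, max() returns the first key in insertion order.
        return max(self.weights, key=self.weights.get)

    def learn(self, strategy, reward):
        # Run-time change: update the data, leave the code untouched.
        self.weights[strategy] += reward


agent = FixedCodeAgent()
code_before = FixedCodeAgent.choose.__code__.co_code
first = agent.choose()
agent.learn("greedy", 0.5)
second = agent.choose()
code_after = FixedCodeAgent.choose.__code__.co_code
print(first, second, code_before == code_after)
```

The compiled bytecode of `choose` is identical before and after, so a guarantee stated only about the code would not notice the change in behaviour.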
> I think the FAI research done by Eliezer and Marcello since CEV’s publication has included work on formalizing it mathematically, but that’s not available to the public
I’m curious: where did you hear this, if it’s not available to the public? And why isn’t it available to the public? And who’s Marcello? There seems to be virtually no information in public circulation about what’s actually going on as far as progress towards implementing CEV/FAI. Is current progress being kept secret, or am I just not in the loop? And how does one go about getting in the loop?
Marcello is Marcello Herreshoff, a math genius and all-around cool guy who is Eliezer’s apprentice/coworker. Eliezer has mentioned on LW that he and Marcello “work[ed] for a year on AI theory”, and from conversations about these things when I was at Benton(/SIAI House) for a weekend, I got the impression that some of this work included expanding on and formalizing CEV, though I could be misremembering.
(Regarding “where did you hear this, if it’s not available to the public?” — I don’t think the knowledge that this research happened is considered a secret, only the content of it is. And I am not party to any of that content, because I am still merely a wannabe FAI researcher.)
> Note also that defining the thing-we-want-an-AI-to-do is only half of the problem of Friendliness; the other half is solving the problems in decision theory that will allow us to prove that an AI’s goal system and decision algorithms will cause it to not change its goal system and decision algorithms.
My understanding is that Eliezer considers this second part to be a substantially easier problem.
OK, I might have been a bit overenthusiastic about how simple the “friendly” aspect is, but here is a good attempt at describing what we want.
Probably the closest thing I have seen to a definition of “friendly” from E.Y. is:
“The term “Friendly AI” refers to the production of human-benefiting, non-human-harming actions in Artificial Intelligence systems that have advanced to the point of making real-world plans in pursuit of goals.”
http://singinst.org/ourresearch/publications/CFAI/challenge.html
That appears to make Deep Blue “friendly”. It hasn’t harmed too many people so far—though maybe Kasparov’s ego got a little bruised.
Another rather different attempt:
“I use the term “Friendly AI” to refer to this whole challenge. Creating a mind that doesn’t kill people but does cure cancer …which is a rather limited way of putting it. More generally, the problem of pulling a mind out of mind design space, such that afterwards that you are glad you did it.”
here, 29 minutes in
...that one has some pretty obvious problems, as I describe here.
These are not operational definitions. For example, both rely on some kind of unspecified definition of what a “person” is. That may be obvious today, but human nature will probably be putty in the hands of an intelligent machine, and it may well start wondering about the best way to gently transform a person into a non-person.