Hello! quinesie here. I discovered LessWrong after being linked to HPMoR, enjoying it, and then following the links back to the LessWrong site itself. I’ve been reading for a while, but, as a rule, I don’t sign up with a site unless I have something worth contributing. After reading Eliezer’s Hidden Complexity of Wishes post, I think I have that:
In the post, Eliezer describes a device called an Outcome Pump, which resets the universe repeatedly until the desired outcome occurs. He then goes on to describe why this is a bad idea, since it can’t understand what it is that you really want, in a way that is analogous to an unFriendly AI being programmed to naively maximize something (like paper clips) that humans say they want maximized, even when what they really want is something much more complex, something they have trouble articulating well enough to describe to a machine.
My idea, then, is to take the Outcome Pump and make a 2.0 version that uses the same mechanism as the original Outcome Pump, but with a slightly different trigger: the Outcome Pump resets the universe whenever a set period of time passes without an “Accept Outcome” button being pressed to prevent the reset. To convert back to AI theory, the analogous AI would be one which simulates the world around it, reports the projected outcome to a human, and then waits for the result to be accepted or rejected. If accepted, it implements the solution. If rejected, it goes back to the drawing board and crunches numbers until it arrives at the next non-rejected solution.
This design could of course be improved upon by adding parameters to automatically reject outcomes which are obviously unsuitable, or which contain events that, ceteris paribus, we would prefer to avoid, just as with the standard Outcome Pump and its analogue in unFriendly AI. The chief difference between the two is that the failure mode for version 2.0 isn’t a catastrophic “tile the universe with paper clips/launch mother out of burning building with explosion” but rather the far more benign “submit utterly inane proposals until given more specific instructions or turned off”.
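To make that loop concrete, here is a minimal sketch of the AI-side analogue, using my own invented names (generate_candidate_plan, simulate, auto_reject, human_accepts, and so on are placeholders, not anything from the post): generate a candidate, simulate the projected outcome, automatically discard anything obviously unsuitable, and implement only after an explicit “Accept Outcome”.

```python
import random

def generate_candidate_plan():
    # Stand-in for the planner producing its next candidate solution.
    return {"id": random.randint(0, 10**6), "risk": random.random()}

def simulate(plan):
    # Stand-in for modelling the projected outcome of a plan.
    return {"plan_id": plan["id"], "projected_harm": plan["risk"]}

def auto_reject(outcome, max_harm):
    # The "parameters" mentioned above: throw out obviously unsuitable
    # outcomes before a human ever sees them.
    return outcome["projected_harm"] > max_harm

def human_accepts(outcome):
    # Stand-in for reporting the projected outcome and waiting for the
    # "Accept Outcome" button.
    answer = input(f"Accept projected outcome {outcome}? [y/N] ")
    return answer.strip().lower() == "y"

def propose_loop(max_harm=0.2, max_rounds=100):
    for _ in range(max_rounds):
        outcome = simulate(generate_candidate_plan())
        if auto_reject(outcome, max_harm):
            continue                      # filtered out automatically
        if human_accepts(outcome):
            return outcome                # only now would the plan be implemented
        # rejected by the human: back to the drawing board
    return None                           # the benign failure mode: it just gives up

if __name__ == "__main__":
    print(propose_loop())
```

The point the sketch is meant to highlight is that the only thing done without human sign-off is filtering; implementation always waits for the button.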
This probably has some terrible flaw in it that I’m overlooking, of course, since I am not an expert in the field, but if so, the flaw isn’t obvious enough for a layman to see. Or, just as likely, someone else came up with it first and published a paper describing exactly this. So I’m asking here.
This creates a universe where the Accept Outcome button gets pressed, not necessarily one that has a positive outcome. E.g., if the button was literally a button, something might fall onto it; or if it was a state in a computer, a cosmic ray might flip a bit.
True enough, but once we step outside of the thought experiment and look at the idea it is intended to represent, “button gets pressed” translates into “humanity gets convinced to accept the machine’s proposal”. Since the AI-analogue device has no motives or desires save to model the universe as perfectly as possible, P(A bit flips in the AI that leads to it convincing a human panel to do something bad) is necessarily lower than P(A bit flips anywhere that leads to a human panel deciding to do something bad), and is discountable for the same reason we ignore hypotheses like “Maybe a cosmic ray flipped a bit to make it do that?” when figuring out the source of computer errors in general.
P(A bit flips in the AI that leads to it convincing a human panel to do something bad) is always less than P(A bit flips anywhere that leads to a human panel deciding to do something bad), since the former event is a subset of the latter.
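Spelled out in symbols (my notation, not anything from the thread), this is just monotonicity of probability: writing A for “a bit flips in the AI and leads it to convince a human panel to do something bad” and B for “a bit flips anywhere and leads a human panel to do something bad”, A is contained in B, so

$$A \subseteq B \implies P(A) \le P(B), \qquad \text{since } P(B) = P(A) + P(B \setminus A) \ \text{and}\ P(B \setminus A) \ge 0.$$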
The point of the cosmic ray statement is not so much that it might actually happen; it is just demonstrating that the Outcome-Pump-2.0 universe doesn’t necessarily have a positive outcome, only that it is a universe in which the “Outcome” has been accepted, and that the Outcome being accepted doesn’t imply that the universe is one we like.
In this document from 2004, Yudkowsky describes a safeguard to be added “on top of” programming Friendliness: a Last Judge. The idea is that the FAI’s goal is initially only to compute what an FAI should do. Then the Last Judge looks at the FAI’s report and decides whether or not to switch the AI’s goal system to implement the described world. The document should not be taken as representative of Yudkowsky’s current views, because it’s been marked obsolete, but I favor the idea of having a Last Judge check to make sure before anybody hits the red button.
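As a rough illustration of that gating idea (the names and structure below are my own toy rendering, not anything from the 2004 document): the system’s only initial goal is to produce a report, and the one-time switch to an implementation goal happens only on the judge’s sign-off.

```python
from dataclasses import dataclass

@dataclass
class GatedAI:
    # The goal system starts out as "describe only"; it never acts in this mode.
    goal: str = "compute_report"

    def compute_report(self):
        # Stand-in for "compute what an FAI should do" and write it up.
        return "description of the world the AI would try to bring about"

    def implement(self, report):
        # Stand-in for actually optimizing toward the described world.
        return f"acting on: {report}"

    def run(self, last_judge_approves):
        report = self.compute_report()
        if self.goal == "compute_report" and last_judge_approves(report):
            self.goal = "implement_report"   # the one-time switch the judge controls
        if self.goal == "implement_report":
            return self.implement(report)
        return None                          # the judge declined: nothing is implemented

# Example: a judge who declines everything.
print(GatedAI().run(lambda report: False))   # -> None
```

Unlike the propose-and-approve loop sketched earlier, this is a single one-way gate on the goal system rather than a per-proposal approval.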
Welcome!
So no more problem if it kills you! But what if it kills you and destroys itself in the process?
The answer to that depends on how the time machine inside works. If it’s based on a “reset unless a message from the future is received saying not to” sort of deal, then you’re fine. Otherwise, you die. And neither situation has an analogue in the related AI design.
I don’t think it prevents the wireheading scenario that many people consider undesirable. For instance, if an AI modifies everybody into drooling idiots who are made deliriously happy by pressing “Accept Outcome” as often and forcefully as possible, it wins.
Or more mundanely, if it achieves a button-press by other means, such as causing a building to collapse on you, with a brick landing on the button.