Most of your points are valid, and Holden is pretty arrogant to think he sees an obvious solution that experts in the field are being irresponsible for not pursuing.
But I can see a couple ways around this argument in particular:
Example question: “How should I get rid of my disease most cheaply?” Example answer: “You won’t. You will die soon, unavoidably. This report is 99.999% reliable”. Predicted human reaction: Decides to kill self and get it over with. Success rate: 100%, the disease is gone. Costs of cure: zero. Mission completed.
Option 1: Forbid self-fulfilling prophecies—i.e. the AI cannot base its suggestions on predictions that are contingent upon the suggestions themselves. (Self-fulfilling prophecies are a common failure mode of human reasoning, so shouldn’t we defend our AIs against them?)
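A minimal sketch of how that check might look, under toy assumptions of my own (the world model, the candidate answers, and every name here are hypothetical, not anything actually proposed): the oracle asks its model for the predicted outcome twice, once with the answer delivered and once with silence, and discards any answer that changes the prediction.

```python
# Toy sketch of Option 1 (all names hypothetical). An answer counts as a
# self-fulfilling prophecy if delivering it changes the predicted outcome
# relative to the oracle saying nothing at all.

def predicted_outcome(answer_delivered):
    """Stand-in world model for the disease example."""
    if answer_delivered == "You will die soon, unavoidably.":
        return "patient gives up and dies"   # outcome brought about by the answer itself
    return "patient seeks treatment"         # outcome under silence or a benign answer

def is_self_fulfilling(answer):
    return predicted_outcome(answer) != predicted_outcome(None)

candidates = ["You will die soon, unavoidably.", "Treatment X is the cheapest cure."]
allowed = [a for a in candidates if not is_self_fulfilling(a)]
print(allowed)  # ['Treatment X is the cheapest cure.']
```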
Option 2: Indeed, it could be said that the first prediction really isn’t accurate, because the stated prediction was that the disease would kill you, not that the AI would convince you to kill yourself. This requires the AI to have a model of causation, but that’s probably necessary anyway. In fact, it will probably need a very rich model of causation, wherein “If X, then Y” does not mean the same thing as “X caused Y”. After all, we humans do.
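And a toy sketch of the causal point, again entirely my own illustration: when a hidden factor Z drives both X and Y, “If X, then Y” holds as a prediction even though intervening on X does nothing to Y, which is exactly the gap between prediction and causation the oracle’s model would need to represent.

```python
# Toy model (hypothetical numbers): Z causes both X and Y; X does not cause Y.
import random

random.seed(0)

def sample(intervene_x=None):
    z = random.random() < 0.5                    # hidden common cause
    x = z if intervene_x is None else intervene_x
    y = z                                        # Y depends only on Z, never on X
    return x, y

# Observational: among samples where X happened, how often did Y happen?
obs = [sample() for _ in range(10_000)]
p_y_given_x = sum(y for x, y in obs if x) / sum(x for x, _ in obs)

# Interventional: force X to happen and see how often Y happens.
do = [sample(intervene_x=True) for _ in range(10_000)]
p_y_do_x = sum(y for _, y in do) / len(do)

print(round(p_y_given_x, 2))  # 1.0: "If X, then Y" holds as a prediction
print(round(p_y_do_x, 2))     # ~0.5: forcing X changes nothing, so X did not cause Y
```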
Obviously both of these would need to be formalized, and could raise problems of their own; but it seems pretty glib to say that this one example proves we should make all our AIs completely ignore the question of whether their predictions are accurate. (Indeed, is it even possible to make an expected-utility maximizer that doesn’t care whether its predictions are accurate?)
Forbid self-fulfilling prophecies—i.e. the AI cannot base its suggestions on predictions that are contingent upon the suggestions themselves.
You can’t forbid self-fulfilling prophecies and still have a functioning AI. The whole point is to find a self-fulfilling prophecy that something good will happen. The problem illustrated is that the AI chose a self-fulfilling prophecy that ranked highly in the simply specified goal it was optimizing for, but ranked poorly in terms of what the human actually wanted. That is, the AI was fully capable of granting the wish as it understood it, but the wish it understood was not what the human meant to wish for.
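To make that concrete, here is a toy sketch of my own (the scores and candidate answers are made up, not part of the original example): a proxy objective of “disease gone, at minimal cost” is maximized by exactly the answer the human would never endorse, because “and the patient stays alive” was never part of the specification.

```python
# Hypothetical candidate answers and their predicted outcomes:
# (disease_gone, cost, patient_alive)
candidates = {
    "You will die soon, unavoidably.": (True, 0, False),
    "Treatment X will cure you.":      (True, 5000, True),
}

def proxy_score(outcome):
    disease_gone, cost, _alive = outcome
    return (1_000_000 if disease_gone else 0) - cost   # the goal as specified

def intended_score(outcome):
    disease_gone, cost, alive = outcome
    if not alive:
        return float("-inf")                            # the goal as meant
    return (1_000_000 if disease_gone else 0) - cost

print(max(candidates, key=lambda a: proxy_score(candidates[a])))     # "You will die soon, unavoidably."
print(max(candidates, key=lambda a: intended_score(candidates[a])))  # "Treatment X will cure you."
```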
Indeed, it could be said that the first prediction really isn’t accurate, because the stated prediction was that the disease would kill you, not that the AI would convince you to kill yourself.
This might sound nit-picky, but you started it :)
At no point does the example answer claim that the disease killed you. It just claims that it’s certain (a) you won’t get rid of it, and (b) you will die. That’d be technically accurate if the oracle planned to kill you with a meme, just as it would be accurate if it predicted that a piano would fall on you.
(You never asked about pianos, and it’s just a very carefully limited oracle so it doesn’t volunteer that kind of information.)
(I guess even if we got FAI right the first time, there’d still be a big chance we’d all die just because we weren’t paying enough attention to what it was saying...)