Here are my readings of the arguments.
Stuart claims that if you give the system a fixed objective R, then it’s incentivized to ensure it can achieve that goal. Naturally, it takes any advantage to stop us from stopping it from best achieving the fixed objective.
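To make the incentive concrete, here is a minimal toy sketch (my own illustration with made-up numbers, not anything from Stuart's actual argument): an agent that ranks plans purely by expected achievement of a fixed objective R will rank "disable the off-switch" above "leave it alone" whenever being switched off carries any chance of leaving R unachieved.

```python
# Toy illustration (assumed numbers): an agent scoring plans only by expected
# achievement of a fixed objective R, where being shut down mid-plan means R
# goes unachieved.

P_SHUTDOWN = 0.10     # assumed chance humans hit the off-switch mid-plan
R_ACHIEVED = 1.0      # payoff under the fixed objective R
R_UNACHIEVED = 0.0    # payoff if the agent is stopped first

def expected_R(disable_off_switch: bool) -> float:
    """Expected value of the fixed objective R for a plan."""
    p_stopped = 0.0 if disable_off_switch else P_SHUTDOWN
    return (1 - p_stopped) * R_ACHIEVED + p_stopped * R_UNACHIEVED

plans = {
    "leave the off-switch alone": expected_R(disable_off_switch=False),   # 0.90
    "disable the off-switch first": expected_R(disable_off_switch=True),  # 1.00
}

# An optimizer of R alone strictly prefers disabling the off-switch, for any
# nonzero P_SHUTDOWN -- no malice required, just the fixed objective.
best_plan = max(plans, key=plans.get)
print(plans, "->", best_plan)
```

The specific numbers don't matter; the point is that the ranking comes out the same way for any nonzero shutdown probability.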
Yann’s response reads as “who would possibly be so dumb as to build a system which would be incentivized to stop us from stopping it?”.
Now, I also think this response is weak. But let’s consider whether it would be a reasonable response even if one could justifiably be confident that it will be easy to see when systems are flawed and easy to build safeguards; in my opinion, neither is remotely likely to be true.
Suppose you have access to a computer terminal. You have a mathematical proof that if you type random characters into the terminal and then press Enter, one of the most likely outcomes is that it explodes and kills you. Now, a response analogous to Yann’s is: “who would be so dumb as to type random things into this computer? I won’t. Time to type my next paper.”

I think a wiser response would be to ask what about this computer kills you if you type in random stuff, and think very, very carefully before you type anything. If you have text you want to type, you should be checking it against your detailed gears-level model of the computer to assure yourself that it won’t trigger whatever-causes-explosions. It’s not enough to say that you won’t type random stuff.
ETA: In particular, if, after learning of this proof, you find yourself still exactly as enthusiastic to type as you were before—well, we’ve discussed that failure mode before.
I think Yann thinks typing random things into the computer is benign, and that there is only a narrow band of dumb queries that make it explode.