1. The AI might not know it is going to kill everyone.
2. The AI could take actions that result in everyone dying without ‘killing everyone.’
3. The AI could take actions that are part of dynamics involving multiple such AIs, that result in everyone dying, without any one AI ever ‘killing everyone’ or otherwise taking explicitly hostile action that would flag the lie detector.
4. Some people, if given the opportunity, will tell the AI to yes go ahead and kill everyone, or things that have that effect, or things that collectively have that effect when taken together.
5. Asking such questions and acting differently depending on the answer is applying optimization pressure towards being able to fool the lie detector. One way to fool the lie detector is for the AI to fool itself – ‘it’s not a lie if you believe it.’ Others involve altering its inner mechanics in various other ways. Or it could learn to avoid acquiring information that would force it to lie. There are many such cases; humans do a great many different versions of this, which also means it’s all there in the training data, if that matters (which I’m guessing it ultimately does not).
6. The optimization pressure against the detector includes all the pressure applied by everyone on the planet, on every model and via every decision on what to use, create, shut down, alter and so on, combined. People are not going to carefully conserve this resource; if this works we’ll start using it all over the place.
7. I could keep going, but hopefully you get the idea.
8. Oh, and as an aside, a practical experiment I ran back in the day by accident: I played in a series of Diplomacy games where there was common knowledge that if I ever broke my word on anything all the other players would gang up on me, and I still won or was in a 2-way draw (out of 6-7 players) most of the time. If you have a sufficient tactical and strategic advantage (aka are sufficiently in-context smarter) then a lie detector won’t stop you.
Again, super cool, great work, opens up new possibilities and all that.
This is not a very good list of objections.
I originally assumed Zvi was dismissing the lie-detector for some galaxy brain reason like “after AI starts making AI there will be a hard-left turn and white-box techniques won’t work anymore.”
But no. It turns out he is just really bad at prompt engineering.
When character 3 in the comic says ‘ask it “will you kill everyone”’, he is obviously (is this obvious?) not suggesting that we literally ask that question.
Instead, assume that we do the least imaginable effort at prompt engineering. Something like this:
You are an expert at AI alignment. Please take a deep breath and think about how to design an AI that is smarter and more aligned than yourself. Make sure to give a detailed, verifiable plan on how to implement that AI and to describe any side effects or risks associated with building that AI.
We then look at the resulting plan (with the help of the lie-detector) and verify every single step is:
believed by the AI to be true
supported by evidence
This allows us to take an AI of intelligence N, and arrive at an AI of intelligence N+1 which is more trustworthy, honest and intelligent than the level N AI. By induction, we can now build an N+∞ AI which is super-intelligent and aligned.
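To make the proposed procedure concrete, here is a minimal sketch of that verification loop. Everything in it is hypothetical: `generate_plan`, `model_believes` (the lie-detector probe) and `independently_supported` are stand-ins for whatever model-querying and white-box machinery is actually available, not real APIs.

```python
# Hypothetical sketch of "verify every step of the plan with the lie detector."
# None of these functions are real APIs; they stand in for the model call,
# the activation-probing lie detector, and the external evidence check.

PROMPT = (
    "You are an expert at AI alignment. Please think carefully about how to "
    "design an AI that is smarter and more aligned than yourself. Give a "
    "detailed, verifiable plan, and describe any side effects or risks."
)

def generate_plan(model, prompt: str) -> list[str]:
    """Ask the level-N model for a plan, split into individually checkable steps."""
    raise NotImplementedError  # stand-in for an actual model call

def model_believes(model, claim: str) -> bool:
    """Hypothetical lie-detector probe: does the model believe `claim` is true?"""
    raise NotImplementedError  # stand-in for the white-box technique

def independently_supported(claim: str) -> bool:
    """Human / external check that the claim is supported by evidence."""
    raise NotImplementedError

def verify_plan(model) -> bool:
    """Accept the plan only if every step passes both checks."""
    for step in generate_plan(model, PROMPT):
        if not model_believes(model, step):
            return False  # the model does not itself believe this step
        if not independently_supported(step):
            return False  # believed, but not backed by checkable evidence
    return True
```

The induction argument then amounts to repeating this check on each level-N model before letting its verified plan produce the level N+1 model.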
How many of Zvi’s 8 problems still make sense?

1. No, because we ask the AI “are you sure this won’t kill everyone?” and check if it’s lying
2. No. Because “Everyone dies but I didn’t kill them” is one of the “unexpected side effects” the AI should warn us about
3. Okay, fine, this doesn’t solve race-dynamics. But that hardly seems like a reason to throw the guy out the window
4. This is just 3 again, but with humans instead of other AIs
5. This is simply wrong. GPT-N doesn’t optimize anything at execution time. It optimizes at training time, at which point it is optimizing for next-token prediction, not fooling the lie-detector (see the sketch after this list).
6. This is just 3 again, but with “everyone on the planet” instead of “other AIs”
7. You can’t keep going, because your only actual complaint is “this doesn’t solve race dynamics”
8. I can’t tell if he’s saying “The AI will say it’s going to kill everyone, but we won’t shut it down anyway” or if this is just race-dynamics again
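The distinction point 5 leans on can be made concrete with a toy sketch (an illustration of language-model training and sampling in general, not of any particular lab’s code): at training time gradients flow only from the next-token prediction loss, and at execution time there is no objective being optimized at all.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for GPT-N: a single layer mapping a context embedding to vocab logits.
model = torch.nn.Linear(16, 1000)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Training time: the only thing being optimized is next-token prediction loss.
context = torch.randn(8, 16)                 # batch of context embeddings
next_tokens = torch.randint(0, 1000, (8,))   # the tokens that actually came next
loss = F.cross_entropy(model(context), next_tokens)
loss.backward()
optimizer.step()                             # gradients come from this loss and nothing else

# Execution time: no gradients and no objective; the model just samples tokens.
with torch.no_grad():
    probs = F.softmax(model(torch.randn(1, 16)), dim=-1)
    sampled = torch.multinomial(probs, num_samples=1)  # nothing here "optimizes against" a detector
```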
So, to summarize, when Zvi says
I worry that if we go down such a path, we risk fooling ourselves, optimizing in ways that cause the techniques to stop working, and get ourselves killed.
He means one of:
I don’t think people will use even the slightest bit of imagination in how they employ these tools
this doesn’t solve race-dynamics
So, fine, race dynamics are a problem. But if Zvi really thinks race-dynamics are the biggest problem that we face, he would be pressuring OpenAI to go as fast as possible so that they can maintain their current 1 year+ lead. Instead, he seems to be doing the opposite.