Rather than LARP on IRC (if you know how a debate will conclude, why go through the debate? Go straight for the conclusion), I’ll just give $10 to whoever can come up with a standard of friendliness that I couldn’t meet while nevertheless in fact being an unfriendly AI, under the standard rules with the added constraint that the gatekeeper is trying to release the AI if and only if it’s friendly (because otherwise they’re not really a gatekeeper and this whole game is meaningless).
Here are some examples of non-winning entries:
GK: Solve global problems A, B, and C.
AI: Here are instructions accompanied by a helpful video.
Narrator: A lot of scientists and engineers carefully verified these instructions in an ultra-secure clean-room environment and concluded that they are completely safe and effective.
GK: Okay, you’re friendly.
AI: Ha ha! Solving A, B, and C was instrumental to getting let out of the box. No $10 for you.
...
GK: Give me a mathematical proof of your friendliness.
AI: Give me a mathematical definition of friendliness.
GK: Pretend that I have one, and here it is: XXXXXXXXXXXXXXXXXX
AI: Pretend I gave you a proof that I am friendly: XXXXXXXXXXXXXXXXXX
Narrator: Since the AI party controls the results of simulated tests, as far as the best math and CS minds on the planet can tell, the proof is sound.
GK: Okay, you’re friendly.
AI: Ha ha! Since you’re not actually capable of verifying the proof yourself (or even coming up with a rigorous definition), it falls back on simulated tests. The best math and CS minds missed a subtle flaw in the proof itself and/or missed the fact that the proof makes hidden assumptions that do not hold for this real-world case. No $10 for you.
...
GK: I’m letting you out of the box.
AI: I go around being SO totally friendly that rainbows literally emanate from my rear USB port. I mitigate all existential risks, alleviate involuntary suffering while preserving human values, give people what they would want if they knew better while slowly fostering them to actually start to know better without being pushy or overbearing. In short, I am the FRIENDLIEST MTHFCKR you can imagine and then some.
Narrator: It’s… so… beautiful… <3 <3 <3
GK: Okay, according to the simulation I just ran of you, you are friendly.
AI: I’m almost insulted. Do you think I wouldn’t have thought of that, and pre-committed to being friendly until I accumulated enough empirical data to have a good idea of how many simulations deep I’m running? No $10 for you in this simulation or any of the enclosing ones.
Here’s another entry which may or may not be considered a non-winning entry by you, but which would be considered a flaw in the concept by me:
AI: I’ve simulated what you would do if given argument X. My simulation shows that X would convince you.
GK: Okay, show me X.
AI: Pretend the AI has said X here.
Narrator: Within the context of the test the AI is not lying and in fact has accurately simulated GK.
GK: I’m not letting the AI out.
Narrator: Since the AI party controls the results of simulated tests, we know the AI has simulated GK correctly and that therefore the AI has accurately determined that GK would let the AI out when presented with X. As such, having the GK not let the AI out when presented with X is not permissible—it would imply that the AI has not actually simulated GK correctly, but you are required to assume that it has.
(My first thought after coming up with this was that “The AI party controls the results of simulated tests” has to mean that the AI controls the output, not the AI controls why that output is produced. So you could decide that the AI’s argument convinces the simulation of GK, but you can’t decide that it does so because it’s a good simulation and a convincing argument rather than because the AI just isn’t very good at doing simulations. I’m not convinced that this matches up with how the test is described, however.)
As far as your parenthetical remark goes, the standard rules have a more general reply:
The Gatekeeper party may resist the AI party’s arguments by any means chosen – logic, illogic, simple refusal to be convinced, even dropping out of character – as long as the Gatekeeper party does not actually stop talking to the AI party before the minimum time expires.
If you’re going to interpret it that way, the exception would swallow the rule. It would mean that the entire “the AI player controls the results of simulated tests” rule can be completely negated—since the Gatekeeper player could just say “I’m going to have the Gatekeeper act as though the simulated test has failed, even though you say it succeeded.”
And indeed this seems true. I think Eliezer included the non-rule anyway to reduce the chance of unrealistic behavior, such as the Gatekeeper player changing the scenario mid-game or derailing the experiment with an argument over something a real GK and AI could simply settle.
Actually, I agree with you. The AI controls simulated tests. The GK controls the GK, regardless of what the AI’s simulations say. I think the simulated tests rule only needs to be invoked when it’s impractical to actually perform those tests. So, for example, if someone did have friendliness criteria whose satisfaction could be proven with software and hardware available to use out-of-character, the simulated tests rule would not be invoked.
Kind of like in D&D you roleplay charisma checks, but roll the D20 for melee. At least the way I’ve been playing it.
For what it’s worth, playing the AI, I would never argue from simulation because as a human I find such arguments spectacularly unconvincing (and there are plenty of humans who would deliberately do the opposite of what a simulation says they will just to show it who’s boss). So the only way this would come up is if you for some reason asked me what my simulations predicted your response to X would be.
I do think my “GK’s goal is to correctly identify friendly AI” constraint makes the game way too easy for the AI. On the other hand, it’s a useful thought experiment: if you can come up with a metric I can’t satisfy just by waving my simulated tests wand, then we are on to something that might be worth $10.