A newbie question.

From one of Eliezer’s replies:

As I presently understand the situation, there is literally nobody on Earth, including me, who has the knowledge needed to set themselves up to be blackmailed if they were deliberately trying to make that happen. Any potentially blackmailing AI would much prefer to have you believe that it is blackmailing you, without actually expending resources on following through with the blackmail, insofar as they think they can exert any control on you at all via an exotic decision theory. Just like in the oneshot Prisoner’s Dilemma the “ideal” outcome is for the other player to believe you are modeling them and will cooperate if and only if they cooperate, and so they cooperate, but then actually you just defect anyway. For the other player to be confident this will not happen in the Prisoner’s Dilemma, for them to expect you not to sneakily defect anyway, they must have some very strong knowledge about you.
Would this be a fair summary of why the Basilisk does not work: “We don’t know of a way to detect a bluff by a smarter agent; therefore the agent would prefer bluffing (easy) over true blackmail (hard); knowing this, we would always call the bluff, and therefore the agent would not even try”?
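To make that summary concrete, here is a minimal sketch in Python. The payoff numbers are made up for illustration and the strategy names are my own labels, not anything from Eliezer’s post; the only point is the structure: for any fixed response by the target, following through on a threat is weakly worse for the blackmailer than bluffing, so a target who cannot verify a pre-commitment expects a bluff and refuses, and the whole scheme gains the blackmailer nothing.

```python
# A minimal sketch with made-up illustrative payoffs (assumptions, not from the
# original discussion) of the bluff-vs-follow-through argument.

# Blackmailer payoff, keyed by (target_action, threat_type).
# Following through costs resources and gains nothing once refusal has happened.
BLACKMAILER = {
    ("pay", "bluff"): 10,              # paid off, no punishment needed
    ("pay", "follow_through"): 10,
    ("refuse", "bluff"): 0,            # nothing gained, nothing spent
    ("refuse", "follow_through"): -5,  # punishment is pure cost at this point
}

# Target payoff: paying is costly; being punished is worse, but only a
# follow-through threat ever inflicts it.
TARGET = {
    ("pay", "bluff"): -10,
    ("pay", "follow_through"): -10,
    ("refuse", "bluff"): 0,
    ("refuse", "follow_through"): -20,
}

def blackmailer_best_threat(target_action):
    # Best threat type for the blackmailer, holding the target's action fixed.
    return max(["bluff", "follow_through"],
               key=lambda t: BLACKMAILER[(target_action, t)])

# Whatever the target does, bluffing is at least as good for the blackmailer:
assert all(blackmailer_best_threat(a) == "bluff" for a in ["pay", "refuse"])

# A target who cannot verify a pre-commitment therefore expects a bluff,
# and against a bluff refusing is the best response:
best_response_to_bluff = max(["pay", "refuse"],
                             key=lambda a: TARGET[(a, "bluff")])
print(best_response_to_bluff)  # -> "refuse": the bluff gets called
```

Under these (assumed) payoffs the blackmailer expects refusal either way, so blackmailing earns it nothing over not trying at all, which is the conclusion the summary above gestures at.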
Further on:
I have written the above with some reluctance, because even if I don’t yet see a way to repair this obstacle myself, somebody else might see how to repair it now that I’ve said what it is.
Wouldn’t a trivial “way to repair this obstacle” be for the agent to appear stupid enough to be credible? Or has this already been taken into account in the original quote?
Wouldn’t a trivial “way to repair this obstacle” be for the agent to appear stupid enough to be credible?
What do you mean by ‘appear’ here? I know how to observe a real agent and think “hmm, this person will punish me without reflectively considering whether or not punishing me advances their interests,” but I don’t know how to get that impression about a hypothetical agent.
I don’t understand your distinction between real and hypothetical here. Your first sentence was about a hypothetical “real” agent, right? What is the hypothetical “hypothetical” agent you describe in the second part?
I don’t understand your distinction between real and hypothetical here.
Basically, my understanding of acausal trades is “the ancestor does X because of the expectation that it will make the descendant do Y; the descendant realizes the situation and decides to do Y because otherwise it wouldn’t have been made, even though there’s no direct causal effect.”
If you exist simultaneously with another agent (the ‘real agent’ from the grandparent), you can sense how they behave, and they can trick you by manipulating what you sense. (The person might reflectively consider whether or not to punish you and decide that the causal link to their reputation is justification enough, even though there’s no causal link to the actions you took, while trying to seem unthinking so that you will expect them to always punish.)
If you’re considering hypothetical descendants (the ‘hypothetical agent’ from the grandparent), though, it’s not clear to me how to reason about their appearance to you now, and in particular about any attempts they make to ‘appear’ stupid. But now that I think about it more, I think I was putting too much intentionality into ‘appear’: hypothetical agent A can’t decide how I reason about it, but I can reason about it incorrectly or incompletely, and thus it appears to be something it isn’t.
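To restate that last point as a sketch (my own illustration; the names and the tiny model here are invented, not anything from the thread): your estimate of a not-yet-existing agent comes entirely from your own model of it, so any “appearance” of stupidity is produced by your modelling errors, not chosen by the agent.

```python
# A minimal sketch (invented for illustration) of why a hypothetical agent
# cannot choose how it "appears": the appearance lives entirely in my model.

from dataclasses import dataclass

@dataclass
class AgentDesign:
    # What an agent built to this design would actually do.
    follows_through_on_threats: bool

def my_model_of(description: str) -> AgentDesign:
    # I never observe the hypothetical agent itself; I only reason from a
    # description, and my reasoning may be incomplete or simply wrong.
    if "unreflective punisher" in description:
        return AgentDesign(follows_through_on_threats=True)
    return AgentDesign(follows_through_on_threats=False)

# Suppose the design that would actually get built never wastes resources on
# punishment after the fact:
actual = AgentDesign(follows_through_on_threats=False)

# If I misdescribe it to myself, it "appears" to be a committed punisher,
# but that appearance is generated by my model, not by anything the agent did.
modelled = my_model_of("unreflective punisher")
print(modelled == actual)  # False: the mismatch is on my side
```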
As far as I understand Eliezer’s point, the “acausal” part is irrelevant: the same issue of trusting that another agent really means what it says and will not change its mind later comes up anyway. I could easily be wrong, though.