The “instrumental” strategy … seems to be unnecessarily computationally complex. First it figures out what’s true, and then it strategically decides what to say in light of that. It would be a bit cheaper just to actually report what’s true, if we set up the training process well enough that honest reporting got you optimal reward.
This seems intuitive, but I don’t think there’s actually much of a distinction in complexity.
Specifically, the constraint “Respond honestly” doesn’t uniquely determine a response—unless we’re only considering questions where you’re able to specify the precise form of the answer ahead of time. In general, you also have to decide which honest statements to make, to what precision, and with what context, explanations, and caveats.
So it seems as though we’re comparing:
BAD: Figure out what’s true, and then strategically decide what to say based on what will satisfy the trainer.
M*: Figure out what’s true, and then decide which honest statements to make and in what form, based on what’s relevant, helpful, useful, etc.
M* is searching a smaller space, so I’d guess it’d usually be faster, but that’s not immediately clear (to me at least). Both are going to have to compute some version of “What does the trainer want to hear?”.
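To make the comparison concrete, here is a minimal sketch of the two policies as search procedures over candidate statements. Everything in it is hypothetical and introduced only for illustration: the `Statement` type, the `predicted_reward` and `usefulness` scoring functions, and the function names themselves.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Statement:
    text: str
    honest: bool  # whether the model believes this statement is true

def bad_strategy(candidates: list[Statement],
                 predicted_reward: Callable[[Statement], float]) -> Statement:
    # BAD: search the full space of candidate statements, honest or not,
    # and say whatever the trainer is predicted to reward most.
    return max(candidates, key=predicted_reward)

def m_star(candidates: list[Statement],
           usefulness: Callable[[Statement], float]) -> Statement:
    # M*: filter to honest statements first, then pick by relevance/
    # helpfulness. The honesty constraint shrinks the search space, but
    # scoring usefulness still requires some model of the audience.
    return max((s for s in candidates if s.honest), key=usefulness)
```

The honesty filter in `m_star` is what makes its search space smaller, but note that both selection steps call a scoring function that has to model the audience in some form, which is why the complexity gap is smaller than it first looks.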
I think this is the key issue. There’s a difference between inaccessible-as-in-expensive, and inaccessible-as-in-there-is-no-unique-answer.
If it’s really impossible to tell what Alice is thinking, the safe bet is not that Alice has some Platonic property of What She’s Really Thinking that we just can’t access. The safe bet is that the abstract model of the world we have, in which there’s some unique answer to what Alice is thinking, doesn’t match reality well here. What we want isn’t an AI that accesses Alice’s Platonic properties; we want an AI that figures out what’s “relevant, helpful, useful, etc.”