Unpacking “mutual information,” it seems like these designs basically take the form of an adversarial game:
The model computes some intermediate states.
An adversary tries to extract facts about the “unknowable” X.
The model is trained so that the adversary can’t succeed (see the sketch after this list).
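To make the game concrete, here is a minimal sketch, assuming a PyTorch-style setup. Everything in it (the encoder, the probe, the toy data, the name x_secret) is my own illustrative stand-in rather than part of any particular proposal: z plays the role of the model’s intermediate states, x_secret plays the role of the “unknowable” X, and the model is trained both to do its task and so that a small probe can’t read X off of z.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: inputs carry both a task-relevant bit and a protected bit X.
n = 512
x_secret = torch.randint(0, 2, (n,))      # the facts the adversary is probing for
task_label = torch.randint(0, 2, (n,))    # what the model is supposed to predict
inputs = torch.randn(n, 8) + 2.0 * x_secret[:, None] + 1.0 * task_label[:, None]

encoder = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 16))    # intermediate states z
task_head = nn.Linear(16, 2)                                               # the model's own objective
adversary = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 2))  # tries to read X off of z

opt_model = torch.optim.Adam(list(encoder.parameters()) + list(task_head.parameters()), lr=1e-2)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-2)
ce = nn.CrossEntropyLoss()

for step in range(200):
    # (1) The adversary tries to extract X from the intermediate states.
    z = encoder(inputs).detach()
    adv_loss = ce(adversary(z), x_secret)
    opt_adv.zero_grad()
    adv_loss.backward()
    opt_adv.step()

    # (2) The model is trained to do its task *and* so that the adversary fails.
    z = encoder(inputs)
    model_loss = ce(task_head(z), task_label) - ce(adversary(z), x_secret)
    opt_model.zero_grad()
    model_loss.backward()
    opt_model.step()
```

Note that the adversary only ever sees z; it learns about X only through the labels it is trained on, which is exactly the issue raised next.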
But this rests on the adversary not already knowing about X (otherwise its success wouldn’t tell us whether it actually extracted anything from the model).
In the case of mutual information, this is achieved formally by having a random variable that the adversary does not observe directly. If we are talking about “what humans are like,” then we can’t take the naive mutual-information approach (since we can’t deploy the entire training process many times in different worlds where humans are different).
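To spell out why that is, with a toy construction of my own: the mutual information I(Z; X) is a property of the joint distribution over X and everything computed downstream of it, so the obvious way to estimate it is to rerun the whole process under many independent draws of X, which is exactly what we can’t do here.

```python
# Toy plug-in estimate of I(Z; X), assuming we *can* resample X and rerun the process.
import numpy as np

rng = np.random.default_rng(0)

def run_world(x):
    """One 'deployment' of the whole process in a world where the hidden bit is x."""
    return x ^ int(rng.random() < 0.1)   # Z ends up a noisy copy of X

samples = [(x, run_world(x)) for x in rng.integers(0, 2, size=100_000)]
joint = np.zeros((2, 2))
for x, z in samples:
    joint[x, z] += 1
joint /= joint.sum()
px, pz = joint.sum(axis=1), joint.sum(axis=0)
mi = sum(joint[i, j] * np.log2(joint[i, j] / (px[i] * pz[j]))
         for i in range(2) for j in range(2) if joint[i, j] > 0)
print(f"estimated I(Z; X): {mi:.3f} bits")   # about 1 - H(0.1), roughly 0.53 bits
```

With only one world, and hence only one draw of X, there is no distribution for this estimate to be about.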
So what do we do instead? The obvious approach is to just train the adversary to answer questions about humans, but then we somehow need to prevent the adversary from simply learning the facts themselves rather than extracting them from the model. If instead we don’t give the adversary much time to learn, or much compute to work with, then we need to worry about cases where the model learns about X but is able to easily obscure that information from the adversary.
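Here is a toy version of that worry (again my own example, not from the original): the intermediate state stores the protected bit as the XOR of two components, so a compute-limited adversary like a quickly-trained linear probe reports chance accuracy even though the information is trivially recoverable.

```python
# Toy failure mode: the information about X is present in z but "obscured"
# from a weak adversary (a linear probe). All names here are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
x_secret = rng.integers(0, 2, size=n)

# The model stores X as the XOR of two components of its state.
a = rng.integers(0, 2, size=n)
z = np.stack([a, a ^ x_secret], axis=1).astype(float)

probe = LogisticRegression().fit(z, x_secret)                 # the limited adversary
print("linear probe accuracy:", probe.score(z, x_secret))     # ~0.5, i.e. looks clean
recovered = (z[:, 0].astype(int) ^ z[:, 1].astype(int)) == x_secret
print("actually recoverable:", recovered.mean())              # 1.0
```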
(Mostly I’m dissuaded from this approach by other considerations, but I am still interested in whether we could make anything along these lines actually work.)