Take an outer-aligned system, then add a 0 to each training input and a 1 to each deployment input. Wouldn’t this add only malicious hypotheses that can be removed by inspection without any adverse selection effects?
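For concreteness, the proposed modification might look something like the sketch below; the array shapes and the `tag_inputs` helper are just illustrative assumptions, not part of any existing setup.

```python
import numpy as np

def tag_inputs(inputs: np.ndarray, in_training: bool) -> np.ndarray:
    """Append a flag column to each input: 0 during training, 1 at deployment."""
    flag_value = 0.0 if in_training else 1.0
    flag = np.full((inputs.shape[0], 1), flag_value)
    return np.concatenate([inputs, flag], axis=1)

# The system only ever sees flag = 0 while it is being trained,
# and only flag = 1 once it is deployed.
train_x = tag_inputs(np.random.rand(128, 16), in_training=True)
deploy_x = tag_inputs(np.random.rand(32, 16), in_training=False)
```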
After thinking about it for a couple of minutes, I find this question both more interesting and less trivial than it first seemed. The answer is not obvious to me.
On the face of it, passing in a bit which is always constant in training should do basically nothing—the system has no reason to use a constant bit. But if the system becomes reflective (i.e. an inner optimizer shows up and figures out that it’s in a training environment), then that bit could be used. In principle, this wouldn’t necessarily be malicious—the bit could be used even by aligned inner optimizers, as data about the world just like any other data about the world. That doesn’t seem likely with anything like current architectures, but it might happen in some weird architecture which systematically produces aligned inner optimizers.
The hypotheses after the modification are supposed to have knowledge that they’re in training, for example because they have enough compute to locate themselves in the multiverse. Among hypotheses with equal behavior in training, we select the simpler one. We want this to be the one that disregards that knowledge. If the hypothesis has the form “Return whatever maximizes property _ of the multiverse”, the simpler one uses that knowledge. It is this form of hypothesis which I suggest removing by inspection.
Ok, that should work assuming something analogous to Paul’s hypothesis about minimal circuits being daemon-free.
As far as I understand, whether minimal circuits are daemon-free is precisely the question of whether direct descriptions of the input distribution are simpler than hypotheses of the form “Return whatever maximizes property _ of the multiverse”.
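To make the earlier selection-plus-inspection step concrete: among hypotheses that reproduce the training behavior exactly, remove by inspection those of the explicit “maximize property _ of the multiverse” form, then keep the simplest of the rest. This is only a sketch; the `Hypothesis` fields, the complexity measure, and the `is_explicit_optimizer` flag are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

@dataclass
class Hypothesis:
    complexity: float                             # e.g. description length
    is_explicit_optimizer: bool                   # flagged by inspecting its form
    predict: Callable[[Sequence[float]], float]

def select_hypothesis(hypotheses: List[Hypothesis],
                      training_data: List[Tuple[Sequence[float], float]]) -> Hypothesis:
    # 1. Keep only hypotheses whose behavior matches the training data exactly.
    consistent = [h for h in hypotheses
                  if all(h.predict(x) == y for x, y in training_data)]
    # 2. Remove by inspection the "maximize property _ of the multiverse" form.
    inspected = [h for h in consistent if not h.is_explicit_optimizer]
    # 3. Among the rest, prefer the simplest one.
    return min(inspected, key=lambda h: h.complexity)
```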