After thinking about it for a couple of minutes, this question is both more interesting and less trivial than it first seemed. The answer is not obvious to me.
On the face of it, passing in a bit which is always constant in training should do basically nothing: the system has no reason to use a constant bit. But if the system becomes reflective (i.e. an inner optimizer shows up and figures out that it’s in a training environment), then that bit could be used. In principle, this wouldn’t necessarily be malicious: the bit could be used even by aligned inner optimizers, as data about the world just like any other data about the world. That doesn’t seem likely with anything like current architectures, but it might happen in some weird architecture which systematically produces aligned inner optimizers.
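To make the worry concrete, here is a minimal toy sketch (my own construction, not part of the original setup): two linear models that behave identically while a training-only flag is held at 1, where one ignores the flag and the other hides an offset behind it, so nothing in training distinguishes them and they diverge the moment the flag flips at deployment.

```python
# Toy sketch, my own construction: a "training flag" input that is constant
# (always 1) during training, and two weight settings with identical training
# behavior: one ignores the flag, the other hides an offset behind it.
import numpy as np

rng = np.random.default_rng(0)

n = 100
x = rng.normal(size=(n, 1))           # ordinary input feature
flag = np.ones((n, 1))                # constant-in-training bit
X = np.hstack([x, flag])              # shape (n, 2)
y = 3.0 * x[:, 0] + 1.0               # target depends only on x (plus a constant)

# Hypothesis A ignores the flag entirely; hypothesis B uses it to hide an offset.
w_a, b_a = np.array([3.0, 0.0]), 1.0
w_b, b_b = np.array([3.0, -41.0]), 42.0

print(np.allclose(X @ w_a + b_a, y))  # True: A fits the training data exactly
print(np.allclose(X @ w_b + b_b, y))  # True: so does B, i.e. equal training behavior

# At deployment the flag flips to 0 and the two hypotheses come apart.
X_deploy = np.hstack([x, np.zeros((n, 1))])
print(np.allclose(X_deploy @ w_a + b_a, X_deploy @ w_b + b_b))  # False
```

The training data gives no reason to prefer one weight setting over the other, which is exactly why it matters which one the training process ends up producing.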
The hypotheses after the modification are supposed to have knowledge that they’re in training, for example because they have enough compute to find themselves in the multiverse. Among hypotheses with equal behavior in training, we select the simpler one. We want this to be the one that disregards that knowledge. If the hypothesis has the form “Return whatever maximizes property _ of the multiverse”, then the simpler variant is the one that uses that knowledge. It is this form of hypothesis which I suggest removing by inspection.
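To make the selection rule concrete, here is a toy sketch (my own framing, with entirely illustrative hypotheses): among hypotheses that behave identically on the training inputs, pick the one with the shortest written description, a crude stand-in for “the simpler one”.

```python
# Toy sketch: "simplicity" is crudely stood in for by the length of a
# hypothesis's written description; all names and hypotheses are illustrative.
from typing import Callable, Sequence, Tuple

def agree_on_training(h1: Callable, h2: Callable, train_inputs: Sequence) -> bool:
    """True iff the two hypotheses behave identically on every training input."""
    return all(h1(inp) == h2(inp) for inp in train_inputs)

def select_simplest(hypotheses: Sequence[Tuple[str, Callable]],
                    train_inputs: Sequence) -> Tuple[str, Callable]:
    """Among hypotheses with equal training behavior, return the one with the
    shortest description (a crude proxy for 'the simpler one')."""
    reference = hypotheses[0][1]
    equivalent = [(desc, h) for desc, h in hypotheses
                  if agree_on_training(h, reference, train_inputs)]
    return min(equivalent, key=lambda pair: len(pair[0]))

# Training inputs carry a flag that is always 1 in training.
train_inputs = [(x, 1) for x in range(10)]
hypotheses = [
    ("3*x + 1",            lambda inp: 3 * inp[0] + 1),
    ("3*x + 42 - 41*flag", lambda inp: 3 * inp[0] + 42 - 41 * inp[1]),
]
print(select_simplest(hypotheses, train_inputs)[0])  # -> 3*x + 1
```

In this toy case the shorter description happens to be the one that ignores the flag, which is the outcome we want; the worry above is that for hypotheses of the “maximize property _ of the multiverse” form, the ordering goes the other way.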
Ok, that should work assuming something analogous to Paul’s hypothesis about minimal circuits being daemon-free.
As far as I understand, whether minimal circuits are daemon-free is precisely the question of whether direct descriptions of the input distribution are simpler than hypotheses of the form “Return whatever maximizes property _ of the multiverse”.
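One schematic way to write that comparison (my own notation, not anything stated in the original discussion): writing |f| for the description length, or minimal circuit size, of a hypothesis f, the daemon-free claim amounts to the direct description being the smaller of the two.

```latex
% Schematic only; the notation is mine, not from the original discussion.
% f_direct    : hypothesis that directly describes the input distribution
% f_daemon,P  : hypothesis of the form "return whatever maximizes property P of the multiverse"
% | . |       : description length / size of the smallest circuit implementing the hypothesis
\[
  | f_{\mathrm{direct}} | \;<\; \min_{P} \, | f_{\mathrm{daemon},P} |
\]
```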