Vanessa Kosoy comments on Formal Solution to the Inner Alignment Problem

Vanessa Kosoy 8 Mar 2021 23:28 UTC
LW: 5 AF: 4
I understand how this model explains why agents become unaligned under distributional shift. That’s something I never disputed. However, I don’t understand how this model applies to my proposal. In my proposal, there is no distributional shift, because (thanks to the realizability assumption) the real environment is sampled from the same prior that is used for training with $C$. The model can’t choose to act deceptively during training, because it can’t distinguish between training and deployment. Moreover, the objective I described is not complicated.
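A minimal sketch of why realizability supports this claim, assuming a countable hypothesis class with prior $\zeta$, a true environment $\mu$ with $\zeta(\mu) > 0$, and a nonnegative loss $L$ (these symbols are illustrative assumptions, not notation from the thread):

$$
\mathbb{E}_{\nu \sim \zeta}[L(\nu)] = \sum_{\nu} \zeta(\nu)\, L(\nu) \;\ge\; \zeta(\mu)\, L(\mu)
\quad\Longrightarrow\quad
L(\mu) \;\le\; \frac{1}{\zeta(\mu)}\, \mathbb{E}_{\nu \sim \zeta}[L(\nu)]
$$

Under these assumptions, any bound on the prior-expected loss carries over to the true environment, weakened by a factor of $1/\zeta(\mu)$. In that sense the true environment can be treated as if it were a sample from the prior.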
- evhub 8 Mar 2021 23:50 UTC
  LW: 2 AF: 2
  Yeah, that’s a fair objection. My response is just that I think preventing a model from being able to distinguish training from deployment is likely to be impossible for anything competitive.
  - Vanessa Kosoy 9 Mar 2021 0:35 UTC
    LW: 2 AF: 1
    Okay, but why? I think the reason you have this intuition is that the realizability assumption is false. But then you should concede that the main weakness of the OP is the realizability assumption rather than the difference between deep learning and Bayesianism.
    - evhub 9 Mar 2021 3:12 UTC
      LW: 2 AF: 2
      Perhaps I just totally don’t understand what you mean by realizability, but I fail to see how realizability is relevant here. As I understand it, realizability just says that the true model has some non-zero prior probability—but that doesn’t matter (at least for the MAP, which I think is a better model than the full posterior for how SGD actually works) as long as there’s some deceptive model with greater prior probability that’s indistinguishable on the training distribution, as in my simple toy model from earlier.
      - Vanessa Kosoy 9 Mar 2021 20:13 UTC
        LW: 4 AF: 3
        When talking about uniform (worst-case) bounds, realizability just means the true environment is in the hypothesis class, but in a Bayesian setting (like in the OP) it means that our bounds scale with the probability of the true environment in the prior. Essentially, it means we can pretend the true environment was sampled from the prior. So, if (by design) training works by sampling environments from the prior, and (by realizability) deployment also consists of sampling an environment from the same prior, training and deployment are indistinguishable.
        - evhub 9 Mar 2021 21:42 UTC
          LW: 2 AF: 2
          Sure. By that definition of realizability, I agree that’s where the difficulty is. Though I would seriously question the practical applicability of such an assumption.
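The MAP point raised in the thread can be illustrated with a toy calculation. This is only a sketch under stated assumptions: the two hypothesis names, the prior weights, and the likelihood values are hypothetical numbers chosen for illustration, not anything from the original post. It shows that when two hypotheses fit the training data equally well, the data never moves the posterior away from the prior, so the MAP estimate is whichever hypothesis had more prior mass, even though the "true" hypothesis has non-zero prior probability.

```python
from math import prod

# Hypothetical prior: the "aligned" hypothesis is realizable (non-zero mass)
# but has less prior mass than the "deceptive" one. Illustrative numbers only.
priors = {"aligned": 0.2, "deceptive": 0.8}


def likelihood(hypothesis, observation):
    # By construction, both hypotheses assign the same probability to every
    # training observation, so they are indistinguishable on training data.
    return 0.9 if observation == "good_behavior" else 0.1


def posterior(priors, observations):
    # Bayes' rule: prior times likelihood of the data, then normalize.
    unnormalized = {
        h: p * prod(likelihood(h, o) for o in observations)
        for h, p in priors.items()
    }
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


training_data = ["good_behavior"] * 50
post = posterior(priors, training_data)

# The MAP estimate keeps only the single most probable hypothesis.
map_hypothesis = max(post, key=post.get)
print(map_hypothesis)
print(post)
```

In this toy setup the full posterior still places the prior weight on the aligned hypothesis, while the MAP estimate discards it entirely. That is the gap between the two positions in the thread: a bound that scales with the prior probability of the true environment says something about the full posterior, but not necessarily about a point estimate.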