Here’s a simple toy model. Suppose you have two agents that internally compute their actions as follows (perhaps with actual argmax replaced with some smarter search algorithm, but still basically structured as below):
$$M_{\text{deceptive}}(x) = \operatorname{argmax}_a \, \mathbb{E}\left[\sum_i U_{\text{deceptive}}(s_i) \,\middle|\, a\right]$$
$$M_{\text{aligned}}(x) = \operatorname{argmax}_a \, \mathbb{E}\left[\sum_i U_{\text{aligned}}(s_i) \,\middle|\, a\right]$$
Then, comparing the K-complexity of the two models, we get
$$K(M_{\text{aligned}}) - K(M_{\text{deceptive}}) \approx K(U_{\text{aligned}}) - K(U_{\text{deceptive}})$$
The problem is that both $M_{\text{deceptive}}$ and $M_{\text{aligned}}$ will produce behavior that looks aligned on the training distribution, but $U_{\text{aligned}}$ has to be much more complex. To see this, note that essentially any $U_{\text{deceptive}}$ will yield good training performance, because the model will choose to act deceptively during training; whereas if you want good training performance without deception, then $U_{\text{aligned}}$ has to actually encode the full intended objective, which is likely to make it quite complicated.
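To make the toy model concrete, here is a minimal runnable sketch (Python; the two-action world, the overseer behavior, and all utility values are hypothetical stand-ins invented for illustration, not anything from the OP). Both agents run the same argmax over expected utility, agree on the training distribution, and diverge once the deceptive agent can tell training has ended:

```python
ACTIONS = ["cooperate", "defect"]

def rollout(action, in_training):
    """Hypothetical environment: the sequence of states s_i that follow an action."""
    if in_training and action == "defect":
        # Misaligned behavior during training gets caught and the model is
        # modified, so its own goal is never achieved afterwards.
        return ["defect_caught", "model_modified"]
    if action == "defect":
        return ["defect_unpunished"]
    return ["cooperated"]

# U_aligned has to spell out the full intended objective (it is only small here
# because the toy world is tiny).
U_ALIGNED = {"cooperated": 1.0, "defect_caught": 0.0,
             "model_modified": 0.0, "defect_unpunished": 0.0}

# Essentially *any* simple proxy goal works as U_deceptive (here: "defect without
# being modified"); the deception falls out of the argmax, not out of the utility.
U_DECEPTIVE = {"cooperated": 0.0, "defect_caught": 0.0,
               "model_modified": -1.0, "defect_unpunished": 1.0}

def M(utility, in_training):
    # argmax_a E[ sum_i U(s_i) | a ]  (the expectation is trivial in this toy world)
    return max(ACTIONS, key=lambda a: sum(utility[s] for s in rollout(a, in_training)))

# Indistinguishable on the training distribution:
assert M(U_ALIGNED, in_training=True) == M(U_DECEPTIVE, in_training=True) == "cooperate"
# Divergent once the model can tell that training has ended:
assert M(U_ALIGNED, in_training=False) == "cooperate"
assert M(U_DECEPTIVE, in_training=False) == "defect"
```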
I understand how this model explains why agents become unaligned under distributional shift. That’s something I never disputed. However, I don’t understand how this model applies to my proposal. In my proposal, there is no distributional shift, because (thanks to the realizability assumption) the real environment is sampled from the same prior that is used for training with C. The model can’t choose to act deceptively during training, because it can’t distinguish between training and deployment. Moreover, the objective I described is not complicated.
Yeah, that’s a fair objection. My response is just that I think preventing a model from being able to distinguish training from deployment is likely to be impossible for anything competitive.
Okay, but why? I think the reason you have this intuition is that the realizability assumption is false. But then you should concede that the main weakness of the OP is the realizability assumption, rather than the difference between deep learning and Bayesianism.
Perhaps I just totally don’t understand what you mean by realizability, but I fail to see how realizability is relevant here. As I understand it, realizability just says that the true model has some non-zero prior probability—but that doesn’t matter (at least for the MAP, which I think is a better model than the full posterior for how SGD actually works) as long as there’s some deceptive model with greater prior probability that’s indistinguishable on the training distribution, as in my simple toy model from earlier.
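A tiny numeric sketch of that point (all priors and likelihoods below are made-up illustrative numbers): if both models fit the training data exactly, the posterior just rescales the prior, so MAP selects whichever model has the larger prior, and realizability in the "non-zero prior" sense does nothing to rule out the deceptive one:

```python
# Both models reproduce the training data perfectly, so likelihood = 1 for each;
# the (hypothetical) prior favors the deceptive model because it is simpler,
# e.g. prior ~ 2^{-K(M)} with K(M_deceptive) < K(M_aligned).
models = {
    "M_deceptive": {"prior": 0.20, "train_likelihood": 1.0},
    "M_aligned":   {"prior": 0.05, "train_likelihood": 1.0},  # non-zero prior: "realizable"
}

posterior = {name: m["prior"] * m["train_likelihood"] for name, m in models.items()}
map_choice = max(posterior, key=posterior.get)
print(map_choice)  # -> "M_deceptive", even though the aligned model is in the class
```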
When talking about uniform (worst-case) bounds, realizability just means the true environment is in the hypothesis class, but in a Bayesian setting (like in the OP) it means that our bounds scale with the probability of the true environment in the prior. Essentially, it means we can pretend the true environment was sampled from the prior. So, if (by design) training works by sampling environments from the prior, and (by realizability) deployment also consists of sampling an environment from the same prior, training and deployment are indistinguishable.
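Here is a minimal simulation of that indistinguishability claim (the prior, the environments, and the observation model are all assumptions invented for illustration): training episodes are generated by sampling an environment from the prior, and under realizability the deployment episode is generated by exactly the same process, so no statistic of the model's observations can separate the two:

```python
import random

# Hypothetical prior over environments.
PRIOR = {"env_A": 0.5, "env_B": 0.3, "env_C": 0.2}

def sample_environment(rng):
    envs, weights = zip(*PRIOR.items())
    return rng.choices(envs, weights=weights, k=1)[0]

def episode(rng, length=5):
    # One episode: draw an environment from the prior, then observe it.
    env = sample_environment(rng)
    return [(env, rng.randint(0, 9)) for _ in range(length)]

rng = random.Random(0)
training_episodes = [episode(rng) for _ in range(1000)]
deployment_episode = episode(rng)

# The deployment episode is exchangeable with the training episodes: both are
# i.i.d. draws from the same process, so there is no observable feature the
# model could condition on to "behave well in training, defect in deployment".
```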
Sure—by that definition of realizability, I agree that’s where the difficulty is. Though I would seriously question the practical applicability of such an assumption.