adamShimi comments on “Inner Alignment Failures” Which Are Actually Outer Alignment Failures

adamShimi 7 Nov 2020 20:05 UTC
LW: 19 AF: 7
AF
After discussing with Evan, he has a clearer decomposition in mind:
- Outer Alignment is according to his definition in this comment, that originally comes from this post: an objective function is aligned if all models optimal for it, after perfect optimization on infinite data, are aligned (including in the deployment phase). The salient point is the hypothesis of infinite data, which thus requires no generalization and no inductive bias consideration.
- Assuming Outer Alignment, there is still the problem of 2D-Robustness, as presented in this post. Basically, even with the right objective, the result of training might fail to generalize in the deployment environment, for one of two reasons: capability failure, where the model learned is just nowhere near optimal for the objective, and thus having the right objective doesn’t help; and objective failure, where a competent model is learned, but because of the lack of infinite data, it learns a different objective that also works in the training phase.
  So 2D-Robustness splits into Capability-Robustness and Objective-Robustness
- Within Objective Robustness, there is a specific type of robustness if the learned system is an explicit optimizer: Inner Alignment (as defined in Risks from Learned Optimization)
Note that the names used here are not necessarily how we should call these properties, they’re just the names from the different posts defining them.
Evan thinks, and I agree, that many of the confusion in this post’s comment comes from people having different definitions for inner alignment, and notably thinking that outer Alignment + Inner Alignment = Alignment, which is only the case for mesa-optimization; the general equation is instead Outer Alignment + 2D-Robustness = Alignment. For example, in my own comment, what I define as Inner Alignment is closer to 2D-Robustness.
Maybe the next step would be to check if that decomposition makes sense, and then try to find sensible names to use when discussing these issues.
- johnswentworth 7 Nov 2020 21:29 UTC
  LW: 7 AF: 1
  AF Parent
  The salient point is the hypothesis of infinite data, which thus requires no generalization and no inductive bias consideration.
  So, I didn’t comment on this earlier because it was orthogonal to the discussion at hand, but this is just wrong. Indeed, the fact that it’s wrong is the primary reason that generalization error is a problem in the first place.
  If the training data is taken from one distribution, and deployment exposes the system to a different distribution, then infinite data will not solve that problem. It does not matter how many data points we have, if they’re from a different distribution than deployment. Generalization error is not about number of data points, it is about a divergence between the process which produced the data points and the deployment environment.
  That said, while this definition needs some work, I do think the decomposition is sensible.
  - adamShimi 7 Nov 2020 21:43 UTC
    LW: 3 AF: 3
    AF Parent
    You’re absolutely right, but that’s not what I meant by this sentence, nor what Evan thinks.
    Here “infinite data” literally means having the data for the training environment and the deployment environment. It means that there is no situation where the system sees some input that was not available during training, because every possible input appears during training. This is obviously impossible to do in practice, but it allow the removal of inductive bias consideration at the theoretical level.
    - johnswentworth 8 Nov 2020 0:05 UTC
      LW: 2 AF: 1
      AF Parent
      Here “infinite data” literally means having the data for the training environment and the deployment environment.
      This also doesn’t work—there’s still a degree of freedom in how much of the data is from deployment, and how much from training. Could be 25% training distribution, could be 98% training distribution, and those will produce different optimal strategies. Heck, we could construct it in such a way that there’s infinite data from both but the fraction of data from deployment goes to zero as data size goes to infinity. In that case, the optimal policy in the limit would be exactly what’s optimal on the training distribution.
      When we’re optimizing for averages, it doesn’t just matter whether we’ve ever seen a particular input; it matters how often. The system is going to trade off better performance on more-often-seen data points for worse performance on less-often-seen data points.
      - evhub 8 Nov 2020 5:21 UTC
        LW: 2 AF: 2
        AF Parent
        That’s only true for RL—for SL, perfect loss requires being correct on every data point, regardless of how often it shows up in the distribution. For RL, that’s not true, but for RL we can just say that we’re talking about the optimal policy on the MDP that the model will actually encounter over its existence.
        johnswentworth 8 Nov 2020 6:45 UTC
        LW: 3 AF: 2
        AF Parent
        for SL, perfect loss requires being correct on every data point, regardless of how often it shows up in the distribution
        This is only true if identical data points always have the same label. To the extent that’s true in real data sets, it’s purely an artifact of finite data, and is almost certainly not true of the underlying process.
        Suppose I feed a system MRI data and labels for whether each patient has cancer, and train the system to predict cancer. MRI images are very high dimensional, so the same image will probably never occur twice in the data set; thus the system can be correct on every datapoint (in training). But if the data were actually infinite, this would fall apart—images would definitely repeat, and they would probably not have the same label every time, because an MRI image does not actually have enough information in it to 100% perfectly predict whether a patient has cancer. Given a particular image, the system will thus have to choose whether this image more often comes from a patient with or without cancer. And if that frequency differs between train and deploy environment, then we have generalization error.
        And this is not just an example of infinities doing weird things which aren’t relevant in practice, because real supervised learners do not learn every possible function. They have inductive biases—some of the learning is effectively unsupervised or baked in by priors. Indeed, in terms of bits of information, the exponentially vast majority of the learning is unsupervised or baked in by priors, and that will always be the case. The argument above thus applies not only to any pair of identical images, but any pair of images which are treated the same by the “unsupervised/inaccessible dimensions” of the learner.
        Taking more of an outside view… if we’re imagining a world in which the “true” label is a deterministic function of the input, and that deterministic function is in the supervised learner’s space, then our model has thrown away everything which makes generalization error a problem in practice. It’s called “robustness to distribution shift” for a reason.
        evhub 8 Nov 2020 8:53 UTC
        LW: 2 AF: 2
        AF Parent
        
        if we’re imagining a world in which the “true” label is a deterministic function of the input, and that deterministic function is in the supervised learner’s space, then our model has thrown away everything which makes generalization error a problem in practice.
        
        Yes—that’s the point. I’m trying to define outer alignment so I want the definition to get rid of generalization issues.
        johnswentworth 8 Nov 2020 16:53 UTC
        1 point
        AF Parent
        We want a definition which separates out generalization issues, not a definition which only does what we want in situations where generalization issues never happened in the first place.
        If we define outer alignment as (paraphrasing) “good performance assuming we’ve seen every data point once”, then that does not separate out generalization issues. If we’re in a situation where seeing every data point is not sufficient to prevent generalization error (i.e. most situations where generalization error actually occurs), then that definition will classify generalization error as an outer alignment error.
        To put it differently: when choosing definitions of things like “outer alignment”, we do not get to pick what-kind-of-problem we’re facing. The point of the exercise is to pick definitions which carve up the world under a range of architectures, environments, and problems.