I buy the “problems can be both” argument in principle. However, when a problem involves both, it seems like we have to solve the outer part of the problem (i.e. figure out what-we-even-want), and once that’s solved, all that’s left for inner alignment is imperfect-optimizer-exploitation. The reverse does not apply: we do not necessarily have to solve the inner alignment issue (other than the imperfect-optimizer-exploiting part) at all. I also think a version of this argument probably carries over even if we’re thinking about optimization-under-uncertainty, although I’m still not sure exactly what that would mean.
In other words: if a problem is both, then it is useful to think of it as an outer alignment problem (because that part has to be solved regardless), and not also inner alignment (because only a narrower version of that part necessarily has to be solved). In the Dr Nefarious example, the outer misalignment causes the inner misalignment in some important sense—correcting the outer problem fixes the inner problem, but patching the inner problem would leave an outer objective which still isn’t what we want.
I’d be interested in a more complete explanation of what optimization-under-uncertainty would mean, other than to take an expectation (or max/min, quantile, etc) to convert it into a deterministic optimization problem.
I’m not sure the optimization vs optimization-under-uncertainty distinction is actually all that central, though. Intuitively, the reason an objective isn’t well-defined without the data/prior is that the data/prior defines the ontology, or defines what the things-in-the-objective are pointing to (in the pointers-to-values sense) or something along those lines. If the objective function is f(X, Y), then the data/prior are what point “X” and “Y” at some things in the real world. That’s why the objective function cannot be meaningfully separated from the data/prior: “f(X, Y)” doesn’t mean anything, by itself.
But I could imagine the pointer-aspect of the data/prior could somehow be separated from the uncertainty-aspect. Obviously this would require a very different paradigm from either today’s ML or Bayesianism, but if those pieces could be separated, then I could imagine a notion of inner alignment (and possibly also something like robust generalization) which talks about both optimization and uncertainty, plus a notion of outer alignment which just talks about the objective and what it points to. In some ways, I actually like that formulation better, although I’m not clear on exactly what it would mean.
Trying to lay this disagreement out plainly:

According to you, the inner alignment problem should apply to well-defined optimization problems, meaning optimization problems which have been given all the pieces needed to score domain items. Within this frame, the only reasonable definition is “inner” = issues of imperfect search, “outer” = issues of objective (which can include the prior, the utility function, etc).
According to me/Evan, the inner alignment problem should apply to optimization under uncertainty, which is a notion of optimization where you don’t have enough information to really score domain items. In this frame, it seems reasonable to point to the way the algorithm tries to fill in the missing information as the location of “inner optimizers”. This “way the algorithm tries to fill in missing info” has to include properties of the search, so we roll search+prior together into “inductive bias”.
I take your argument to have been:
The strength of well-defined optimization as a natural concept;
The weakness of any factorization which separates elements like prior, data, and loss function, because we really need to consider these together in order to see what task is being set for an ML system (Dr Nefarious demonstrates that the task “prediction” becomes the task “create a catastrophe” if prediction is pointed at the wrong data);
The idea that my/Evan/Paul’s concern about priors will necessarily be addressed by outer alignment, and so does not need to be solved separately.
Your crux is, can we factor ‘uncertainty’ from ‘value pointer’ such that the notion of ‘value pointer’ contains all (and only) the outer alignment issues? In that case, you could come around to optimization-under-uncertainty as a frame.
I take my argument to have been:
The strength of optimization-under-uncertainty as a natural concept (I argue it is more often applicable than well-defined optimization);
The naturalness of referring to problems involving inner optimizers under one umbrella “inner alignment problem”, whether or not Dr Nefarious is involved;
The idea that the malign-prior problem has to be solved in itself whether we group it as an “inner issue” or an “outer issue”;
For myself in particular, I’m ok with some issues-of-prior, such as Dr Nefarious, ending up as both inner alignment and outer alignment in a classification scheme (not overjoyed, but ok with it).
My crux would be, does a solution to outer alignment (in the intuitive sense) really imply a solution to exorcising mesa-optimizers from a prior (in the sense relevant to eliminating them from perfect search)?
It might also help if I point out that well-defined-optimization vs optimization-under-uncertainty is my current version of the selection/control distinction.
In any case, I’m pretty won over by the uncertainty/pointer distinction. I think it’s similar to the capabilities/payload distinction Jessica has mentioned. This combines search and uncertainty (and any other generically useful optimization strategies) into the capabilities.
But I would clarify that, wrt the ‘capabilities’ element, there seem to be mundane capabilities questions and then inner optimizer questions. IE, we might broadly define “inner alignment” to include all questions about how to point ‘capabilities’ at ‘payload’, but if so, I currently think there’s a special subset of ‘inner alignment’ which is about mesa-optimizers. (Evan uses the term ‘inner alignment’ for mesa-optimizer problems, and ‘objective-robustness’ for broader issues of reliably pursuing goals, but he also uses the term ‘capability robustness’, suggesting he’s not lumping all of the capabilities questions under ‘objective robustness’.)
This is a good summary.

I’m still some combination of confused and unconvinced about optimization-under-uncertainty. Some points:
It feels like “optimization under uncertainty” is not quite the right name for the thing you’re trying to point to with that phrase, and I think your explanations would make more sense if we had a better name for it.
The examples of optimization-under-uncertainty from your other comment do not really seem to be about uncertainty per se, at least not in the usual sense, whereas the Dr Nefarious example and malignness of the universal prior do.
Your examples in the other comment do feel closely related to your ideas on learning normativity, whereas inner agency problems do not feel particularly related to that (or at least not any more so than anything else is related to normativity).
It does seem like there’s an important sense in which inner agency problems are about uncertainty, in a way which could potentially be factored out, but that seems less true of the examples in your other comment. (Or to the extent that it is true of those examples, it seems true in a different way than the inner agency examples.)
The pointers problem feels more tightly entangled with your optimization-under-uncertainty examples than with inner agency examples.
… so I guess my main gut-feel at this point is that it does seem very plausible that uncertainty-handling (and inner agency with it) could be factored out of goal-specification (including pointers), but this particular idea of optimization-under-uncertainty seems like it’s capturing something different. (Though that’s based on just a handful of examples, so the idea in your head is probably quite different from what I’ve interpolated from those examples.)
On a side note, it feels weird to be the one saying “we can’t separate uncertainty-handling from goals” and you saying “ok but it seems like goals and uncertainty could somehow be factored”. Usually I expect you to be the one saying uncertainty can’t be separated from goals, and me to say the opposite.
Your examples in the other comment do feel closely related to your ideas on learning normativity, whereas inner agency problems do not feel particularly related to that (or at least not any more so than anything else is related to normativity).
Could you elaborate on that? I do think that learning-normativity is more about outer alignment. However, some ideas might cross-apply.
It feels like “optimization under uncertainty” is not quite the right name for the thing you’re trying to point to with that phrase, and I think your explanations would make more sense if we had a better name for it.
Well, it still seems like a good name to me, so I’m curious what you are thinking here. What name would communicate better?
It does seem like there’s an important sense in which inner agency problems are about uncertainty, in a way which could potentially be factored out, but that seems less true of the examples in your other comment. (Or to the extent that it is true of those examples, it seems true in a different way than the inner agency examples.)
Again, I need more unpacking to be able to say much (or update much).
The pointers problem feels more tightly entangled with your optimization-under-uncertainty examples than with inner agency examples.
Well, the optimization-under-uncertainty frame is an attempt to make a frame which can contain both, so this isn’t necessarily a problem… but I am curious what feels non-tight about inner agency.
… so I guess my main gut-feel at this point is that it does seem very plausible that uncertainty-handling (and inner agency with it) could be factored out of goal-specification (including pointers), but this particular idea of optimization-under-uncertainty seems like it’s capturing something different. (Though that’s based on just a handful of examples, so the idea in your head is probably quite different from what I’ve interpolated from those examples.)
On a side note, it feels weird to be the one saying “we can’t separate uncertainty-handling from goals” and you saying “ok but it seems like goals and uncertainty could somehow be factored”. Usually I expect you to be the one saying uncertainty can’t be separated from goals, and me to say the opposite.
I still agree with the hypothetical me making the opposite point ;p The problem is that certain things are being conflated, so both “uncertainty can’t be separated from goals” and “uncertainty can be separated from goals” have true interpretations. (I have those interpretations clear in my head, but communication is hard.)
OK, so.
My sense of our remaining disagreement…
We agree that the pointers/uncertainty could be factored (at least informally—currently waiting on any formalism).
You think “optimization under uncertainty” is doing something different, and I think it’s doing something close.
Specifically, I think “optimization under uncertainty” importantly is not necessarily best understood as the standard Bayesian thing where we (1) start with a utility function, (2) provide a prior, so that we can evaluate expected value (and 2.5, update on any evidence), (3) provide a search method, so that we solve the whole thing by searching for the highest-expectation element. Many examples of optimization-under-uncertainty strain this model. Probably the pointer/uncertainty model would do a better job in these cases. But, the Bayesian model is kind of the only one we have, so we can use it provisionally. And when we do so, the approximation of pointer-vs-uncertainty that comes out is:
Pointer: The utility function.
Uncertainty: The search plus the prior, which in practice can blend together into “inductive bias”.
This isn’t perfect, by any means, but, I’m like, “this isn’t so bad, right?”
I mean, I think this approximation is very not-good for talking about the pointers problem. But I think it’s not so bad for talking about inner alignment.
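As a toy sketch of that Bayesian reading (the specific states, actions, and numbers here are just placeholders of my own): the utility function plays the role of the pointer, while the prior plus the search over candidates together play the role of the uncertainty.

```python
# Pointer: the utility function -- it says what we want about world states.
def utility(world_state, action):
    return 1.0 if world_state == action else 0.0   # toy "guess the hidden state" task

# Uncertainty: the prior over world states (plus, below, the search over actions).
prior = {"A": 0.6, "B": 0.3, "C": 0.1}

def expected_utility(action):
    return sum(p * utility(state, action) for state, p in prior.items())

# Search: pick the highest-expectation element. Here the search is exhaustive;
# in realistic cases it is imperfect, and search + prior blur into "inductive bias".
actions = ["A", "B", "C"]
best = max(actions, key=expected_utility)
print(best, expected_utility(best))                # -> A 0.6
```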
I almost want to suggest that we hold off on trying to resolve this, and first, I write a whole post about “optimization under uncertainty” which clarifies the whole idea and argues for its centrality. However, I kind of don’t have time for that atm.
However, when a problem involves both, it seems like we have to solve the outer part of the problem (i.e. figure out what-we-even-want), and once that’s solved, all that’s left for inner alignment is imperfect-optimizer-exploitation. The reverse does not apply: we do not necessarily have to solve the inner alignment issue (other than the imperfect-optimizer-exploiting part) at all.
The way I’m currently thinking of things, I would say the reverse also applies in this case.
We can turn optimization-under-uncertainty into well-defined optimization by assuming a prior. The outer alignment problem (in your sense) involves getting the prior right. Getting the prior right is part of “figuring out what we want”. But this is precisely the source of the inner alignment problems in the Paul/Evan sense: Paul was pointing out a previously neglected issue about the Solomonoff prior, and Evan is talking about inductive biases of machine learning algorithms (which is sort of like the combination of a prior and imperfect search).
So both you and Evan and Paul are agreeing that there’s this problem with the prior (/ inductive biases). It is distinct from other outer alignment problems (because we can, to a large extent, factor the problem of specifying an expected value calculation into the problem of specifying probabilities and the problem of specifying a value function / utility function / etc). Everyone would seem to agree that this part of the problem needs to be solved. The disagreement is just about whether to classify this part as “inner” and/or “outer”.
What is this problem like? Well, it’s broadly a quality-of-prior problem, but it has a different character from other quality-of-prior problems. For the most part, the quality of priors can be understood by thinking about average error being low, or mistakes becoming infrequent, etc. However, here, this kind of thinking isn’t sufficient: we are concerned with rare but catastrophic errors. Thinking about these things, we find ourselves thinking in terms of “agents inside the prior” (or agents being favored by the inductive biases).
To what extent “agents in the prior” should be lumped together with “agents in imperfect search”, I am not sure. But the term “inner optimizer” seems relevant.
I’d be interested in a more complete explanation of what optimization-under-uncertainty would mean, other than to take an expectation (or max/min, quantile, etc) to convert it into a deterministic optimization problem.
A good example of optimization-under-uncertainty that doesn’t look like that (at least, not overtly) is most applications of gradient descent.
The true objective is not well-defined. IE, machine learning people generally can’t write down an objective function which (a) spells out what they want, and (b) can be evaluated. (What you want is generalization accuracy for the presently-unknown deployment data.)
So, machine learning people create proxies to optimize. Training data is the start, but then you add regularizing terms to penalize complex theories.
But none of these proxies is the full expected value (ie, expected generalization accuracy). If we could compute the full expected value, we probably wouldn’t be searching for a model at all! We would just use the EV calculations to make the best decision for each individual case.
So you can see, we can always technically turn optimization-under-uncertainty into a well-defined optimization by providing a prior, but, this is usually so impractical that ML people often don’t even consider what their prior might be. Even if you did write down a prior, you’d probably have to do ordinary ML search to approximate that. Which goes to show that it’s pretty hard to eliminate the non-EV versions of optimization-under-uncertainty; if you try to do real EV, you end up using non-EV methods anyway, to approximate EV.
The fact that we’re not really optimizing EV, in typical applications of gradient descent, explains why methods like early stopping or dropout (or anything else that messes with the ability of gradient descent to optimize the given objective) might be useful. Otherwise, you would only expect to use modifications if they helped the search find higher-value items. But in real cases, we sometimes prefer items that have a lower score on our proxy, when the-way-we-got-that-item gives us other reason to expect it to be good (early stopping being the clearest example of this).
This in turn means we don’t even necessarily convert our problem to a real, solidly defined optimization problem, ever. We can use algorithms like gradient-descent-with-early-stopping just “because they work well” rather than because they optimize some specific quantity we can already compute.
Which also complicates your argument, since if we’re never converting things to well-defined optimization problems, we can’t factor things into “imperfect search problems” vs “alignment given perfect search”—because we’re not really using search algorithms (in the sense of algorithms designed to get the maximum value), we’re using algorithms with a strong family resemblance to search, but which may have a few overtly-suboptimal kinks thrown in because those kinks tend to reduce Goodharting.
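To make the early-stopping point concrete, here is a minimal sketch (the toy data and numbers are just placeholders of my own): gradient descent runs on a proxy objective (training loss plus an L2 penalty), but the model we actually keep is chosen by validation loss, so we may deliberately keep an iterate that scores worse on the proxy.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_val = 10, 15, 100
X_train, X_val = rng.normal(size=(n_train, d)), rng.normal(size=(n_val, d))
true_w = rng.normal(size=d)
y_train = X_train @ true_w + rng.normal(scale=1.0, size=n_train)
y_val = X_val @ true_w + rng.normal(scale=1.0, size=n_val)

def proxy_loss(w):
    # The proxy we hand to gradient descent: training error plus a regularizer.
    return np.mean((X_train @ w - y_train) ** 2) + 0.01 * np.sum(w ** 2)

def val_loss(w):
    # Stand-in for what we actually care about (generalization to unseen data).
    return np.mean((X_val @ w - y_val) ** 2)

w = np.zeros(d)
best_w, best_val = w.copy(), val_loss(w)
for step in range(2000):
    grad = 2 * X_train.T @ (X_train @ w - y_train) / n_train + 0.02 * w
    w -= 0.01 * grad
    if val_loss(w) < best_val:
        # Early stopping: keep the iterate that looks best on held-out data,
        # even if later iterates score better on the proxy objective.
        best_w, best_val = w.copy(), val_loss(w)

print("proxy:", proxy_loss(best_w), "vs", proxy_loss(w))
print("val:  ", val_loss(best_w), "vs", val_loss(w))
```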
In principle, a solution to an optimization-under-uncertainty problem needn’t look like search at all.
Ah, here’s an example: online convex optimization. It’s a solid example of optimization-under-uncertainty, but, not necessarily thought of in terms of a prior and an expectation.
So optimization-under-uncertainty doesn’t necessarily reduce to optimization.
I claim it’s usually better to think about optimization-under-uncertainty in terms of regret bounds, rather than reduce it to maximization. (EG this is why Vanessa’s approach to decision theory is superior.)
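For concreteness, here is a bare-bones sketch of the online-convex-optimization framing (the losses are toy placeholders of my own): there is no prior and no expectation anywhere; the learner plays a point each round, a convex loss arrives, and performance is judged by regret against the best fixed point in hindsight.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 200, 3
targets = rng.normal(size=(T, d))     # the sequence of losses, revealed one round at a time

def loss(x, t):
    # Convex loss for round t: squared distance to that round's target.
    return 0.5 * np.sum((x - targets[t]) ** 2)

def grad(x, t):
    return x - targets[t]

x = np.zeros(d)
total_loss = 0.0
for t in range(T):
    total_loss += loss(x, t)                       # we pay the loss before seeing it
    x -= (1.0 / np.sqrt(t + 1)) * grad(x, t)       # online gradient descent update

# Regret: our cumulative loss minus that of the best *fixed* point in hindsight
# (for squared loss, the best fixed point is just the mean of the targets).
best_fixed = targets.mean(axis=0)
best_fixed_loss = sum(loss(best_fixed, t) for t in range(T))
print("regret:", total_loss - best_fixed_loss)     # should grow sublinearly in T
```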
I’m not sure the optimization vs optimization-under-uncertainty distinction is actually all that central, though. Intuitively, the reason an objective isn’t well-defined without the data/prior is that the data/prior defines the ontology, or defines what the things-in-the-objective are pointing to (in the pointers-to-values sense) or something along those lines. If the objective function is f(X, Y), then the data/prior are what point “X” and “Y” at some things in the real world. That’s why the objective function cannot be meaningfully separated from the data/prior: “f(X, Y)” doesn’t mean anything, by itself.
But I could imagine the pointer-aspect of the data/prior could somehow be separated from the uncertainty-aspect. Obviously this would require a very different paradigm from either today’s ML or Bayesianism, but if those pieces could be separated, then I could imagine a notion of inner alignment (and possibly also something like robust generalization) which talks about both optimization and uncertainty, plus a notion of outer alignment which just talks about the objective and what it points to. In some ways, I actually like that formulation better, although I’m not clear on exactly what it would mean.
These remarks generally make sense to me. Indeed, I think the ‘uncertainty-aspect’ and the ‘search aspect’ would be rolled up into one, since imperfect search falls under the uncertainty aspect (being logical uncertainty). We might not even be able to point to which parts are prior vs search… as with “inductive bias” in ML. So inner alignment problems would always be “the uncertainty is messed up”—forcibly unifying your search-oriented view on daemons w/ Evan’s prior-oriented view. More generally, we could describe the ‘uncertainty’ part as where ‘capabilities’ live.
Naturally, this strikes me as related to what I’m trying to get at with optimization-under-uncertainty. An optimization-under-uncertainty algorithm takes a pointer, and provides all the ‘uncertainty’.
But I don’t think it should quite be about separating the pointer-aspect and the uncertainty-aspect. The uncertainty aspect has what I’ll call “mundane issues” (eg, does it converge well given evidence, does it keep uncertainty broad w/o evidence) and “extraordinary issues” (inner optimizers). Mundane issues can be investigated with existing statistical tools/concepts. But the extraordinary issues seem to require new concepts. The mundane issues have to do with things like averages and limit frequencies. The extraordinary issues have to do with one-time events.
The true heart of the problem is these “extraordinary issues”.