Trying to lay this disagreement out plainly:
According to you, the inner alignment problem should apply to well-defined optimization problems, meaning optimization problems which have been given all the pieces needed to score domain items. Within this frame, the only reasonable definition is “inner” = issues of imperfect search, “outer” = issues of objective (which can include the prior, the utility function, etc).
According to me/Evan, the inner alignment problem should apply to optimization under uncertainty, which is a notion of optimization where you don’t have enough information to really score domain items. In this frame, it seems reasonable to point to the way the algorithm tries to fill in the missing information as the location of “inner optimizers”. This “way the algorithm tries to fill in missing info” has to include properties of the search, so we roll search+prior together into “inductive bias”.
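As a toy illustration of the contrast between these two frames (a hedged sketch; the function names and the toy domain are invented for this example, not anything from the discussion):

```python
# Toy contrast between the two notions of optimization discussed above.
# All names here are illustrative inventions, not standard terminology.

def well_defined_optimize(domain, score):
    """Well-defined optimization: `score` can evaluate every domain item.
    In this frame, things go wrong either via imperfect search ("inner")
    or via a bad `score` ("outer")."""
    return max(domain, key=score)

def optimize_under_uncertainty(domain, partial_score, fill_in):
    """Optimization under uncertainty: `partial_score` alone cannot rank
    domain items; `fill_in` (prior plus search heuristics, i.e. the
    "inductive bias") supplies the missing information. On this frame,
    inner-alignment worries live in `fill_in`."""
    return max(domain, key=lambda x: partial_score(x, fill_in(x)))
```

For instance, `well_defined_optimize(range(10), lambda x: -(x - 3) ** 2)` has everything it needs to pick 3, while `optimize_under_uncertainty(range(10), lambda x, t: -(x - t) ** 2, lambda x: 7)` picks 7 only because the fill-in step guessed the missing target.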
I take your argument to have been:
The strength of well-defined optimization as a natural concept;
The weakness of any factorization which separates elements like prior, data, and loss function, because we really need to consider these together in order to see what task is being set for an ML system (Dr Nefarious demonstrates that the task “prediction” becomes the task “create a catastrophe” if prediction is pointed at the wrong data);
The idea that my/Evan/Paul's concern about priors will necessarily be addressed by outer alignment, so it does not need to be solved separately.
Your crux is, can we factor ‘uncertainty’ from ‘value pointer’ such that the notion of ‘value pointer’ contains all (and only) the outer alignment issues? In that case, you could come around to optimization-under-uncertainty as a frame.
I take my argument to have been:
The strength of optimization-under-uncertainty as a natural concept (I argue it is more often applicable than well-defined optimization);
The naturalness of referring to problems involving inner optimizers under one umbrella “inner alignment problem”, whether or not Dr Nefarious is involved;
The idea that the malign-prior problem has to be solved in itself whether we group it as an “inner issue” or an “outer issue”;
For myself in particular, I’m ok with some issues-of-prior, such as Dr Nefarious, ending up as both inner alignment and outer alignment in a classification scheme (not overjoyed, but ok with it).
My crux would be, does a solution to outer alignment (in the intuitive sense) really imply a solution to exorcising mesa-optimizers from a prior (in the sense relevant to eliminating them from perfect search)?
It might also help if I point out that well-defined-optimization vs optimization-under-uncertainty is my current version of the selection/control distinction.
In any case, I’m pretty won over by the uncertainty/pointer distinction. I think it’s similar to the capabilities/payload distinction Jessica has mentioned. This combines search and uncertainty (and any other generically useful optimization strategies) into the capabilities.
But I would clarify that, wrt the ‘capabilities’ element, there seem to be mundane capabilities questions and then inner optimizer questions. IE, we might broadly define “inner alignment” to include all questions about how to point ‘capabilities’ at ‘payload’, but if so, I currently think there’s a special subset of ‘inner alignment’ which is about mesa-optimizers. (Evan uses the term ‘inner alignment’ for mesa-optimizer problems, and ‘objective-robustness’ for broader issues of reliably pursuing goals, but he also uses the term ‘capability robustness’, suggesting he’s not lumping all of the capabilities questions under ‘objective robustness’.)
This is a good summary.
I’m still some combination of confused and unconvinced about optimization-under-uncertainty. Some points:
It feels like “optimization under uncertainty” is not quite the right name for the thing you’re trying to point to with that phrase, and I think your explanations would make more sense if we had a better name for it.
The examples of optimization-under-uncertainty from your other comment do not really seem to be about uncertainty per se, at least not in the usual sense, whereas the Dr Nefarious example and the malignness of the universal prior do.
Your examples in the other comment do feel closely related to your ideas on learning normativity, whereas inner agency problems do not feel particularly related to that (or at least not any more so than anything else is related to normativity).
It does seem like there’s an important sense in which inner agency problems are about uncertainty, in a way which could potentially be factored out, but that seems less true of the examples in your other comment. (Or to the extent that it is true of those examples, it seems true in a different way than the inner agency examples.)
The pointers problem feels more tightly entangled with your optimization-under-uncertainty examples than with inner agency examples.
… so I guess my main gut-feel at this point is that it does seem very plausible that uncertainty-handling (and inner agency with it) could be factored out of goal-specification (including pointers), but this particular idea of optimization-under-uncertainty seems like it’s capturing something different. (Though that’s based on just a handful of examples, so the idea in your head is probably quite different from what I’ve interpolated from those examples.)
On a side note, it feels weird to be the one saying “we can’t separate uncertainty-handling from goals” and you saying “ok but it seems like goals and uncertainty could somehow be factored”. Usually I expect you to be the one saying uncertainty can’t be separated from goals, and me to say the opposite.
Your examples in the other comment do feel closely related to your ideas on learning normativity, whereas inner agency problems do not feel particularly related to that (or at least not any more so than anything else is related to normativity).

Could you elaborate on that? I do think that learning-normativity is more about outer alignment. However, some ideas might cross-apply.

It feels like “optimization under uncertainty” is not quite the right name for the thing you’re trying to point to with that phrase, and I think your explanations would make more sense if we had a better name for it.

Well, it still seems like a good name to me, so I’m curious what you are thinking here. What name would communicate better?

It does seem like there’s an important sense in which inner agency problems are about uncertainty, in a way which could potentially be factored out, but that seems less true of the examples in your other comment.

Again, I need more unpacking to be able to say much (or update much).

The pointers problem feels more tightly entangled with your optimization-under-uncertainty examples than with inner agency examples.

Well, optimization-under-uncertainty is an attempt to make a frame which can contain both, so this isn’t necessarily a problem… but I am curious what feels non-tight about inner agency.

On a side note, it feels weird to be the one saying “we can’t separate uncertainty-handling from goals” and you saying “ok but it seems like goals and uncertainty could somehow be factored”.

I still agree with the hypothetical me making the opposite point ;p The problem is that certain things are being conflated, so both “uncertainty can’t be separated from goals” and “uncertainty can be separated from goals” have true interpretations. (I have those interpretations clear in my head, but communication is hard.)
OK, so.
My sense of our remaining disagreement…
We agree that pointers/uncertainty could be factored (at least informally; we're currently waiting on any formalism).
You think “optimization under uncertainty” is doing something different, and I think it’s doing something close.
Specifically, I think “optimization under uncertainty” importantly is not necessarily best understood as the standard Bayesian thing where we (1) start with a utility function, (2) provide a prior, so that we can evaluate expected value (and 2.5, update on any evidence), (3) provide a search method, so that we solve the whole thing by searching for the highest-expectation element. Many examples of optimization-under-uncertainty strain this model. Probably the pointer/uncertainty model would do a better job in these cases. But, the Bayesian model is kind of the only one we have, so we can use it provisionally. And when we do so, the approximation of pointer-vs-uncertainty that comes out is:
Pointer: The utility function.
Uncertainty: The search plus the prior, which in practice can blend together into “inductive bias”.
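The three-step Bayesian recipe above can be sketched in code (a minimal toy illustration with invented numbers, not a claim about any real system; the exhaustive `max` stands in for whatever search method is actually used):

```python
# Minimal sketch of the standard Bayesian recipe: (1) a utility function,
# (2) a prior over hypotheses (possibly already updated on evidence),
# (3) search over actions for the highest-expectation element.

def expected_utility(action, hypotheses, utility):
    """`hypotheses` is a dict mapping hypothesis -> probability
    (the prior, step 2). `utility` is the step-1 utility function."""
    return sum(p * utility(action, h) for h, p in hypotheses.items())

def bayes_optimize(actions, hypotheses, utility):
    """Step 3: pick the action with the highest expected utility.
    In the approximation above, the 'pointer' lives in `utility`, while
    the 'uncertainty' lives in `hypotheses` plus whatever real search
    procedure replaces this exhaustive max."""
    return max(actions, key=lambda a: expected_utility(a, hypotheses, utility))
```

With an invented toy setup, say `hypotheses = {"rain": 0.3, "sun": 0.7}` and payoffs of 1.0 for umbrella/rain, 0.4 for umbrella/sun, 0.0 for hat/rain, and 1.0 for hat/sun, the expected utilities come out 0.58 vs. 0.7, so the search selects "hat".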
This isn’t perfect, by any means, but, I’m like, “this isn’t so bad, right?”
I mean, I think this approximation is very not-good for talking about the pointers problem. But I think it’s not so bad for talking about inner alignment.
I almost want to suggest that we hold off on trying to resolve this, and first, I write a whole post about “optimization under uncertainty” which clarifies the whole idea and argues for its centrality. However, I kind of don’t have time for that atm.