So, I think the other answers here are adequate, but not super satisfying. Here is my attempt.
The frame of “generalization failures” naturally primes me (and perhaps others) to think of ML as hunting for useful patterns, but instead fitting to noise. While pseudo-alignment is certainly a type of generalization failure, it has different connotations: that of a system which has “correctly learned” (in the sense of internalizing knowledge for its own use), but still does not perform as intended.
The mesa-optimizers paper defines inner optimizers as performing “search”. I think there are some options here and we can define things slightly differently.
In Selection vs Control, I split “optimization” up into two types: “selection” (which includes search, and also weaker forms of selection such as mere sampling bias), and “control” (which implies actively steering the world in a direction, but doesn’t always imply search, EG in the case of a thermostat).
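To make the distinction concrete, here is a minimal sketch (my own illustration, not from either post): a "selector" enumerates candidates and picks the best, while a "controller" like a thermostat steers an ongoing process toward a target without ever considering alternatives.

```python
# Hypothetical illustration of "selection" vs "control".

# Selection: evaluate many candidates and pick the best one (search).
def select(candidates, objective):
    """Consider each option and return the highest-scoring one."""
    return max(candidates, key=objective)

# Control: steer an ongoing process toward a target via feedback,
# like a thermostat -- no enumeration of alternatives involved.
def thermostat_step(temperature, target=20.0, gain=0.5):
    """Nudge the temperature toward the target by a fraction of the gap."""
    return temperature + gain * (target - temperature)

# Selection finds the candidate closest to 20 among a fixed list...
best = select([12.0, 19.0, 25.0], objective=lambda t: -abs(t - 20.0))

# ...while control converges on 20 through repeated feedback.
temp = 5.0
for _ in range(30):
    temp = thermostat_step(temp)
```

Both reliably produce outcomes near the target, which is why both count as "optimization" in my sense, yet only the first involves anything like search.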
In mesa-search vs mesa-control, I applied this distinction to mesa-optimization, arguing that mesa-optimizers which do not use search could still present a danger.
Mesa-controllers are a form of inner optimizers which reliably steer the world in a particular direction. These are distinguished from ‘mere’ generalization failure because generalization failures do not usually have such an impact. If we define pseudo-alignment in this way, you could say we are defining it by its impact. Clearly, more impactful generalization failures are more concerning. However, you might think it’s a little weird to invent entirely new terminology for this case, rather than referring to it as “impactful generalization failures”.
Mesa-searchers are a form of inner optimizers characterized by performing internal search. You could say that they’re clearly computing something coherent, just not what was desired (which may not be the case for ‘mere’ generalization failures). These are more clearly a distinct phenomenon, particularly if the intended behavior didn’t involve search. (It would seem odd to call them “searching generalization failures” imho.) But the safety concerns are less directly obvious.
It’s only when we put these two together that we have something both distinct and of safety concern. Looking for “impactful generalization failures” gives us relatively little to grab onto. But it’s particularly plausible that mesa-searchers will also be mesa-controllers, because the machinery for complex planning is present. So, this combination might be particularly worth thinking about.
> we’ll see this appear mathematically in the definition of the property or in theorems about it, whether or not we have explicitly considered the possibility of mesa-optimizers. (I suppose the argument could be that some candidate safety properties implicitly assume no optimum is a mesa-optimizer, and thus appear to apply to all optima while not really doing so—somewhat analogous to early notions of continuity which implicitly assumed away the Weierstrass function. But if so, I need a real example of such a case to convince me.)
I tend to agree with this line of thinking. IE, it seems intuitive to me that highly robust alignment technology would rely on arguments that don’t explicitly mention inner optimization anywhere, because those failure modes are ruled out via the same general arguments which rule out other failure modes. However, it also seems plausible to me that it’s useful to think about inner alignment along the way.
You wanted a convincing example. I think The Solomonoff Prior Is Malign could be such an example. Before becoming aware of this argument, it seemed pretty plausible to me that the Solomonoff prior described a kind of rational ideal for induction. This isn’t a full “here’s a safety argument that would go through if we assumed no-inner-optimizers”, but in a parallel universe where we have access to infinite computation, it could be close to that. (EG, someone could argue that the chance of Solomonoff Induction resulting in generalization failures is very low, and then change their mind when they hear the Solomonoff-is-malign argument.)
Also, it seems highly plausible to me that inner alignment is a useful thing to have in mind for “less than highly robust” alignment approaches (approaches which seek to grab the lower-hanging fruit of alignment research, to create systems that are aligned in the worlds where achieving alignment isn’t so hard after all). These approaches can, for example, employ heuristics which make it somewhat unlikely that inner optimizers will emerge. I’m not very interested in that type of alignment research, because it seems to me that alignment technology needs to be backed up with rather tight arguments in order to have any realistic chance of working; but it makes sense for some people to think about that sort of thing.