Sammy Martin comments on Alignment By Default

Sammy Martin Aug 13, 2020, 12:45 PM
LW: 25 AF: 9
1
AF
I think what you’ve identified here is a weakness in the high-level, classic arguments for AI risk -
Overall, I’d give maybe a 10-20% chance of alignment by this path, assuming that the unsupervised system does end up with a simple embedding of human values. The main failure mode I’d expect, assuming we get the chance to iterate, is deception—not necessarily “intentional” deception, just the system being optimized to look like it’s working the way we want rather than actually working the way we want. It’s the proxy problem again, but this time at the level of humans-trying-things-and-seeing-if-they-work, rather than explicit training objectives.
This failure mode of deceptive alignment seems like it would result most easily from Mesa-optimisation or an inner alignment failure. Inner Alignment / Misalignment is possibly the key specific mechanism which fills a weakness in the ‘classic arguments’ for AI safety—the Orthogonality Thesis, Instrumental Convergence and Fast Progress together implying small separations between AI alignment and AI capability can lead to catastrophic outcomes. The question of why there would be such a damaging, hard-to-detect divergence between goals and alignment needs an answer to have a solid, specific reason to expect dangerous misalignment, and Inner Misalignment is just such a reason.
I think that it should be presented in initial introductions to AI risk alongside those classic arguments, as the specific, technical reason why the specific techniques we use are likely to produce such goal/capability divergence—rather than the general a priori reasons given by the classic arguments.
What links here?
- Categorizing failures as “outer” or “inner” misalignment is often confused by Rohin Shah (Jan 6, 2023, 3:48 PM; 93 points)
- johnswentworth Aug 13, 2020, 4:12 PM
  LW: 18 AF: 8
  AF Parent
  Personally, I think a more likely failure mode is just “you get what you measure”, as in Paul’s write up here. If we only know how to measure certain things which are not really the things we want, then we’ll be selecting for not-what-we-want by default. But I know at least some smart people who think that inner alignment is the more likely problem, so you’re in good company.
  - Sammy Martin Aug 13, 2020, 4:53 PM
    LW: 14 AF: 5
    AF Parent
    ‘You get what you measure’ (outer alignment failure) and Mesa optimisers (inner failure) are both potential gap fillers that explain why specifically the alignment/capability divergence initially arises. Whether it’s one or the other, I think the overall point is still that there is this gap in the classic arguments that allows for a (possibly quite high) chance of ‘alignment by default’, for the reasons you give, but there are at least 2 plausible mechanisms that fill this gap. And then I suppose my broader point would be that we should present:
    
    Classic Arguments —> objections to them (capability and alignment often go together, could get alignment by default) —> specific causal mechanisms for misalignment
    What links here?
    Sammy Martin's comment on Investigating AI Takeover Scenarios by Sammy Martin (Sep 17, 2021, 7:20 PM; 13 points)
  - Ben Pace Aug 14, 2020, 3:20 AM
    LW: 13 AF: 6
    AF Parent
    Am surprised you think that’s the main failure mode. I am fairly more concerned about failure through mesa optimisers taking a treacherous turn.
    I’m thinking we will be more likely to find sensible solutions to outer alignment, but have not much real clue about the internals, and then we’ll give them enough optimisation power to build super intelligent unaligned mesa optimisers, and then with one treacherous turn the game will be up.
    Why do you think inner alignment will be easier?
    - johnswentworth Aug 14, 2020, 4:32 PM
      LW: 9 AF: 2
      AF Parent
      Two arguments here. First, an outside-view argument: inner alignment problems should only crop up on a relatively narrow range of architectures/parameters. Second, an entirely separate inside-view argument: assuming that natural abstractions are a thing makes inner alignment failure look much less likely.
      Narrow range argument: inner alignment failure only applies to a specific range of architectures within a specific range of task parameters—for instance, we have to be optimizing for something, and there has to be lots of relevant variables observed only at runtime, and there has to be something like a “training” phase in which we lock-in parameter choices before runtime, and for the more disastrous versions we usually need divergence of the runtime distribution from the training distribution. It’s a failure mode which assumes that a whole lot of things look like today’s ML pipelines.
      On the other hand, the get-what-you-measure problem and its generalizations apply to any architecture, including tool AI, idealized Bayesian utility maximizers (i.e. the infinite data/compute regime), and (less obviously) human-mimicking systems.
      Natural abstractions argument: in an inner alignment failure, the outer optimizer is optimizing for $X$ , but the inner optimizer ends up pointed at some rough approximation $~ X$ . But if X is a natural abstraction, then this is far less likely to be a problem; we expect a wide range of predictive systems to all learn a basically-correct notion of $X$ , so there’s little reason for an inner optimizer to end up pointed at a rough approximation, especially if we’re leveraging transfer learning from some unsupervised learner.
      (It’s worth asking here why this argument doesn’t apply to the divergence of human goals from evolutionary fitness. A human only has ~30k genes, and each one has a fairly simple function—e.g. catalyze one chemical reaction or stabilize a structure or the like. That’s nowhere near enough to represent something like evolutionary fitness in the genome, especially when the large majority of those genes are already used for metabolism and body plan and whatnot. Modern ML, on the other hand, already operates in a range where insufficient degrees of freedom are far less likely to be a problem. Also, I’m currently unsure whether evolutionary fitness is a natural abstraction at all.)
      In general, if human values are a natural abstraction, then pointing to values is much harder than “learning” values. That means outer alignment is the problem more than inner alignment.
      - evhub Aug 14, 2020, 6:44 PM
        LW: 12 AF: 7
        AF Parent
        
        Natural abstractions argument: in an inner alignment failure, the outer optimizer is optimizing for X, but the inner optimizer ends up pointed at some rough approximation ~X. But if X is a natural abstraction, then this is far less likely to be a problem; we expect a wide range of predictive systems to all learn a basically-correct notion of X, so there’s little reason for an inner optimizer to end up pointed at a rough approximation, especially if we’re leveraging transfer learning from some unsupervised learner.
        
        This isn’t an argument against deceptive alignment, just proxy alignment—with deceptive alignment, the agent still learns X, it just does so as part of its world model rather than its objective. In fact, I think it’s an argument for deceptive alignment, since if X first crops up as a natural abstraction inside of your agent’s world model, that raises the question of how exactly it will get used in the agent’s objective function—and deceptive alignment is arguably one of the simplest, most natural ways for the base optimizer to get an agent that has information about the base objective stored in its world model to actually start optimizing for that model of the base objective.
        johnswentworth Aug 14, 2020, 7:32 PM
        LW: 4 AF: 2
        AF Parent
        I mostly agree with this. I don’t view deception as an inner alignment problem, though—for instance, it’s an issue in any approval-based setup even without an inner optimizer showing up. To the extent that it is an inner alignment issue, it involves generalization failure from the training distribution, which I also generally consider an outer alignment problem (i.e. training on a distribution which differs from the deploy environment generally means the system is not outer aligned, unless the architecture is somehow set up to make the distribution shift irrelevant).
        A useful criterion here: would the problem still happen if we just optimized over all the parameters simultaneously at runtime, rather than training offline first? If the problem would still happen, then it’s not really an inner alignment problem (at least not in the usual mesa-optimization sense).
        evhub Aug 14, 2020, 9:05 PM
        LW: 2 AF: 1
        AF Parent
        
        To the extent that it is an inner alignment issue, it involves generalization failure from the training distribution, which I also generally consider an outer alignment problem (i.e. training on a distribution which differs from the deploy environment generally means the system is not outer aligned, unless the architecture is somehow set up to make the distribution shift irrelevant).
        
        A useful criterion here: would the problem still happen if we just optimized over all the parameters simultaneously at runtime, rather than training offline first? If the problem would still happen, then it’s not really an inner alignment problem (at least not in the usual mesa-optimization sense).
        
        That’s certainly not how I would define inner alignment. In “Risks from Learned Optimization,” we just define it as the problem of aligning the mesa-objective (if one exists) with the base objective, which is entirely independent of whether or not there’s any sort of distinction between the training and deployment distributions and is fully consistent with something like online learning as you’re describing it.
        johnswentworth Aug 14, 2020, 9:36 PM
        LW: 3 AF: 2
        AF Parent
        The way I understood it, the main reason a mesa-optimizer shows up in the first place is that some information is available at runtime which is not available during training, so some processing needs to be done at runtime to figure out the best action given the runtime-info. The mesa-optimizer handles that processing. If we directly optimize over all parameters at runtime, then there’s no place for that to happen.
        What am I missing?
        evhub Aug 14, 2020, 10:30 PM
        LW: 3 AF: 2
        AF Parent
        Let’s consider the following online learning setup:
        
        At each timestep $t$ , $π_{θ_{t}}$ takes action $a_{t} \in A$ and receives reward $r_{t} \in R$ . Then, we perform the simple policy gradient update $θ_{t + 1} = θ_{t} + r_{t} \nabla_{θ} log (P (a_{t} | π_{θ_{t}})) .$
        
        Now, we can ask the question, would $π_{θ_{t}}$ be a mesa-optimizer? The first thing that’s worth noting is that the above setup is precisely the standard RL training setup—the only difference is that there’s no deployment stage. What that means, though, is that if standard RL training produces a mesa-optimizer, then this will produce a mesa-optimizer too, because the training process isn’t different in any way whatsoever. If $π$ is acting in a diverse environment that requires search to be able to be solved effectively, then $π$ will still need to learn to do search—the fact that there won’t ever be a deployment stage in the future is irrelevant to $π$ ’s current training dynamics (unless $π$ is deceptive and knows there won’t be a deployment stage—that’s the only situation where it might be relevant).
        
        Given that, we can ask the question of whether $π$ , if it’s a mesa-optimizer, is likely to be misaligned—and in particular whether it’s likely to be deceptive. Again, in terms of proxy alignment, the training process is exactly the same, so the picture isn’t any different at all—if there are simpler, easier-to-optimize-for proxies, then $π$ is likely to learn those instead of the true base objective. Like I mentioned previously, however, deceptive alignment is the one case where it might matter that you’re doing online learning, since if the model knows that it might do different things based on that fact. However, there are still lots of reasons why a model might be deceptive even in an online learning setup—for example, it might expect better opportunities for defection in the future, and thus want to prevent being modified now so that it can defect when it’ll be most impactful.
        johnswentworth Aug 14, 2020, 11:20 PM
        LW: 5 AF: 4
        AF Parent
        When I say “optimize all the parameters at runtime”, I do not mean “take one gradient step in between each timestep”. I mean, at each timestep, fully optimize all of the parameters. Optimize $θ$ all the way to convergence before every single action.
        Think back to the central picture of mesa-optimization (at least as I understand it). The mesa-optimizer shows up because some data is only available at runtime, not during training, so it has to be processed at runtime using parameters selected during training. In the online RL setup you sketch here, “runtime” for mesa-optimization purposes is every time the system chooses its action—i.e. every timestep—and “training” is all the previous timesteps. A mesa-optimizer should show up if, at every timestep, some relevant new data comes in and the system has to process that data in order to choose the optimal action, using parameters inherited from previous timesteps.
        Now, suppose we fully optimize all of the parameters at every timestep. The objective function for this optimization would presumably be $\sum_{t} r_{t} log (P [a_{t} | π_{θ}])$ , with the sum taken over all previous data points, since that’s what the RL setup is approximating.
        This optimization would probably still “find” the same mesa-optimizer as before, but now it looks less like a mesa-optimizer problem and more like an outer alignment problem: that objective function is probably not actually the thing we want. The fact that the true optimum for that objective function probably has our former “mesa-optimizer” embedded in it is a pretty strong signal that that objective function itself is not outer aligned; the true optimum of that objective function is not really the thing we want.
        Does that make sense?
        evhub Aug 14, 2020, 11:50 PM
        LW: 2 AF: 1
        AF Parent
        The RL process is actually optimizing $E [\sum_{t} r_{t}]$ , the log just comes from the REINFORCE trick. Regardless, I’m not sure I understand what you mean by optimizing fully to convergence at each timestep—convergence is a limiting property, so I don’t know what it could mean do it for a single timestep. Perhaps you mean just taking the optimal policy $π^{*}$ such that $π^{*} = {argmax}_{π} E [\sum t r_{t} | π] ?$ In that case, that is in fact the definition of outer alignment I’ve given in the past, so I agree that whether $π^{*}$ is aligned or not is an outer alignment question.
        What links here?
        “Inner Alignment Failures” Which Are Actually Outer Alignment Failures by johnswentworth (Oct 31, 2020, 8:18 PM; 66 points)
        Expand this thread
        johnswentworth Aug 15, 2020, 2:25 AM
        LW: 6 AF: 3
        AF Parent
        Sure, $π^{*}$ works for what I’m saying, assuming that sum-over-time only includes the timesteps taken thus far. In that case, I’m saying that either:
        the mesa optimizer doesn’t appear in $π^{*}$ , in which case the problem is fixed by fully optimizing everything at every timestep (i.e. by using $π^{*}$ ), or
        the mesa optimizer does appear in $π^{*}$ , in which case the problem was really an outer alignment issue all along.
      - Ben Pace Aug 14, 2020, 5:50 PM
        LW: 4 AF: 2
        AF Parent
        Thank you for being so clear.
        On 2, I’m surprised if you think that natural selection isn’t a natural abstraction but that eudaemonia is. (If we’re getting an AGI singleton that want to fully learn our values.)
        Secondly I’ll say that if we do not understand it’s representation of X or X-prime, and if a small difference will be catastrophic, then that will also lead to doom.
        On 1: I think that’s quite plausible? Like, I assign something in the range of 20-60% probability to that. How much does it have to change for you to feel much safer about inner alignment?
        (I’m also not that clear it only applies to this situation. Perhaps I’m mistaken, but in my head subsystem alignment and robust delegation both have this property of ”build a second optimiser that helps achieve your goals” and in both cases passing on the true utility function seems very hard.)
        johnswentworth Aug 14, 2020, 7:58 PM
        LW: 2 AF: 1
        AF Parent
        On 2, I’m surprised if you think that natural selection isn’t a natural abstraction but that eudaemonia is.
        Currently, my first-pass check for “is this probably a natural abstraction?” is “can humans usually figure out what I’m talking about from a few examples, without a formal definition?”. For human values, the answer seems like an obvious “yes”. For evolutionary fitness… nonobvious. Humans usually get it wrong without the formal definition.
        Also, natural abstractions in general involve summarizing the information from one chunk of the universe which is relevant “far away”. For human values, the relevant chunk of the universe is the human—i.e. the information about human values is all embedded in the physical human. But for evolutionary fitness, that’s not the case—an organism does not contain all the information relevant to calculating its evolutionary fitness. So it seems like there’s some qualitative difference there—like, human values “live” in humans, but fitness doesn’t “live” in organisms in the same way. I still don’t feel like I fully understand this, though.
        On 1: I think that’s quite plausible? Like, I assign something in the range of 20-60% probability to that.
        Sure, inner alignment is a problem which mainly applies to architectures similar to modern ML, and modern ML architecture seems like the most-likely route to AGI.
        It still feels like outer alignment is a much harder problem, though. The very fact that inner alignment failure is so specific to certain architectures is evidence that it should be tractable. For instance, we can avoid most inner alignment problems by just optimizing all the parameters simultaneously at run-time. That solution would be too expensive in practice, but the point is that inner alignment is hard in a “we need to find more efficient algorithms” sort of way, not a “we’re missing core concepts and don’t even know how to solve this in principle” sort of way. (At least for mesa-optimization; I agree that there are more general subsystem alignment/robust delegation issues which are potentially conceptually harder.)
        Outer alignment, on the other hand, we don’t even know how to solve in principle, on any architecture whatsoever, even with arbitrary amounts of compute and data. That’s why I expect it to be a bottleneck.
        Vaniver Aug 16, 2020, 4:58 AM
        LW: 14 AF: 5
        AF Parent
        Currently, my first-pass check for “is this probably a natural abstraction?” is “can humans usually figure out what I’m talking about from a few examples, without a formal definition?”. For human values, the answer seems like an obvious “yes”. For evolutionary fitness… nonobvious. Humans usually get it wrong without the formal definition.
        Hmm, presumably you’re not including something like “internal consistency” in the definition of ‘natural abstraction’. That is, humans who aren’t thinking carefully about something will think there’s an imaginable object even if any attempts to actually construct that object will definitely lead to failure. (For example, Arrow’s Impossibility Theorem comes to mind; a voting rule that satisfies all of those desiderata feels like a ‘natural abstraction’ in the relevant sense, even though there aren’t actually any members of that abstraction.)
        johnswentworth Aug 16, 2020, 2:51 PM
        LW: 8 AF: 4
        0
        AF Parent
        Oh this is fascinating. This is basically correct; a high-level model space can include models which do not correspond to any possible low-level model.
        One caveat: any high-level data or observations will be consistent with the true low-level model. So while there may be natural abstract objects which can’t exist, and we can talk about those objects, we shouldn’t see data supporting their existence—e.g. we shouldn’t see a real-world voting system behaving like it satisfies all of Arrow’s desiderata.
        Ben Pace Aug 15, 2020, 12:37 AM
        LW: 2 AF: 1
        AF Parent
        Regarding your first pass check for naturalness being whether humans can understand it: strike me thoroughly puzzled. Isn’t one of the core points of the reductionism sequence that, while “thor caused the thunder” sounds simpler to a human than Maxwell’s equations (because the words fit naturally into a human psychology), one of them is much “simpler” in an absolute sense than the other (and is in fact true).
        Regarding your point about the human values living in humans while the organism’s fitness is living partly in the environment, nothing immediately comes to mind to say here, but I agree it’s a very interesting question.
        The things you say about inner/outer alignment hold together quite sensibly. I am surprised to hear you say that mesa optimisers can be avoided by just optimizing all the parameters simultaneously at run-time. That doesn’t match my understanding of mesa optimisation, I thought the mesa optimisers would definitely arise during the training, but if you’re right that it’s trivial-but-expensive to remove them there then I agree it’s intuitively a much easier problem than I had realised.
        johnswentworth Aug 15, 2020, 2:41 AM
        LW: 3 AF: 1
        0
        AF Parent
        Regarding your first pass check for naturalness being whether humans can understand it: strike me thoroughly puzzled. Isn’t one of the core points of the reductionism sequence that, while “thor caused the thunder” sounds simpler to a human than Maxwell’s equations (because the words fit naturally into a human psychology), one of them is much “simpler” in an absolute sense than the other (and is in fact true).
        Despite humans giving really dumb verbal explanations (like “Thor caused the thunder”), we tend to be pretty decent at actually predicting things in practice.
        The same applies to natural abstractions. If I ask people “is ‘tree’ a natural category?” then they’ll get into some long philosophical debate. But if I show someone five pictures of trees, then show them five other picture which are not all trees, and ask them which of the second set are similar to the first set, they’ll usually have no trouble at all picking the trees in the second set.
        I thought the mesa optimisers would definitely arise during the training
        If you’re optimizing all the parameters simultaneously at runtime, then there is no training. Whatever parameters were learned during “training” would just be overwritten by the optimal values computed at runtime.
        Ben Pace Aug 15, 2020, 7:05 AM
        LW: 2 AF: 1
        AF Parent
        Despite humans giving really dumb verbal explanations (like “Thor caused the thunder”), we tend to be pretty decent at actually predicting things in practice.
        Mm, quantum mechanics much? I do not think I can reliably tell you which experiments are in the category “real” and the category “made up”, even though it’s a very simple category mathematically. But I don’t expect you’re saying this, I just am still confused what you are saying.
        This reminds me of Oli’s question here, which ties into Abram’s “point of view from somewhere” idea. I feel like I expect ML-systems to take the point of view of the universe, and not learn our natural categories.
        johnswentworth Aug 15, 2020, 4:35 PM
        LW: 7 AF: 3
        0
        AF Parent
        I’m talking everyday situations. Like “if I push on this door, it will open” or “by next week my laundry hamper will be full” or “it’s probably going to be colder in January than June”. Even with quantum mechanics, people do figure out the pattern and build some intuition, but they need to see a lot of data on it first and most people never study it enough to see that much data.
        In places where the humans in question don’t have much first-hand experiential data, or where the data is mostly noise, that’s where human prediction tends to fail. (And those are also the cases where we expect learning systems in general to fail most often, and where we expect the system’s priors to matter most.) Another way to put it: humans’ priors aren’t great, but in most day-to-day prediction problems we have more than enough data to make up for that.