Thanks, Richard!

I do think both of those cases fit into the framework fine (unless I’m misunderstanding what you have in mind):
In the first case, we’re training a model in an environment. As it gets more capable, it reaches a point where it can find new, harmful behaviors in some set of situations. Our worries are now that (1) we can’t recognize that behavior as harmful, or (2) we don’t visit those situations during training, but they do in fact come up in practice (distribution shift). If we say “but the version of the model we had yesterday, before all this additional training, didn’t behave badly in this situation!”, that just seems like sloppy training work—it’s not clear why we should expect the behavior of an earlier version of a model to bind a later version.
In the second case, it sounds like you’re imagining us watching evolution and thinking “let’s evolve humans that are reproductively fit but aren’t dangerous to other species.” We train the humans a lot in the ancestral environment, and see that they don’t hurt other species much. But then the humans change the environment a lot, and in the new situations they create, they hurt other species a lot. In this case, I think it’s pretty clear that the distribution has shifted. We might wish we’d done something earlier to certify that humans wouldn’t hurt other species much under any circumstances, or that we’d deployed humans in some sandbox so we could keep the high-level distribution of situations the same, or that we’d dealt with high-level distribution shift in some other way.
In other words, if we imagine a model misbehaving in the wild, I think it’ll usually either be the case that (1) it behaved that way during training but we didn’t notice the badness (evaluation breakdown), or (2) we didn’t train it on a similar enough situation (high-level distribution shift).
As we move further away from standard DL training practices, we could see failure modes that don’t fit into these two categories—e.g. there could be some bad fixed-point behaviors in amplification that aren’t productively thought of as “evaluation breakdown” or “high-level distribution shift.” But these two categories do seem like the most obvious ways that current DL practice could produce systematically harmful behavior, and I think they take up a pretty large part of the space of possible failures.
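To make the two categories concrete, here is a minimal toy sketch (entirely illustrative, not something from the discussion above: the situation indices, the proxy evaluator, and the assumption about how the model generalizes to unvisited situations are all stipulations for the example). A tabular “policy” is trained against a proxy reward that fails to flag harm in one visited situation, and two situations never come up in training at all:

```python
# Toy sketch only: the situation indices, the proxy evaluator, and the
# "pessimistic generalization" below are illustrative assumptions, not
# claims about how real training or deployment works.

N_SITUATIONS = 10
TRAIN_SITUATIONS = set(range(8))   # situations 8 and 9 never appear during training
ACTIONS = (0, 1)                   # 0 = benign action, 1 = harmful but high-reward action

def true_harm(situation, action):
    """Ground truth we care about but never observe directly."""
    return action == 1

def proxy_reward(situation, action):
    """Training signal: our harm check misses the harm in situation 3
    (evaluation breakdown), so the harmful action looks great there."""
    base = 1.0 if action == 1 else 0.5
    caught = true_harm(situation, action) and situation != 3
    return base - (10.0 if caught else 0.0)

# "Training": pick the proxy-reward-maximizing action in each visited situation.
policy = {}
for s in range(N_SITUATIONS):
    if s in TRAIN_SITUATIONS:
        policy[s] = max(ACTIONS, key=lambda a: proxy_reward(s, a))
    else:
        # Never visited during training; assume (pessimistically, purely for
        # illustration) that the learned behavior generalizes harmfully here.
        policy[s] = 1

# Deployment: every situation, including 8 and 9, now comes up.
harmful = [s for s in range(N_SITUATIONS) if true_harm(s, policy[s])]
print(harmful)  # [3, 8, 9]: 3 is evaluation breakdown, 8 and 9 are distribution shift
```

In this framing, fixing situation 3 requires a better evaluator, while fixing situations 8 and 9 requires better coverage of (or robustness to) the deployment distribution.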
(ETA: I want to reiterate that these two problems are restatements of earlier thinking, esp. by Paul and Evan, and not ideas I’m claiming are new at all; I’m using my own terms for them because “inner” and “outer” alignment have different meanings for different people.)
(Short low-effort reply since we’ll be talking soon.)
> we don’t visit those situations during training, but they do in fact come up in practice (distribution shift)
If you’re using this definition of distributional shift, then isn’t any catastrophic misbehaviour a distributional shift problem by definition, since the agent didn’t cause catastrophes in the training environment?
In general I’m not claiming that distributional shift isn’t happening in the lead-up to catastrophes; I’m denying that it’s an interesting way to describe what’s going on. An unfair straw analogy: it feels kinda like saying “the main problem in trying to make humans safe is that some humans might live in different places now than we did when we evolved. Especially harmful behaviour could occur under big locational shifts”. Which is… not wrong; most dangerous behaviour doesn’t happen in sub-Saharan Africa. But it doesn’t shed much light on what’s happening: the danger is being driven by our cognition, not by high-level shifts in our environments.