(Short low-effort reply since we’ll be talking soon.)
> we don’t visit those situations during training, but they do in fact come up in practice (distribution shift)
If you’re using this definition of distributional shift, then isn’t any catastrophic misbehaviour a distributional shift problem by definition, since the agent didn’t cause catastrophes in the training environment?
In general I’m not claiming that distributional shift isn’t happening in the lead-up to catastrophes; I’m denying that it’s an interesting way to describe what’s going on. An unfair straw analogy: it feels kinda like saying “the main problem in trying to make humans safe is that some humans might live in different places now than we did when we evolved. Especially harmful behaviour could occur under big locational shifts”. Which is… not wrong: most dangerous behaviour doesn’t happen in sub-Saharan Africa. But it doesn’t shed much light on what’s happening: the danger is driven by our cognition, not by high-level shifts in our environments.
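To make the “trivially true by definition” point concrete, here’s a toy sketch of my own (not from the thread, just the textbook setup, assuming nothing beyond numpy and a curve fit on a narrow input range): any model fit on one input region looks fine in training and arbitrarily bad elsewhere, so “the inputs were off-distribution” can be said of essentially any deployment failure.

```python
# Toy sketch (my own illustration, not from the thread): in the textbook
# sense, "distribution shift" just means deployment inputs differ from
# training inputs -- which is trivially true of almost any deployment
# failure, hence the question above.
import numpy as np

rng = np.random.default_rng(0)

# Train: fit a cubic polynomial to y = sin(x) on x in [0, 1].
x_train = rng.uniform(0.0, 1.0, size=200)
coeffs = np.polyfit(x_train, np.sin(x_train), deg=3)

# Deploy: same underlying function, but inputs never visited in training.
x_deploy = rng.uniform(5.0, 6.0, size=200)

def mse(x):
    """Mean squared error of the fitted cubic against the true sin(x)."""
    return np.mean((np.polyval(coeffs, x) - np.sin(x)) ** 2)

print(f"train MSE:  {mse(x_train):.5f}")   # tiny
print(f"deploy MSE: {mse(x_deploy):.2f}")  # huge: an "off-distribution failure"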