If we can test whether the model is behaving badly on a given input, then we can use Opt to search for any input where the model behaves badly. So we can end up with a system that works well on distribution and doesn’t work poorly off distribution. If it’s possible to handle outer alignment in this setting, I expect inner alignment will also be possible. (Though it’s not obvious, since you might take longer to learn given an unaligned learned optimizer.)
In the case of deceptive alignment, our ability to test whether the model behaves badly on input x affects the behavior of the model on input x (and similarly, our ability to come up with a U s.t.Opt(U) allows us to find—xif the model behaves badly on input—xaffects the behavior of the model on input x).
Therefore, to the extent that deceptive alignment is plausible in programs that Opt outputs, the inner alignment problem seems to me very hard.
That seems fine though. If the model behaves badly on any input we can test that. If the model wants to behave well on every input, then we’re happy. If it wants to behave badly on some input, we’ll catch it.
Are you concerned that we can’t test whether the model is behaving badly on a particular input? I think if you have that problem you are also in trouble for outer alignment.
When you say “test” do you mean testing by writing a single program that outputs whether the model performs badly on a given input (for any input)?
If so, I’m concerned that we won’t be able to write such a program.
If not (i.e. if we only assume that human researchers can safely figure out whether the model behaves badly on a given input), then I don’t understand how we can use Opt to find an input that the model behaves badly on (in a way that would work even if deceptive alignment occurs).
When you say “test” do you mean testing by writing a single program that outputs whether the model performs badly on a given input (for any input)?
If so, I’m concerned that we won’t be able to write such a program.
That’s the hope. (Though I assume we mostly get it by an application of Opt, or more efficiently by modifying our original invocation of Opt to return a program with some useful auxiliary functions, rather than by writing it by hand.)
If we can test whether the model is behaving badly on a given input, then we can use Opt to search for any input where the model behaves badly. So we can end up with a system that works well on distribution and doesn’t work poorly off distribution. If it’s possible to handle outer alignment in this setting, I expect inner alignment will also be possible. (Though it’s not obvious, since you might take longer to learn given an unaligned learned optimizer.)
In the case of deceptive alignment, our ability to test whether the model behaves badly on input x affects the behavior of the model on input x (and similarly, our ability to come up with a U s.t.Opt(U) allows us to find—xif the model behaves badly on input—xaffects the behavior of the model on input x).
Therefore, to the extent that deceptive alignment is plausible in programs that Opt outputs, the inner alignment problem seems to me very hard.
That seems fine though. If the model behaves badly on any input we can test that. If the model wants to behave well on every input, then we’re happy. If it wants to behave badly on some input, we’ll catch it.
Are you concerned that we can’t test whether the model is behaving badly on a particular input? I think if you have that problem you are also in trouble for outer alignment.
When you say “test” do you mean testing by writing a single program that outputs whether the model performs badly on a given input (for any input)?
If so, I’m concerned that we won’t be able to write such a program.
If not (i.e. if we only assume that human researchers can safely figure out whether the model behaves badly on a given input), then I don’t understand how we can use Opt to find an input that the model behaves badly on (in a way that would work even if deceptive alignment occurs).
That’s the hope. (Though I assume we mostly get it by an application of Opt, or more efficiently by modifying our original invocation of Opt to return a program with some useful auxiliary functions, rather than by writing it by hand.)