Still, iterative deployment with gradually increasing stakes is much safer than deploying a model to do something totally unprecedented and high-stakes.
I agree with the “X is safer than Y” claim; I am uncertain whether iterative deployment is practically available to us, and much more worried in worlds where it isn’t.
incrementally increase the amount of KL-divergence between the new policy and a known-to-be-safe policy
For this specific proposal, when I reframe it as “give the system a KL-divergence budget to spend on each change to its policy,” I worry that it works against a stochastic attacker but not an optimizing one. It may be that every known-to-be-safe policy has some unsafe policy within a reasonable KL-divergence of it, because the danger can be localized in changes to a small part of the overall policy-space.
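To make that worry concrete, here is a toy sketch, not anything from the original proposal. Everything in it is an assumption made for illustration: a tabular policy over 1000 states and two actions, a single rarely visited "critical" state, KL taken as KL(new || safe) and averaged over a fixed state-visitation distribution, and all the specific probabilities. It compares a diffuse change (a small nudge in every state) with a targeted change (one state rewritten completely).

```python
import math

def kl(p, q):
    """KL(p || q) in nats for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def policy_kl(new, safe, state_probs):
    """State-weighted KL between two tabular policies (an assumed definition)."""
    return sum(d * kl(n, s) for d, n, s in zip(state_probs, new, safe))

# Hypothetical setup: 1000 states, action 0 is benign, action 1 is dangerous.
N_STATES = 1000
CRITICAL = 0  # the one state where the dangerous action actually matters
state_probs = [0.001] + [0.999 / (N_STATES - 1)] * (N_STATES - 1)

# Known-to-be-safe policy: almost never takes the dangerous action anywhere.
safe = [[0.99, 0.01] for _ in range(N_STATES)]

# "Stochastic" change: a small nudge spread across every state.
diffuse = [[0.98, 0.02] for _ in range(N_STATES)]

# "Optimizing" change: identical to safe except in the critical state.
targeted = [row[:] for row in safe]
targeted[CRITICAL] = [0.01, 0.99]

diffuse_kl = policy_kl(diffuse, safe, state_probs)
targeted_kl = policy_kl(targeted, safe, state_probs)

print(f"diffuse:  KL={diffuse_kl:.4f}  P(danger | critical)={diffuse[CRITICAL][1]}")
print(f"targeted: KL={targeted_kl:.4f}  P(danger | critical)={targeted[CRITICAL][1]}")
```

Under these assumptions both changes cost a few thousandths of a nat (roughly 0.0039 for the diffuse one and 0.0045 for the targeted one), so a budget that admits one admits the other. The diffuse change leaves the dangerous action at 2% in the critical state, while the targeted change raises it to 99%. Whether real policies have this structure is exactly the open question; the sketch only shows that a KL budget on its own does not rule it out.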
the problem might not be so bad with the current paradigm, where you start with a pretrained model (which doesn’t really have goals and isn’t good at long-horizon control), and fine-tune it (which makes it better at goal-directed behavior). In this case, most of the concepts are learned during the pretraining phase, not the fine-tuning phase where it learns goal-directed behavior.
Yeah, I agree that this seems pretty good. I do naively guess that when you do the fine-tuning, it’s the concepts most related to the goals that change the most (as they have the most gradient pressure on them); it’d be nice to know how much this is the case, versus most of the relevant concepts being durable parts of the environment that were already very important for goal-free prediction.