Against mimicry is mostly motivated by the case of imitating an amplified agent. (I try to separate the problems of distillation and amplification, and imitation learning is a candidate for the mimicry/distillation step.)
You could try to avoid the RL exploiting a security vulnerability in the overseer by:
- Doing something like quantilization, where you are constrained to stay near the original policy (e.g., imposing a KL constraint that prevents the policy from drifting too far from an attempted imitation; a rough sketch follows this list).
- Doing something like meeting halfway.
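For concreteness, here is a minimal sketch of the KL-constraint idea in the first bullet (my own illustration, not something from the thread): the RL objective is penalized for diverging from a frozen attempted-imitation policy, so the optimizer cannot wander far enough to find exotic overseer-exploiting behaviors. The function and argument names (`kl_constrained_loss`, `beta`, etc.) are assumptions.

```python
# Sketch only: policy-gradient surrogate loss with a KL penalty toward a frozen
# attempted-imitation policy, so the learned policy stays near the imitation.
import torch
import torch.nn.functional as F

def kl_constrained_loss(logits_policy, logits_imitation, advantages, actions, beta=1.0):
    logp = F.log_softmax(logits_policy, dim=-1)         # current policy log-probs
    logp_ref = F.log_softmax(logits_imitation, dim=-1)  # frozen imitation log-probs
    # Standard policy-gradient term: reinforce actions with positive advantage.
    chosen_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg_term = -(advantages * chosen_logp).mean()
    # KL(pi || pi_imitation): a large beta keeps the policy close to the imitation.
    kl_term = (logp.exp() * (logp - logp_ref)).sum(-1).mean()
    return pg_term + beta * kl_term
```

Quantilization proper would instead sample from the top fraction of the imitation policy's action distribution rather than optimize a penalized objective; the KL penalty above is just the simplest stand-in for "constrained to be near the original policy."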
These solutions seem tricky but maybe helpful. But my bigger concern is that you need to fix security vulnerabilities anyway:
- The algorithm “Search over lots of actions to find the one for which Q(a) is maximized” is a pretty good algorithm, one that you need to be able to use at test time in order to be competitive and that seems to require security (a sketch follows this list).
- Iterated amplification does optimization anyway (by amplifying the optimization done by the individual humans), and without security you are going to have problems there.
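To make the first bullet above concrete, here is a minimal sketch (my own, with hypothetical `Q` and `propose_action` stand-ins) of the test-time search it refers to, and of why its usefulness comes bundled with a security requirement on Q:

```python
# Sketch only: best-of-n search against an evaluation function Q.
import random

def best_of_n(Q, propose_action, n=10_000, seed=0):
    """Return the candidate action with the highest evaluated Q-value.

    If Q has any region where it mistakenly assigns a high value (a security
    vulnerability), a large enough n makes this search likely to find and
    return exactly that action, which is why Q itself needs to be secure.
    """
    rng = random.Random(seed)
    candidates = (propose_action(rng) for _ in range(n))
    return max(candidates, key=Q)
```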
> The algorithm “Search over lots of actions to find the one for which Q(a) is maximized” is a pretty good algorithm, one that you need to be able to use at test time in order to be competitive and that seems to require security.
This search is done by the overseer (or the counterfactual overseer, or the trained model), and is likely to be safer than the search done by the RL training process. The overseer has more information about the specific situation that calls for a search and can tailor the search process to fit it, including how to avoid generating candidate actions that might trigger a flaw in Q, how many candidates to test, and so on. The RL search process, by contrast, is dumb: whatever safety mechanisms or constraints are put on it have to be decided by the AI designer ahead of time and applied to every situation.
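The contrast being drawn can be pictured with a small sketch (mine, all names hypothetical): the designer fixes one search configuration for the whole of RL training, whereas the overseer can assemble a fresh configuration, candidate generator, filters, and budget included, for each situation it encounters.

```python
# Sketch only: a search whose configuration can either be fixed once by the
# designer (RL-style) or chosen per situation by the overseer.
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class SearchConfig:
    propose: Callable[[], Any]                   # how candidates are generated
    is_safe_to_evaluate: Callable[[Any], bool]   # screen out candidates likely to trigger a flaw in Q
    num_candidates: int                          # how many candidates to test

def run_search(Q: Callable[[Any], float], cfg: SearchConfig) -> Optional[Any]:
    candidates = [cfg.propose() for _ in range(cfg.num_candidates)]
    vetted = [a for a in candidates if cfg.is_safe_to_evaluate(a)]
    return max(vetted, key=Q) if vetted else None

# RL training: one SearchConfig chosen ahead of time and applied everywhere.
# Overseer: can construct a new SearchConfig tailored to the situation at hand.
```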
> You could try to avoid the RL exploiting a security vulnerability in the overseer by:
OK, I would be interested to learn more about the solution you think is most promising here. Is it written up anywhere yet, or is there something in the ML or AI safety literature that you can point me to?
I mostly hope to solve this problem with security amplification (see also).
I think when I wrote my comment I had already internalized “security amplification” and was assuming an LBO/meta-execution-style system, but I was worried about flaws introduced by meta-execution itself (doing meta-execution seems like programming, in that you have to break tasks down into sub-tasks and can make a mistake in the process). This seems related to what you wrote in the security amplification post:
> Note that security amplification + distillation will only remove the vulnerabilities that came from the human. We will still be left with vulnerabilities introduced by our learning process, and with any inherent limits on our model’s ability to represent/learn a secure policy. So we’ll have to deal with those problems separately.
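For what it’s worth, the worry above, that meta-execution is like programming and so can introduce its own flaws, can be pictured with a minimal sketch (my own, with hypothetical `decompose`/`recombine`/`answer_directly` helpers): the decomposition and recombination steps play the role of code, and a mistake there is introduced by the scheme itself rather than by any single queried human.

```python
# Sketch only: answering a task by recursive decomposition into sub-tasks.
from typing import Callable, List

def meta_execute(task: str,
                 decompose: Callable[[str], List[str]],
                 recombine: Callable[[str, List[str]], str],
                 answer_directly: Callable[[str], str],
                 depth: int = 2) -> str:
    """Answer `task` by breaking it into sub-tasks, bottoming out in direct answers."""
    if depth == 0:
        return answer_directly(task)
    subtasks = decompose(task)            # a flawed decomposition here is a bug...
    sub_answers = [meta_execute(t, decompose, recombine, answer_directly, depth - 1)
                   for t in subtasks]
    return recombine(task, sub_answers)   # ...as is a flawed recombination here
```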