The algorithm “search over lots of actions to find the one for which Q(a) is maximized” is a pretty good algorithm, one that you need to be able to use at test time in order to remain competitive.
This search is done by the overseer (or the counterfactual overseer, or the trained model), and is likely to be safer than the search done by the RL training process. The overseer has more information about the specific situation that calls for a search, and can tailor the search process to fit that situation: how to avoid generating a candidate action that might trigger a flaw in Q, how many candidates to test, and so on. The RL search process, by contrast, is dumb, and whatever safety mechanisms or constraints are put on it have to be decided by the AI designer ahead of time and applied uniformly to every situation.
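The kind of overseer-controlled search described above can be sketched as a best-of-n loop. This is a minimal illustration, not anything from the source: `propose`, `q`, `n_candidates`, and `is_suspicious` are all hypothetical names standing in for a candidate generator, the learned value estimate Q, and the situation-specific parameters the overseer gets to choose at test time.

```python
def overseer_search(propose, q, n_candidates=16, is_suspicious=lambda a: False):
    """Best-of-n search: sample candidate actions and return the one that
    maximizes the learned estimate q. Unlike a blind RL search, the overseer
    can set n_candidates and the is_suspicious filter per situation, e.g. to
    avoid candidates that might exploit a flaw in Q."""
    best_action, best_q = None, float("-inf")
    for _ in range(n_candidates):
        a = propose()
        if is_suspicious(a):  # overseer-supplied guard, tailored to the situation
            continue
        qa = q(a)
        if qa > best_q:
            best_action, best_q = a, qa
    return best_action
```

The point of routing the search through the overseer is visible in the signature: every knob (candidate count, filtering) is an argument chosen per call, rather than a constraint fixed once by the designer and applied to every situation.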