You could try to avoid the RL exploiting a security vulnerability in the overseer by:
Ok, would be interested to learn more about the solution you think is most promising here. Is it written up anywhere yet, or is there something in the ML or AI safety literature that you can point me to?
I mostly hope to solve this problem with security amplification (see also).
I think when I wrote my comment I had already internalized “security amplification” and was assuming an LBO/meta-execution-style system, but I was worried about flaws introduced by meta-execution itself (since doing meta-execution seems like programming, in that you have to break tasks down into sub-tasks and can make a mistake in the process). This seems related to what you wrote in the security amplification post:
Note that security amplification + distillation will only remove the vulnerabilities that came from the human. We will still be left with vulnerabilities introduced by our learning process, and with any inherent limits on our model’s ability to represent/learn a secure policy. So we’ll have to deal with those problems separately.
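To make the “meta-execution is like programming” analogy above a bit more concrete, here is a minimal, hypothetical sketch (not taken from either post, and not the actual scheme): a meta-execution-style loop in which no single step ever sees more than a small bounded message, so a malicious input can’t hit the overseer all at once, while the hand-written decomposition logic is exactly the place where a bug could be introduced independently of the human’s own security. All names (`meta_execute`, `overseer_step`, `MAX_MESSAGE_CHARS`) are illustrative assumptions.

```python
# Hypothetical sketch of a meta-execution-style loop (illustrative only).
# Key property: each step only looks at a small, bounded message, limiting
# the attack surface of any single input -- but the decomposition code itself
# is "programming" and can contain mistakes, which is the worry raised above.

MAX_MESSAGE_CHARS = 100  # assumed bound on what any one step may look at


def overseer_step(message: str) -> str:
    """Stand-in for the (small, trusted) policy applied to one bounded message."""
    assert len(message) <= MAX_MESSAGE_CHARS
    # In reality this would be a learned or human policy; here it just echoes.
    return f"considered: {message!r}"


def meta_execute(task: str, depth: int = 0, max_depth: int = 3) -> str:
    """Answer a task by breaking it into bounded sub-tasks, like writing a program."""
    if len(task) <= MAX_MESSAGE_CHARS or depth >= max_depth:
        return overseer_step(task[:MAX_MESSAGE_CHARS])
    # "Decomposition" step: a naive split into halves. A mistake here (e.g.
    # splitting mid-token, dropping needed context) is a flaw introduced by
    # meta-execution itself, separate from any vulnerability in the overseer.
    mid = len(task) // 2
    left = meta_execute(task[:mid], depth + 1, max_depth)
    right = meta_execute(task[mid:], depth + 1, max_depth)
    return overseer_step(f"combine({len(left)},{len(right)})")


if __name__ == "__main__":
    print(meta_execute("some long task description " * 20))
```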