Tom Davidson comments on Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy

Tom Davidson 1 Aug 2023 16:04 UTC
LW: 15 AF: 8
Linking to a post I wrote on a related topic, where I sketch a process (see diagram) for using this kind of red-teaming to iteratively improve your oversight process. (I’m more focussed on a scenario where you’re trying to offload as much as possible of the work of evaluating and improving your oversight process to AIs.)