paulfchristiano comments on Can corrigibility be learned safely?

paulfchristiano 7 May 2018 1:59 UTC
LW: 2 AF: 1
0
AF
It seems counterintuitive / inelegant that you have to worry about the safety of learned / opaque structures in meta-execution, and then again in the distillation step.
I agree, I think it’s unlikely the final scheme will involve doing this work in two places.
Why don’t we let the overseer directly train some auxiliary ML models at each iteration of IDA, using whatever data the overseer can obtain (in this case empirical measurements of molecule properties) and whatever transparency / robustness methods the overseer wants to use, and then make those auxiliary models available to the overseer at the next iteration?
This a way that things could end up looking. I think there are more natural ways to do this integration though.
Note that in order for any of this to work, amplification probably needs to be able to replicate/verify all (or most) of the cognitive work the ML model does implicitly, so that we can do informed oversight. There w opaque heuristics that “just work,” which are discovered either by ML or metaexecution trial-and-error, but then we need to confirm safety for those heuristics.