Max Harms comments on 3b. Formal (Faux) Corrigibility

Max Harms 11 Jun 2024 16:33 UTC
Yep. sim is additionally bad because it prevents the AI from meaningfully defending against manipulation by others. It’s worse than that, even, since the AI can’t even let the principal use general tools the AI provides (e.g. a fortress) to defend against being manipulated from outside. In the limit, this might result in the AI manipulating the principals on behalf of others who would’ve counterfactually influenced them. I consider the version I’ve provided to be obviously inadequate, and this is another pointer as to why.

Towards the end of the document, when I discuss time, I mention that it probably makes sense to take the P(V|pi_0) counterfactual for just the expected timestep, rather than across a broader swath of time. This helps alleviate some of the weirdness. Consider, for instance, a setup where the AI uses a quantum coin to randomly take no action with a 1/10^30 chance each minute, and otherwise it acts normally. We might model P(V|pi_0) as the machine’s model of what the principal’s values would be like if it randomly froze due to the quantum coin. Because it’s localized in time, I expect this is basically just “what the human currently values if the AI isn’t taking immediate actions.” This version of the AI would certainly be able to help defend the principal from outside manipulation, such as by (on demand) building the principal a secure fortress. Even though in aggregate the principal’s values diverge from the counterfactual where the AI always flipped the coin such that it took no action, the principal’s values will probably be very similar to a counterfactual where the coin flip caused the machine to freeze for one minute.
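
To make the local-versus-aggregate distinction concrete, here is a minimal toy sketch. It is not the formalism from the post: it assumes the principal's values can be summarized as a single number and that each minute of AI action nudges that number by a fixed, made-up amount (`influence`). All names (`step`, `trajectory`, `global_divergence`, `local_divergence`) are hypothetical and exist only for illustration.

```python
def step(value, ai_acts, influence=0.01):
    """One minute of a toy world: the value shifts only if the AI acts."""
    return value + (influence if ai_acts else 0.0)


def trajectory(v0, actions):
    """Principal's value after each minute, given a list of act/freeze flags."""
    values = [v0]
    for acts in actions:
        values.append(step(values[-1], acts))
    return values


def global_divergence(v0, horizon):
    """Gap between the actual world and one where the AI never acted at all."""
    actual = trajectory(v0, [True] * horizon)
    never_acted = trajectory(v0, [False] * horizon)
    return abs(actual[-1] - never_acted[-1])


def local_divergence(v0, horizon):
    """Largest gap between the actual world and one where the AI froze
    for just a single minute (the coin-flip counterfactual)."""
    actual = trajectory(v0, [True] * horizon)
    return max(
        abs(actual[t + 1] - step(actual[t], ai_acts=False))
        for t in range(horizon)
    )


if __name__ == "__main__":
    print("aggregate gap:", global_divergence(0.0, 1000))
    print("per-minute gap:", local_divergence(0.0, 1000))
```

Under these assumptions the aggregate gap grows with the horizon while the per-minute gap stays at roughly one step of influence, which is the sense in which the one-minute counterfactual leaves room for things like building a fortress on demand.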

Apologies for the feeling of a rug-pull. I do think corrigibility is a path to avoiding having to have an a priori understanding of human values, but I admit that the formalism proposed here involves the machine needing to develop at least a rough understanding of human values so that it knows how to avoid (locally) disrupting them. I think these are distinct features, and that corrigibility remains promising in how it sidesteps the need for an a priori model. I definitely agree that it’s disheartening how little progress there’s been on this front over the years.