Max Harms comments on 3b. Formal (Faux) Corrigibility

Max Harms 13 Jun 2024 18:40 UTC
Thanks. Picking out those excerpts is very helpful.
I’ve jotted down my current (confused) thoughts about human values.
But yeah, I basically think one needs to start with a hodgepodge of examples that are selected for being conservative and uncontroversial. I’d collect them by first identifying a robust set of very in-distribution tasks and contexts, then trying to exhaustively identify what manipulation would look like in that small domain, and then aggressively training on passivity outside of that known distribution. The early pseudo-agent will almost certainly be mis-generalizing in a bunch of ways, but if it’s set up cautiously we can plausibly expect that it’ll err on the side of caution, and that this caution can be gradually peeled back in a whitelist-style way as the experimentation phase proceeds and we attempt to nail down true corrigibility.
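
As a toy illustration of the whitelist-style setup, here is a minimal sketch. It is not a training procedure: it swaps the trained passivity for a hard-coded gate, and every name in it (`WhitelistGate`, `PASSIVE`, the context strings) is hypothetical. The only point is the shape of the default: act in vetted contexts, stay passive everywhere else, and widen the vetted set one context at a time.

```python
from dataclasses import dataclass, field

# Hypothetical name for the default "do nothing" action.
PASSIVE = "do_nothing"


@dataclass
class WhitelistGate:
    """Toy wrapper: act only in explicitly vetted contexts, else stay passive."""

    vetted: set = field(default_factory=set)

    def allow(self, context: str) -> None:
        # Peel back caution for one context, after it has been examined.
        self.vetted.add(context)

    def act(self, context: str, proposed_action: str) -> str:
        # Outside the known distribution, fall back to passivity.
        return proposed_action if context in self.vetted else PASSIVE


gate = WhitelistGate(vetted={"summarize_document"})
in_distribution = gate.act("summarize_document", "write_summary")
out_of_distribution = gate.act("negotiate_contract", "send_offer")

gate.allow("negotiate_contract")  # whitelist grows as experimentation proceeds
after_vetting = gate.act("negotiate_contract", "send_offer")
```

In a real system the hard part is exactly what this skips: deciding whether a context is in-distribution, and getting the passivity to generalize rather than hard-coding it.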