Thanks! I’m planning to release an advent calendar of hot takes, and this gives me fodder for a few :P
My short notes that I’ll expand in the advent calendar:
The notions of inner and outer alignment make a lot of sense in model-free deep RL, and a related kind of sense in RL using RNNs or black-box program search, but other architectures will have their own generalization problems, and those problems present differently. Model-based deep RL, in particular, has its own generalization problems, but none of them are really “inner alignment.”
No, IDA and RLHF are not solutions to outer alignment. Alignment solutions that work only if humans converge to sensible behavior will not work. Humans in bureaucracies can do things that don’t serve the interests of the whole, and humans can be deceived (e.g. the claw-in-front-of-the-ball example from RLHF). [aside: I wrote what I think is an interesting post about HCH / IDA.]
Human bureaucracies are mostly misaligned because the actual bureaucratic actors are also misaligned. I think a “bureaucracy” of perfectly aligned humans (like EA but better) would be well aligned. RLHF is obviously not a solution in the limit, but I don’t think it’s extremely implausible that it is outer-aligned enough to work, though I am much more enthusiastic about IDA.