Though the list still doesn’t strike me as very novel—it feels that most of these conditions are conditions we’ve been shooting for anyways.
E.g. conditions 1, 2, and 5 are about selecting for behavior we approve of and condition 5 is just inspection with interpretability tools.
If you feel you have traction on conditions 3 and 4 though, that does seem novel (side-note that condition 4 seems to be a subset of condition 3). I feel skeptical though, since value extrapolation seems like about as hard of a problem as understanding machine generalization in general + the way a thing behaves in a large class of cases seems to be so complicated of a concept that you won’t be able to have confident beliefs about it or understand it. I don’t have a concrete argument about this though.
Anyways, thanks for responding, and if you have any thoughts about the tractability of conditions 3⁄4, I’m pretty curious.
Yes, the list isn’t very novel—I was trying to think of the mix of theoretical and practical results that convince us, in the current world, that a new approach will work. Obviously we want a lot more rigour for something like AI alignment! But there is an urgency to get it fast, too :-(
Thanks for this list!
Though the list still doesn’t strike me as very novel—it feels that most of these conditions are conditions we’ve been shooting for anyways.
E.g. conditions 1, 2, and 5 are about selecting for behavior we approve of and condition 5 is just inspection with interpretability tools.
If you feel you have traction on conditions 3 and 4 though, that does seem novel (side-note that condition 4 seems to be a subset of condition 3). I feel skeptical though, since value extrapolation seems like about as hard of a problem as understanding machine generalization in general + the way a thing behaves in a large class of cases seems to be so complicated of a concept that you won’t be able to have confident beliefs about it or understand it. I don’t have a concrete argument about this though.
Anyways, thanks for responding, and if you have any thoughts about the tractability of conditions 3⁄4, I’m pretty curious.
Yes, the list isn’t very novel—I was trying to think of the mix of theoretical and practical results that convince us, in the current world, that a new approach will work. Obviously we want a lot more rigour for something like AI alignment! But there is an urgency to get it fast, too :-(