All four of those I think are basically useless in practice for purposes of progress toward aligning significantly-smarter-than-human AGI, including indirectly (e.g. via outsourcing alignment research to AI). There are perhaps some versions of all four which could be useful, but those versions do not resemble any work I’ve ever heard of anyone actually doing in any of those categories.
That said, many of those do plausibly produce value as propaganda for the political cause of AI safety, especially insofar as they involve demoing scary behaviors.
EDIT-TO-ADD: Actually, I guess I do think the singular learning theorists are headed in a useful direction, and that does fall under your “science of generalization” category. Though most of the potential value of that thread is still in interp, not so much black-box calculation of RLCTs.
I think we would all be interested to hear you elaborate on why you think these approaches have approximately no value. Perhaps this will be in a follow-up post.
All four of those I think are basically useless in practice for purposes of progress toward aligning significantly-smarter-than-human AGI, including indirectly (e.g. via outsourcing alignment research to AI).
It’s difficult for me to understand how this could be “basically useless in practice” for:
scalable oversight (using humans, and possibly giving them a leg up with e.g. secret communication channels between them, and rotating different humans when we need to simulate amnesia) - can we patch all of the problems with e.g. debate? Can we extract higher-quality work out of real-life misaligned expert humans for practical purposes (even if it’s maybe a bit cost-uncompetitive)?
It seems to me you’d want to understand and strongly show how and why different approaches here fail, and in any world where you have something like “outsourcing alignment research” you want some form of oversight.