It was a relatively fringe topic that only recently got the attention of a large number of real researchers. And parts of it may need large amounts of computational power afforded only by superhuman narrow AI.
There have been a few random PhD dissertations saying the topic is hard, but as far as I can tell there has only recently been a push for a group effort by capable and well-funded actors (e.g. OpenAI's interpretability research).
As an outsider, I don't trust older alignment research much. It seems to me that Yud has built a cult of personality around AI dooming and is thus motivated to find reasons why alignment is not possible. Most of his followers treat his initial ideas as axiomatic principles and don't dare to challenge them. And lastly, most past alignment research seems to have been done by those followers.
Unfortunately, we do not have the luxury of experimenting with dangerous AI systems to see whether they cause human extinction or not. When it comes to extinction, we do not get another chance to test.
For example, this is an argument that has been disputed, with varying degrees of persuasiveness (warning shots, the incomputability of most dangerous plans), but it is still treated as a fundamental truth on this site.
I think my point is to lower the bar to there just being a non-trivial probability of it following the rule. Fully aligning AIs to near certainty may be a higher bar than just potentially aligning AI.
Align with arbitrary values without the possibility of inner deception. If it is easy to verify the values of an agent to near certainty, it seems to follow that we can more or less bootstrap alignment, with weaker agents inductively aligning stronger ones.