Is there any good AI alignment research that you don’t classify as deconfusion? If so, can you give some examples?
Sure.
Any proposed solution to AI Alignment isn't deconfusion. It might have a bit of deconfusion at the start, and maybe studying it reveals new confusion to solve, but most of it is problem solving instead of deconfusion:
- IDA
- Debate
- Recursive Reward Modeling
- 11 proposals
Work in interpretability might involve deconfusion (to clarify what one is searching for), but past that point it isn't deconfusion anymore:
- Circuits
- Neural net generalization
- Knowledge Neurons in Transformers
Just like in normal science, once one has defined a paradigm or a problem, working within it is mostly not deconfusion anymore:
- John's work on the natural abstraction hypothesis
- Alex's use of POWER to study instrumental power-seeking
All in all, I think there are many more examples. It's just that deconfusion almost always plays a part, because we don't have one unified paradigm or approach that does the deconfusion for us. But actual problem solving, and most of normal science, are not deconfusion from my perspective.