Current SotA systems are very opaque — we more-or-less can’t inspect or intervene on their thoughts — and it isn’t clear how we could navigate to AI approaches that are far less opaque, and that can carry forward to AGI. (Though it seems very likely such approaches exist somewhere in the space of AI research approaches.)
Yeah, it does seem like interpretability is a bottleneck for a lot of alignment proposals, and in particular, as long as neural networks are essentially black boxes, deceptive alignment/inner alignment issues seem almost impossible to address.
Seems right to me.