Current SotA systems are very opaque — we more-or-less can’t inspect or intervene on their thoughts — and it isn’t clear how we could navigate to AI approaches that are far less opaque, and that can carry forward to AGI. (Though it seems very likely such approaches exist somewhere in the space of AI research approaches.)
Much more generally: we don’t have an alignment approach that could realistically work fast (say, within ten months of inventing AGI rather than ten years), in the face of a sharp left turn, given inevitable problems like “your first system will probably be very kludgey” and “having the correct outer training signal by default results in inner misalignment” and “pivotal acts inevitably involve trusting your AGI to do a ton of out-of-distribution cognitive work”.
> Current SotA systems are very opaque — we more-or-less can’t inspect or intervene on their thoughts — and it isn’t clear how we could navigate to AI approaches that are far less opaque, and that can carry forward to AGI. (Though it seems very likely such approaches exist somewhere in the space of AI research approaches.)
Yeah, it does seem like interpretability is a bottleneck for a lot of alignment proposals, and in particular, as long as neural networks are essentially black boxes, deceptive alignment/inner alignment issues seem almost impossible to address.
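To make "inspect" concrete: with current tooling, what we can typically read out of a network is raw activation tensors, e.g. via a forward hook in PyTorch. Here is a minimal sketch (assuming the Hugging Face transformers library and GPT-2 as a stand-in for a SotA model; this is illustrative only, not any particular interpretability agenda):

```python
# Minimal illustrative sketch: reading out raw activations from one GPT-2 block
# with a PyTorch forward hook. Requires `torch` and `transformers`.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

captured = {}

def save_activation(name):
    def hook(module, inputs, output):
        # For a GPT-2 block, output is a tuple; output[0] is the hidden-state tensor.
        captured[name] = output[0].detach()
    return hook

# Hook one transformer block (block 6 is an arbitrary choice).
model.h[6].register_forward_hook(save_activation("block_6"))

with torch.no_grad():
    inputs = tokenizer("The quick brown fox", return_tensors="pt")
    model(**inputs)

# What we get back: a (batch, sequence_length, hidden_size) tensor of floats,
# not anything resembling an inspectable "thought".
print(captured["block_6"].shape)  # e.g. torch.Size([1, 4, 768])
```

The point is that what comes out is a big block of floats; turning that into something you could meaningfully inspect or intervene on is exactly the open problem.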
I’d guess Nate might say one of:
> Current SotA systems are very opaque — we more-or-less can’t inspect or intervene on their thoughts — and it isn’t clear how we could navigate to AI approaches that are far less opaque, and that can carry forward to AGI. (Though it seems very likely such approaches exist somewhere in the space of AI research approaches.)

> Much more generally: we don’t have an alignment approach that could realistically work fast (say, within ten months of inventing AGI rather than ten years), in the face of a sharp left turn, given inevitable problems like “your first system will probably be very kludgey” and “having the correct outer training signal by default results in inner misalignment” and “pivotal acts inevitably involve trusting your AGI to do a ton of out-of-distribution cognitive work”.
> Yeah, it does seem like interpretability is a bottleneck for a lot of alignment proposals, and in particular, as long as neural networks are essentially black boxes, deceptive alignment/inner alignment issues seem almost impossible to address.
Seems right to me.