It seems to me that no matter how many problems from different research agendas we solve, there is always the possibility that some ‘unknown unknown’ misalignment scenario can occur. I can imagine an approach of building model-agnostic, environment-agnostic, minimal-assumption alignment guarantees (which seems extremely hard), but I feel like things could still go wrong in myriad other ways, even then.
Has there been any discussion about how we might go about this?
Donald Hobson gives a comment below explaining some reasoning around dealing with unknown unknowns, but it’s not a direct answer to the question, so I’ll offer one.
The short answer is “yes”.
The longer answer is that this is one of the fundamental considerations in approaching AI alignment, and it is why some organizations, like MIRI, have taken an approach that doesn’t drive straight at the object-level problem and instead tackles issues likely to be foundational to any approach to alignment that could work. In fact, you might say the big schism between MIRI and, say, OpenAI is that MIRI places greater emphasis on addressing the unknown, whereas OpenAI expects alignment to look more like an engineering problem with relatively small and not especially dangerous unknown unknowns.
(Note: I am not affiliated with either organization, so this is an informed opinion on their general approaches. Also note that neither organization is monolithic, and individual researchers vary greatly in their assessment of these risks.)
My own efforts on AI alignment are largely about addressing these sorts of questions, because I think we still poorly understand what alignment even really means. In this sense I know that there is a lot we don’t know, but I don’t know all of what we don’t know that we’ll need to (so, known unknown unknowns).