I really like this. Ever since I read your first model splintering post, it’s been a central part of my thinking too.
I feel cautiously optimistic about the prospects for generating multiple hypotheses and detecting when they come into conflict out-of-distribution (although the details are kinda different for the Bayes-net-ish models that I tend to think about than for the deep neural net models that I understand y’all are thinking about).
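(As a toy illustration of what I mean by "detecting conflict", here's a minimal sketch in the supervised-learning setting, assuming a scikit-learn classification setup; it's not a proposal for how to do this in an AGI, and all the names are just illustrative.)

```python
# Toy sketch: "multiple hypotheses" = an ensemble of different model families
# trained on the same data; "conflict out-of-distribution" = inputs where
# their predictions diverge. Everything here is illustrative, not from the post.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier


def train_hypotheses(X_train, y_train):
    """Fit several different model families ("hypotheses") on the same data."""
    models = [
        LogisticRegression(max_iter=1000),
        RandomForestClassifier(n_estimators=100),
        KNeighborsClassifier(n_neighbors=5),
    ]
    for m in models:
        m.fit(X_train, y_train)
    return models


def conflict_score(models, X):
    """Fraction of model pairs that disagree on each input.

    High scores suggest the input lies outside the region where the
    hypotheses were interchangeable -- a candidate "splintering" point.
    """
    preds = np.stack([m.predict(X) for m in models])  # shape (n_models, n_samples)
    n = len(models)
    n_pairs = n * (n - 1) / 2
    disagreements = sum(
        preds[i] != preds[j] for i in range(n) for j in range(i + 1, n)
    )
    return disagreements / n_pairs


def flag_conflicts(models, X, threshold=0.5):
    """Return indices of inputs where the hypotheses conflict with each other."""
    return np.where(conflict_score(models, X) >= threshold)[0]
```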
I remain much more confused about what to do when that detector goes off, in a future AGI.
I imagine a situation where some inscrutably complicated abstract concept in the AGI’s world-model comes apart / splinters away from a different inscrutably complicated abstract concept in the AGI’s world-model. OK, what do we do now?
Ideally the AGI would query the human about what to do. But I don’t know how one might write code that does that. It’s not like an image classifier where you can just print out some pictures and show them to the person.
(One solution would be: leverage the AGI’s own intelligence for how to query the human, by making the AGI motivated to learn what the human in fact wants in that ambiguous situation. But then, for that aspect of the AGI’s motivation itself, it stops being an independent safety feature.)