If an AI somehow (implicitly, in practice) kept track of all the plausible H’s, i.e., those with high probability under P(H | D), then there would be a perfectly safe way to act: if any of the plausible hypotheses predicted that some action caused a major harm (like the death of humans), then the AI should not choose that action. Indeed, if the correct hypothesis H* predicts harm, it means that some plausible H predicts harm. Showing that no such H exists therefore rules out the possibility that this action yields harm, and the AI can safely execute it.
Feels like this would paralyze the AI and make it useless?
The charitable interpretation here is that we'll compute E[harm|action] (or more generally E[utility|action]) using our posterior over hypotheses and then choose which action to execute based on this. (Or at least we'll pause and refer actions to humans if E[harm|action] is too high.)
I think “ruling out the possibility” isn’t really a good frame for thinking about this; it's much more natural to think of it as an estimation procedure that tries hard to avoid overconfidence in out-of-distribution contexts (or more generally, in contexts where the training data don’t pin down predictions well enough).
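To make the contrast concrete, here's a minimal sketch of the two decision rules, assuming a toy discrete posterior over a handful of hypotheses and per-hypothesis harm predictions (all variable names, thresholds, and numbers are hypothetical, purely for illustration):

```python
import numpy as np

# posterior[i] -- approximate posterior weight P(H_i | D) of hypothesis i
# harm_prob[i] -- probability of major harm that hypothesis i assigns
#                 to the candidate action
posterior = np.array([0.70, 0.25, 0.04, 0.01])
harm_prob = np.array([0.001, 0.002, 0.300, 0.900])

# "Rule out the possibility" (the quoted proposal): refuse the action if any
# sufficiently plausible hypothesis predicts harm above some tolerance.
PLAUSIBLE = 0.02        # posterior mass above which a hypothesis counts as plausible
HARM_TOLERANCE = 0.05   # per-hypothesis harm probability we're willing to accept
conservative_ok = np.all(harm_prob[posterior > PLAUSIBLE] <= HARM_TOLERANCE)

# Expected-harm framing (the "charitable interpretation"): act on
# E[harm | action] under the posterior, pausing / referring to humans
# if it is too high.
THRESHOLD = 0.01
expected_harm = float(posterior @ harm_prob)
estimation_ok = expected_harm <= THRESHOLD

print(f"conservative rule allows action: {conservative_ok}")
print(f"E[harm|action] = {expected_harm:.4f}; threshold rule allows action: {estimation_ok}")
```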
ETA: realistically, I think this isn’t going to perform better than ensembling in practice.
See discussion here also.