If an AI somehow (implicitly, in practice) kept track of all the plausible H’s, i.e., those with high probability under P(H | D), then there would be a perfectly safe way to act: if any of the plausible hypotheses predicted that some action caused a major harm (like the death of humans), then the AI should not choose that action. Indeed, if the correct hypothesis H* predicts harm, it means that some plausible H predicts harm. Showing that no such H exists therefore rules out the possibility that this action yields harm, and the AI can safely execute it.
Feels like this would paralyze the AI and make it useless?
The charitable interpretation here is that we'll compute E[harm|action] (or more generally E[utility|action]) using our posterior over hypotheses and then choose which action to execute based on this. (Or at least we'll pause and refer actions to humans if E[harm|action] is too high.)
I think “ruling out the possibility” isn’t really a good frame for thinking about this; it's much more natural to think of it as an estimation procedure that tries hard to avoid overconfidence in out-of-distribution contexts (or more generally, in contexts where the training data don’t pin down predictions well enough).
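To make the contrast concrete, here's a minimal sketch of the two decision rules, assuming a toy discrete posterior over a handful of hypotheses and per-hypothesis harm predictions (all variable names, thresholds, and numbers are hypothetical, purely for illustration):

```python
import numpy as np

# posterior[i] -- approximate posterior weight P(H_i | D) of hypothesis i
# harm_prob[i] -- probability of major harm that hypothesis i assigns
#                 to the candidate action
posterior = np.array([0.70, 0.25, 0.04, 0.01])
harm_prob = np.array([0.001, 0.002, 0.300, 0.900])

# "Rule out the possibility" (the quoted proposal): refuse the action if any
# sufficiently plausible hypothesis predicts harm above some tolerance.
PLAUSIBLE = 0.02        # posterior mass above which a hypothesis counts as plausible
HARM_TOLERANCE = 0.05   # per-hypothesis harm probability we're willing to accept
conservative_ok = np.all(harm_prob[posterior > PLAUSIBLE] <= HARM_TOLERANCE)

# Expected-harm framing (the "charitable interpretation"): act on
# E[harm | action] under the posterior, pausing / referring to humans
# if it is too high.
THRESHOLD = 0.01
expected_harm = float(posterior @ harm_prob)
estimation_ok = expected_harm <= THRESHOLD

print(f"conservative rule allows action: {conservative_ok}")
print(f"E[harm|action] = {expected_harm:.4f}; threshold rule allows action: {estimation_ok}")
```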
ETA: realistically, I think this isn’t going to perform better than ensembling in practice.
See discussion here also.