The charitable interpretation here is that we’ll compute E[harm|action] (or more generally E[utility|action]) using our posterior over hypotheses and then choose which action to execute based on this. (Or at least we’ll pause and refer actions to humans if E[harm|action] is too high.)
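A minimal sketch of that decision rule, purely for illustration (all the names and the finite-hypothesis approximation are my assumptions, not anything from the original discussion): approximate the posterior with a weighted set of hypotheses, average their harm/utility predictions per action, and defer to a human when no candidate action is below the harm threshold.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    weight: float                    # posterior probability of this hypothesis (weights sum to 1)
    harm: Callable[[str], float]     # predicted harm of an action under this hypothesis
    utility: Callable[[str], float]  # predicted utility of an action under this hypothesis


def expected_harm(action: str, posterior: List[Hypothesis]) -> float:
    # E[harm | action], averaged over the posterior.
    return sum(h.weight * h.harm(action) for h in posterior)


def expected_utility(action: str, posterior: List[Hypothesis]) -> float:
    # E[utility | action], averaged over the posterior.
    return sum(h.weight * h.utility(action) for h in posterior)


def choose_action(actions: List[str], posterior: List[Hypothesis],
                  harm_threshold: float) -> str:
    # Keep only actions whose expected harm is acceptably low.
    safe = [a for a in actions if expected_harm(a, posterior) <= harm_threshold]
    if not safe:
        # No acceptable action: pause and refer to humans.
        return "refer-to-human"
    # Otherwise take the acceptable action with the highest expected utility.
    return max(safe, key=lambda a: expected_utility(a, posterior))
```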
I think “ruling out the possibility” isn’t really a good frame for thinking about this; it’s much more natural to just think about this as an estimation procedure which is trying hard to avoid overconfidence in out-of-distribution contexts (or more generally, in contexts where the training data doesn’t pin down predictions well enough).
ETA: realistically, I think this isn’t going to perform better than ensembling in practice.
See also the discussion here.