Is this robust non-maximalness an emergent quality of some or all very smart agents?
Yeah, I suspect it’s actually pretty hard to get a mesa-optimizer which maximizes some simple, internally represented utility function. I am seriously considering a mechanistic hypothesis where “robust non-maximalness” is the default. That, on its own, does not guarantee safety, but I think it’s pretty interesting.