I understand this exchange as Ryan saying “the goals of AGI must be a perfect match to what we want”, Jacob replying “you can’t literally mean perfect, as in not even off by one part per googol, e.g. we bequeath the universe to the next generation despite knowing that they won’t share our values”, and then Ryan doubling down with “Yes, I mean perfect”.
Oh, no, this wasn’t what I meant. I just meant that the usage of children as an example was poor because individual children don’t have the potential to successfully seek vast power.
There certainly is a level of sufficient alignment of a purely consequentialist utility function which looks like 1−ϵ as opposed to 1. I think this ϵ is pretty low, but, I reiterate, only for ‘purely long-run consequentialists’. Note that ϵ must be exceptionally low for this sort of AI not to seek power (assuming that avoiding power seeking is something we want from the utility function at all; perhaps we are fine with power seeking, since the desired consequentialist values, whatever those may be, are locked in).
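To gesture at why that ϵ has to be so small, here is one illustrative formalization (my own sketch; U∗, V, and the Δ quantities are hypothetical, not anything either of us committed to above). Suppose the AI’s long-run utility is a mixture of our intended utility U∗ and some residual term V:

U_AI = (1−ϵ)·U∗ + ϵ·V

If seizing vast power would gain ΔV on the residual term while sacrificing ΔU∗ of intended value, then the AI forgoes power seeking only when

ϵ·ΔV < (1−ϵ)·ΔU∗, i.e. ϵ < ΔU∗ / (ΔU∗ + ΔV),

which is minuscule whenever ΔV is astronomically larger than ΔU∗, as it plausibly is when vast power is on the table.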
If so, I’m with Jacob. For one thing, if we perfectly nail the AGI’s motivation with regard to transparency, honesty, corrigibility, helpfulness, keeping humans in the loop, etc., but we mess up other aspects of the AGI’s motivation, then the AGI should help us identify and fix the problem.
Agreed, but these aren’t consequentialist properties. At least that isn’t how I model them.
I shouldn’t have given such a vague response to the child metaphor.