I read that sequence a couple months ago (in preparation for writing §2.7 here), and found it helpful, thanks.
> To give some quotes from that…
I agree that we’re probably on basically the same page.
> So, it seems like we don’t want “perfect inner alignment,”
FYI Alex also has this post making a similar point.
> Idk, the whole thing seems to me like brewing a potion in Harry Potter
I think I agree, in that I’m somewhat pessimistic about plans wherein we want the “adult AI” to have object-level goal X, and so we find a reward function and training environment where that winds up happening.
Not that such a plan would definitely fail (e.g. lots of human adults do wind up trying to take care of their children), just that it doesn’t seem like the kind of approach that passes the higher bar of giving us a strong reason to expect success (e.g. lots of human adults wind up not trying to take care of their children). (See here for someone trying to flesh out this kind of approach.)
So anyway, my take right now is basically:
- If we want the “adult AGI” to be trying to do a particular thing (‘make nanobots’, or ‘be helpful towards its supervisor’, or whatever), we should replace (or at least supplement) a well-chosen reward function with a more interpretability-based approach; for example, see Plan for mediocre alignment of brain-like [model-based RL] AGI (which is a simplified version of Post 14 of this series). (A toy sketch of what I mean follows below.)
- Or we can have a relation to AGIs similar to the one we have to the next generation of humans: we don’t know exactly, at the object level, what they will be trying to do and why, but they basically have “good hearts”, and so we trust their judgment.
These two bullet points correspond to the “two paths forward” of Post 12 of this series.
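To gesture at what I mean by the first bullet, here’s a minimal toy sketch. To be clear, this is my own illustration rather than the actual proposal in that post, and every name in it (`world_model`, `locate_concept`, the averaging “probe”) is a hypothetical stand-in. The idea is just: instead of hoping the reward function indirectly sculpts the right goal during training, you locate the target concept inside the learned world model via interpretability-style probing, and then wire the value function to that concept directly.

```python
import numpy as np

# Toy sketch (illustration only, NOT the actual plan): a model-based RL agent
# whose learned world model maps observations to latent states. We (1) locate
# the latent direction corresponding to a target concept, then (2) define the
# value function directly on that direction.

rng = np.random.default_rng(0)
LATENT_DIM = 32

# Stand-in for a learned world model: observation -> latent state.
W_world = rng.normal(size=(LATENT_DIM, LATENT_DIM))

def world_model(obs: np.ndarray) -> np.ndarray:
    return np.tanh(W_world @ obs)

# Step 1 (interpretability stand-in): estimate the latent direction for the
# target concept from hand-labeled example observations, here by averaging.
def locate_concept(examples: list[np.ndarray]) -> np.ndarray:
    latents = np.stack([world_model(o) for o in examples])
    direction = latents.mean(axis=0)
    return direction / np.linalg.norm(direction)

# Step 2: score states by the located concept directly, rather than hoping a
# reward function sculpts this evaluation indirectly during training.
def make_value_fn(concept_direction: np.ndarray):
    def value(obs: np.ndarray) -> float:
        return float(world_model(obs) @ concept_direction)
    return value

# Usage: probe with "concept present" observations, then score new states.
concept_examples = [rng.normal(size=LATENT_DIM) + 2.0 for _ in range(16)]
value_fn = make_value_fn(locate_concept(concept_examples))
print(value_fn(rng.normal(size=LATENT_DIM) + 2.0))  # high-ish score
print(value_fn(rng.normal(size=LATENT_DIM) - 2.0))  # low-ish score
```

Obviously the hard part in reality is step 1 (actually finding the right concept with interpretability tools), which this toy version just waves at with labeled examples.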