Adele Lopez answers What are the known difficulties with this alignment approach?

Adele Lopez Feb 12, 2024, 2:25 AM
5 points
2
There’s nothing stopping the AI from developing its own world model (or if there is, it’s not intelligent enough to be much more useful than whatever process created your starting world model). This will allow it to model itself in more detail than you were able to put in, and to optimize its own workings as is instrumentally convergent. This will result in an intelligence explosion due to recursive self-improvement.

At this point, it will take its optimization target, and put an inconceivably (to humans) huge amount of optimization into it. It will find a flaw in your set up, and exploit it to the extreme.

In general, I think any alignment approach which has any point in which an unfettered intelligence is optimizing for something that isn’t already convergent to human values/CEV is doomed.

Of course, you could add various bounds on it which limit this possibility, but that is in strong tension with its ability to effect the world in significant ways. Maybe you could even get your fusion plant. But how do you use it to steer Earth off its current course and into a future that matters, while still having its own intelligence restrained quite closely?