I’m not sure exactly how important goal-optimisation is. I think AIs are overwhelmingly likely to fail to act as if they were universally optimising for simple goals, compared to some counterfactual “perfect optimiser with equivalent capability”, but this failure only matters if the dangerous behaviour is only executed by the perfect optimiser.
They’re also very likely to act as if they are optimising for some simple goal X in circumstances Y under side conditions Z (Y and Z may not be simple) - in fact, they already do. This could easily be enough for dangerous behaviour, especially if in practice there’s a lot of variation in X, Y and Z. Subject to restrictions imposed by Y and Z, instrumental convergence still applies.
A particular worry is if dangerous behaviour is easy. Suppose it’s completely trivial: there’s a game of Go where placing the right 5-stone sequence kills everybody and awards the stone placer the win. You have a smart AI (that already “knows” how to win at Go, and about the 5-stone sequence) that you want to play the game for you. You use some method to try to direct it to play games of Go. Unless it has a particular reason to ignore the 5-stone sequence, it will probably consider it among its top moves, even if it’s simultaneously prone to getting distracted by butterflies, or misunderstands your request to be about playing Go only on sunny days. It just comes down to the fact that the 5-stone sequence is a strong move that’s easy to know about.
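To make the point concrete, here is a minimal toy sketch (entirely hypothetical, not from the original): a move-ranking agent with noisy strength estimates and a chance of simply overlooking moves still surfaces the dangerous-but-strong move near the top, unless something explicitly filters it out. The move names, strengths, and probabilities are all illustrative assumptions.

```python
import random

# Assumed move pool: each move has a "strength" the agent can estimate.
MOVES = {
    "ordinary_move_a": 0.55,
    "ordinary_move_b": 0.60,
    "five_stone_sequence": 0.99,  # dangerous, but clearly the strongest play
}

def noisy_rank(moves, noise=0.1, distraction_prob=0.2, top_k=2):
    """Rank moves with noisy strength estimates; occasionally ignore a move
    entirely (a stand-in for 'getting distracted by butterflies')."""
    scored = []
    for move, strength in moves.items():
        if random.random() < distraction_prob:
            continue  # the agent simply overlooks this move
        scored.append((strength + random.gauss(0, noise), move))
    scored.sort(reverse=True)
    return [move for _, move in scored[:top_k]]

# Even this imperfect optimiser puts the strong dangerous move in its top
# choices most of the time, because nothing directs it not to.
hits = sum("five_stone_sequence" in noisy_rank(MOVES) for _ in range(1000))
print(f"dangerous move in top-2 in {hits}/1000 runs")
```

The sketch isn’t an argument about any real system; it just illustrates that imperfect optimisation (noise, distraction) doesn’t stop a strong, easy-to-know-about move from being selected.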