The claim is that any agent that “plans to reach imagined futures”, with some implicit “preferences over futures”, exhibits instrumental convergence.
Actually, I don’t think this is true. Take, for example, a chess-playing program which imagines winning and searches for strategies to reach that goal. The instrumental convergence thesis would predict that the program resists being turned off, tries to acquire more computational resources, or tries to drug/hack its opponent to make them weaker. However, the planning process could easily be restricted to chess moves, where none of these actions exist to be found, and so the program would not exhibit instrumental convergence (see the sketch below). This sort of “safety measure” isn’t very reliable, especially once we’re dealing with the real world rather than a game. Still, it is possible for an agent to be a utility maximizer, or to have utility-maximizing subroutines, and not exhibit instrumental convergence.
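As a concrete illustration (my own sketch, not something from the original comment), here is a toy minimax planner written against the python-chess library. The library choice and the crude material evaluation are assumptions made purely for the example; the point is architectural: the only actions the search can ever enumerate are `board.legal_moves`, so plans like “prevent shutdown” or “acquire more compute” are not representable in its search space, let alone chosen.

```python
import chess

# Rough material values; the king gets 0 so it never dominates the count.
PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def evaluate(board: chess.Board) -> int:
    """Crude material count from White's point of view."""
    score = 0
    for piece in board.piece_map().values():
        value = PIECE_VALUES[piece.piece_type]
        score += value if piece.color == chess.WHITE else -value
    return score

def plan(board: chess.Board, depth: int):
    """Plain minimax whose action space is exactly the legal chess moves."""
    if depth == 0 or board.is_game_over():
        return evaluate(board), None
    maximizing = board.turn == chess.WHITE
    best_value = float("-inf") if maximizing else float("inf")
    best_move = None
    for move in board.legal_moves:      # chess moves, and nothing else
        board.push(move)
        value, _ = plan(board, depth - 1)
        board.pop()
        better = value > best_value if maximizing else value < best_value
        if better:
            best_value, best_move = value, move
    return best_value, best_move

# Usage: both outputs live entirely within the chess ontology.
# value, move = plan(chess.Board(), depth=3)
```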
There is another, more philosophical question of what is and isn’t a preference over futures. I believe that there can be a brilliant chess player that does not actually prefer winning to losing. But the relevant terms are vague enough that it’s a pain to think about them.
I think proponents of the instrumental convergence thesis would expect a consequentialist chess program to exhibit instrumental convergence in the domain of chess. So if there were some (chess-related) subplan that was useful in lots of other (chess-related) plans, we would see the program execute that subplan a lot (controlling the center or winning material, for instance). The important difference would be that the chess program uses an ontology of chess, while unsafe programs use an ontology of nature.
First, Nick Bostrom has an example in which a Riemann-hypothesis-solving machine converts the Earth into computronium. I imagine he’d predict the same for a chess program, regardless of what ontology it uses.
Second, if instrumental convergence were that easy to solve (since convergence within the domain of chess is harmless), it wouldn’t really be an interesting problem.