Let’s go a little meta.
It seems clear that an agent that “maximizes utility” exhibits instrumental convergence. I think we can state a stronger claim: any agent that “plans to reach imagined futures”, with some implicit “preferences over futures”, exhibits instrumental convergence.
The question, then, is how much you can weaken the constraint “looks like a utility maximizer” before instrumental convergence breaks. Where is the point between “formless program” and “selects preferred imagined futures” at which instrumental convergence starts or stops applying?
---
This moves in the direction of working out exactly which components of utility-maximizing behaviour are necessary for instrumental convergence. (Personally, I think you might only need to assume “backchaining”; see the sketch below.)
So, I’m curious: What do you think a minimal set of necessary pieces might be before a program is close enough to “goal-directed” for instrumental convergence to apply?
This might be a difficult question to answer, but it’s probably a good way to understand why instrumental convergence feels so real to other people.
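To make “backchaining” a little more concrete, here is a toy sketch (all names in it, like `Action` and `backchain`, are made up for illustration, not any particular planning formalism): start from a goal, find an action whose effects achieve it, and adopt that action’s preconditions as new subgoals. The way an instrumental subgoal like “get resources” falls out of a terminal goal like “win” is, I think, the core of the instrumental-convergence intuition.

```python
# Toy backward-chaining planner. Everything here is illustrative.
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset
    effects: frozenset


def backchain(goals, state, actions):
    """Return (plan, state) that achieves every goal in `goals`, or None."""
    plan = []
    for goal in goals:
        if goal in state:
            continue
        for act in actions:
            if goal in act.effects:
                # The key step: adopt the action's preconditions as subgoals.
                sub = backchain(act.preconditions, state, actions)
                if sub is None:
                    continue
                subplan, state = sub
                plan += subplan + [act.name]
                state = state | act.effects
                break
        else:
            return None  # no action achieves this goal
    return plan, state


# Tiny example: the terminal goal "won" backchains into the instrumental
# subgoal "resources".
ACTIONS = [
    Action("acquire_resources", frozenset(), frozenset({"resources"})),
    Action("win", frozenset({"resources"}), frozenset({"won"})),
]
print(backchain(frozenset({"won"}), frozenset(), ACTIONS)[0])
# -> ['acquire_resources', 'win']
```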
> any agent that “plans to reach imagined futures”, with some implicit “preferences over futures”, exhibits instrumental convergence.
Actually, I don’t think this is true. For example, take a chess-playing program which imagines winning and searches for strategies to reach that goal. Instrumental convergence would predict that the program would resist being turned off, try to acquire more computational resources, or try to drug or hack its opponent to make them weaker. However, the planning process could easily be restricted to chess moves, in which case none of these strategies would ever be found, and the program would not exhibit instrumental convergence. This sort of “safety measure” isn’t very reliable, especially when we’re dealing with the real world rather than a game. Still, it is possible for an agent to be a utility maximizer, or to have utility-maximizing subroutines, and not exhibit instrumental convergence.
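To make the example concrete, here is a rough sketch of such a program, written with the python-chess library and a deliberately crude material evaluation. It imagines future board states and prefers the ones where it wins, but the only actions it can even represent are legal chess moves, so strategies like “resist shutdown” or “grab more compute” are simply absent from its search space.

```python
# Minimal "consequentialist" chess planner: imagines futures, prefers wins,
# but its whole action space is board.legal_moves. Requires python-chess.
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}


def evaluate(board: chess.Board) -> float:
    """Crude material count from the perspective of the side to move."""
    score = 0.0
    for piece in board.piece_map().values():
        value = PIECE_VALUES[piece.piece_type]
        score += value if piece.color == board.turn else -value
    return score


def search(board: chess.Board, depth: int) -> float:
    """Negamax over legal chess moves only; nothing outside the game exists."""
    if board.is_checkmate():
        return -1000.0          # the side to move has been mated
    if board.is_game_over():
        return 0.0              # stalemate and other draws
    if depth == 0:
        return evaluate(board)
    best = float("-inf")
    for move in board.legal_moves:   # the planner's entire "ontology" of actions
        board.push(move)
        best = max(best, -search(board, depth - 1))
        board.pop()
    return best


def best_move(board: chess.Board, depth: int = 2) -> chess.Move:
    """Pick the legal move leading to the most preferred imagined future."""
    best, best_score = None, float("-inf")
    for move in board.legal_moves:
        board.push(move)
        score = -search(board, depth - 1)
        board.pop()
        if score > best_score:
            best, best_score = move, score
    return best


print(best_move(chess.Board()))   # prints whichever legal move the toy search ranks highest
```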
There is another, more philosophical question of what is and isn’t a preference over futures. I believe that there can be a brilliant chess player that does not actually prefer winning to losing. But the relevant terms are vague enough that it’s a pain to think about them.
I think proponents of the instrumental convergence thesis would expect a consequentialist chess program to exhibit instrumental convergence within the domain of chess. So if there were some (chess-related) subplan that was useful in many other (chess-related) plans, we would see the program execute that subplan a lot. The important difference is that the chess program uses an ontology of chess, while unsafe programs use an ontology of nature.
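One rough way to see what I mean empirically is to reuse the toy planner sketched above (so the same caveats apply, and the motif categories below are arbitrary illustrations): run it from many scrambled positions and tally which kinds of move it keeps choosing. Motifs that show up from most positions, such as captures, are the chess-ontology analogue of a subplan that is useful in lots of plans.

```python
# Rough experiment: tally recurring move types ("motifs") the toy planner
# keeps choosing across quasi-random positions. Assumes the best_move sketch
# above is in scope.
import random
from collections import Counter

import chess


def convergent_motifs(num_positions: int = 30, random_plies: int = 10) -> Counter:
    motifs = Counter()
    for _ in range(num_positions):
        board = chess.Board()
        for _ in range(random_plies):          # scramble into a random position
            if board.is_game_over():
                break
            board.push(random.choice(list(board.legal_moves)))
        if board.is_game_over():
            continue
        move = best_move(board, depth=2)
        motifs["capture" if board.is_capture(move) else "quiet move"] += 1
        board.push(move)
        if board.is_check():
            motifs["gives check"] += 1
    return motifs


print(convergent_motifs())   # prints a Counter of motif tallies
```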
First, Nick Bostrom has an example in which a machine built to solve the Riemann hypothesis converts the Earth into computronium. I imagine he’d predict the same for a chess program, regardless of what ontology it uses.
Second, if instrumental convergence were that easy to solve (convergence confined to the domain of chess is harmless), it wouldn’t really be an interesting problem.