I think you have a good point, in that the VNM utility theorem is often overused/abused: I don’t think it’s clear how to frame a potentially self-modifying agent in reality as a preference ordering on lotteries, and even if you could in theory do so, it might require such a granular set of outcomes as to make the resulting utility function not super interesting. (I’d very much appreciate arguments for taking VNM more seriously in this context; I’ve been pretty frustrated about this.)
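(To make concrete what “a preference ordering on lotteries” would buy you, here’s a rough statement of the VNM theorem, glossing over the exact axioms: if a preference relation over lotteries is complete, transitive, continuous, and satisfies independence, then there is a utility function u, unique up to positive affine transformation, with

$$L \succeq M \iff \mathbb{E}_{x \sim L}[u(x)] \ge \mathbb{E}_{x \sim M}[u(x)].$$

The worry above is that for a self-modifying agent embedded in the real world, the outcome space over which u is defined may have to be so fine-grained that the representation tells you very little.)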
That said, I think instrumental convergence is indeed a problem for real-world searches; the things we’re classifying as “instrumentally convergent goals” are just “things that are helpful for a large class of problems.” It turns out there are ways to do better than random search in general, and that some of these ways (the most general ones) make use of the things we’re calling “instrumentally convergent goals”: AlphaGo Zero was not a (uniformish) random search on Go programs, and humans were not a (uniformish) random search on creatures. So I don’t think this particular line of thought should make you think potential AI is less of a problem.
> AlphaGo Zero was not a (uniformish) random search on Go programs, and humans were not a (uniformish) random search on creatures.
I’d classify both of those as random programs though. AlphaZero is a random program from the set of programs that are good at playing Go (and that satisfy some structure set by the creators). Humans are random machines from the set of machines that are good at not dying. The searches aren’t uniform, of course, but they are not intentional enough for it to matter.
In particular, AlphaZero was not selected in such a way that exhibiting instrumental convergence would benefit it, and therefore it most likely does not exhibit instrumental convergence. Suppose there was a random modification to AlphaZero that would make it try to get more computational resources, and that this modification was actually made during training. The modified version would play against the original; since the modification would not actually help it win in the simulated environment, the modified version would most likely lose and be discarded. If the modified version did end up winning, it was purely by chance.
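(A toy sketch of the selection dynamic I have in mind, not the actual AlphaZero training procedure; Agent, play_match, the 0.05 mutation rate, and the skill numbers are all made up for illustration. The point is just that the evaluation only looks at simulated game outcomes, so a variant that diverts effort toward “getting more resources” has no edge there.)

```python
import random

class Agent:
    """Toy agent: 'skill' is all that matters in the simulated games;
    'resource_seeking' stands for effort spent on anything outside the game."""
    def __init__(self, skill, resource_seeking=0.0):
        self.skill = skill
        self.resource_seeking = resource_seeking

def play_match(a, b, games=400):
    """True if agent a beats agent b in a majority of simulated games.
    Only in-game skill enters; resource_seeking has no effect here."""
    p = a.skill / (a.skill + b.skill)
    wins = sum(random.random() < p for _ in range(games))
    return wins > games / 2

def train(steps=1000):
    best = Agent(skill=1.0)
    for _ in range(steps):
        if random.random() < 0.05:
            # Occasional mutation that diverts effort into "getting more
            # computational resources": it costs a little in-game strength
            # and buys nothing the evaluation can see.
            challenger = Agent(best.skill * 0.9, resource_seeking=1.0)
        else:
            # Ordinary mutation: a small random change to in-game skill.
            challenger = Agent(best.skill * (1 + random.gauss(0, 0.02)))
        # A challenger is kept only if it wins the simulated evaluation,
        # so the resource-seeking variant is almost always discarded.
        if play_match(challenger, best):
            best = challenger
    return best

if __name__ == "__main__":
    final = train()
    print(f"skill={final.skill:.3f}  resource_seeking={final.resource_seeking}")
```

(On a typical run the surviving agent ends with resource_seeking = 0.0; the selection pressure never sees anything outside the game, which is the sense in which such a modification could only survive by chance.)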
The case of humans is more complicated, since the “training” does reward self-preservation. Curiously, this self-preservation seems to be its own goal, and not a subgoal of some other desire, as instrumental convergence would predict. Also, human self-preservation only works in a narrow sense. You run from a tiger, but you don’t always appreciate long-term and low-probability threats, presumably because you were not selected to appreciate them. I suspect that concern for these non-urgent threats does not correlate strongly with IQ, unlike what instrumental convergence would predict.
I was definitely very confused when writing the part you quoted. I think the underlying thought was that the processes of writing humans and of writing AlphaZero are very non-random; i.e., even if there’s a random number generated in some sense somewhere as part of the process, there are other things going on that are highly constraining the search space, and those processes are making use of “instrumental convergence” (stored resources, intelligence, putting the hard drives in safe locations). Then I can understand your claim as “instrumental convergence may occur in guiding the search for/construction of an agent, but there’s no reason to believe that agent will then do instrumentally convergent things.” I think that’s not true in general, but it would take more words to defend.