I was recently part of a group-chat where some people I largely respect were musing about this paper and this post (and some of Scott Aaronson’s recent “maybe intelligence makes things more good” type reasoning).
Here’s my replies, which seemed worth putting somewhere public:
The claims in the paper seem wrong to me as stated, and in particular seem to conflate values with instrumental subgoals. One does not need to terminally value survival to avoid getting hit by a truck while fetching coffee; one could simply understand that one can’t fetch the coffee when one is dead.
And then in reply to someone pointing out that the paper was perhaps trying to argue that most minds tend to wind up with similar values because of the fact that all minds are (in some sense) rewarded in training for developing similar drives:
So one hypothesis is that in practice, all practically-trainable minds manage to survive by dint of a human-esque survival instinct (while admitting that manually-engineered minds could survive some other way, e.g. by simply correctly modeling the consequences).
This mostly seems to me like people writing sci-fi in which the aliens are all humanoid; it is a hypothesis about tight clustering of cognitive drives even across very disparate paradigms (optimizing genomes is very different from optimizing every neuron directly).
But a deeper objection I have here is that I’d be much more comfortable with people slinging this sort of hypothesis around if they were owning the fact that it’s a hypothesis about tight clustering and non-alienness of all minds, while stating plainly that they think we should bet the universe on this intuition (despite how many times the universe has slapped us for believing anthropocentrism in the past).
FWIW, some reasons that I don’t myself buy this hypothesis include:
(a) the specifics of various human drives seem to me to be very sensitive to the particulars of our ancestry (ex: empathy seems likely to be a shortcut for modeling others by repurposing machinery for modeling the self (or vice versa), a shortcut that is likely not found by hillclimbing when the architecture of the self is very different from the architecture of the other);
(b) my guess is that the pressures are just very different for different search processes (genetic recombination of DNA vs SGD on all weights); and
(c) it looks to me like value is fragile, such that even if the drives were kinda close, I don’t expect the obtainable optimum to be good according to our lights
(esp. given that the question is not just which drives the AI gets, but the reflective equilibrium of those drives: small changes to the initial drives can produce large changes in the reflective equilibrium, and I suspect they do).
See also instrumental convergence.