I wish Eliezer had been clearer on why we can’t produce an AI that internalises human morality with gradient descent. I agree gradient descent is not the same as a combination of evolutionary learning + within-lifetime learning, but it wasn’t clear to me why this meant that no combination of training schedule and/or inductive bias could produce something similar.
There are probably just a few MB (I wouldn’t be surprised if it could be compressed into much less) of information which sets up the brain wiring. Somewhere within that information are the structures/biases that, when exposed to the training data of being a human in our world, give us our altruism (and much else). It’s a hard problem to understand these altruism-forming structures (which are not likely to be distinct things), replicate them in silico, and make them robust even to large power differentials.
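For concreteness, here is the back-of-envelope arithmetic behind an MB-scale figure, sketched in Python. The genome length and bits-per-base numbers are standard, but the wiring fraction and compression ratio below are made-up illustrative assumptions, not estimates.

```python
# Rough sketch of how a "few MB" figure can fall out of genome-level arithmetic.
GENOME_BASE_PAIRS = 3.1e9   # approximate human genome length in base pairs
BITS_PER_BASE_PAIR = 2      # 4 possible bases -> 2 bits of raw information each

raw_bytes = GENOME_BASE_PAIRS * BITS_PER_BASE_PAIR / 8
print(f"raw genome information content: ~{raw_bytes / 1e6:.0f} MB")  # ~775 MB

# Hypothetical fractions, purely to illustrate the shape of the argument:
WIRING_FRACTION = 0.01      # assumed share of the genome relevant to brain wiring
COMPRESSION_RATIO = 0.3     # assumed further compressibility of that share

wiring_bytes = raw_bytes * WIRING_FRACTION * COMPRESSION_RATIO
print(f"wiring-relevant content under these assumptions: ~{wiring_bytes / 1e6:.1f} MB")
```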
On the other hand, the human brain presumably has lots of wiring that pushes it towards selfishness and agenthood, which we can hopefully just not replicate.
Either way, it seems that these altruism-forming structures could in theory be instantiated by the right process of trial and error; the question is whether the error (or misuse) gets us first.
Eliezer expects selfishness not to require any dedicated wiring once you select for a certain level of capability, meaning there’s no permanent buffer to be gained by not implementing selfishness. The margin for error in this model is thus small, and very hard for us to locate without perfect understanding or some enormous process of trial and error.
I agree with this argument for some unknown threshold of capability, but it seems strange to phrase it as an impossibility unless you’re certain that the threshold is low, and even then it’s a big smuggled assumption.
EDIT: Looking back on this comment, I guess it comes down to the crux that systems powerful enough to be relevant to alignment must, by virtue of their power or research capability, be doing strong enough optimisation on some function that we should model them as agents acting to further that goal.
I wish Eliezer had been clearer on why we can’t produce an AI that internalises human morality with gradient descent. I agree gradient descent is not the same as a combination of evolutionary learning + within-lifetime learning, but it wasn’t clear to me why this meant that no combination of training schedule and/or inductive bias could produce something similar.
Yeah agreed, this doesn’t make sense to me.