Thanks for writing this up! I’ve been considering writing something in response to “AI is easy to control” for a while now, in particular arguing against their claim that “If we could observe and modify everything that’s going on in a human brain, we’d be able to use optimization algorithms to calculate the precise modifications to the synaptic weights which would cause a desired change in behavior.” I think Section 4 does a good job of explaining why this probably isn’t true, with the basic problem being that the space of behaviors consistent with the training data is larger than the space of behaviors you might “desire.”
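To make that underdetermination point concrete, here’s a toy sketch (my own example, not from the post): two functions that agree on every training point but diverge off-distribution, so the training data alone can’t pin down which behavior you end up with.

```python
# Toy illustration (hypothetical example): two "models" that fit the
# training set exactly but generalize differently, so the data alone
# underdetermines which behavior you get.

train_x = [0, 1, 2, 3]
train_y = [0, 1, 4, 9]  # consistent with x**2 on the training set

def model_a(x):
    # the behavior we "desire"
    return x ** 2

def model_b(x):
    # also fits every training point, but does something else off-distribution
    return x ** 2 if x <= 3 else -1

# Both achieve zero training error...
assert all(model_a(x) == y for x, y in zip(train_x, train_y))
assert all(model_b(x) == y for x, y in zip(train_x, train_y))

# ...yet disagree outside the training distribution.
print(model_a(10), model_b(10))  # 100 vs -1
```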
Like, sure, if you have a mapping from synapses to desired behavior, okay—but the key word there is “desired,” and at that point you’re basically just describing having solved mechanistic interpretability. In the absence of knowing exactly how synapses/weights/etc map onto the desired behavior, you have to rely on the behavior in the training set to convey the right information. But a) it’s hard to know that the desired behavior is “in” the training set in a very robust way, and b) even if it were, you might still run into problems like deception, failure to generalize to out-of-distribution data, etc. Anyway, thanks for doing such a thorough write-up of it :)