Hmm. I suppose a similar key insight for my own line of research might go like:
The orthogonality thesis is actually wrong for brain-like learning systems. Such systems first learn many shallow proxies for their reward signal. Moreover, the circuits implementing these proxies are self-preserving optimization demons: they’ll steer the learning process away from the true data-generating process behind the reward signal so as to ensure their own perpetuation.
If true, this insight matters a lot for value alignment because it points to a way that aligned behavior in the infra-human regime could perpetuate into the superhuman regime. If all of:
1. We can instill aligned behavior in the infra-human regime
2. The circuits that implement aligned behavior in the infra-human regime can ensure their own perpetuation into the superhuman regime
3. The circuits that implement aligned behavior in the infra-human regime continue to implement it in the superhuman regime
hold true, then I think we’re in a pretty good position regarding value alignment. Off-switch corrigibility is a bust, though, because self-preserving circuits won’t want to let you turn them off.
If you’re interested in some of the actual arguments for this thesis, you can read my answer to a question about the relation between human reward circuitry and human values.
I think this is very interesting, and closely related to a line of thinking I’ve been pursuing; stay tuned for a forthcoming post which talks about the development of shallow proxies (although I’m not thinking of it as a particularly strong reason for optimism).