I broadly agree with a lot of shard theory claims. However, the important thing to realise is that ‘human values’ do not really come from inner misalignment with respect to our innate reward circuitry. Rather, they are the result of a very long process of social construction, influenced both by our innate drives and by the game-theoretic social considerations needed to create and maintain large social groups. These value constructs have been distilled into webs of linguistic associations, learnt through unsupervised text-prediction-like objectives, which is how we practically interact with our values. Most human value learning occurs through this linguistic learning, grounded in our innate drives but extended to much higher abstractions by language. That is, we learn our values as some combination of bottom-up evaluation (how well the internal reward evaluators in our basal ganglia and hypothalamus accord with the top-down socially constructed values) and top-down association of abstract value concepts with other, more grounded linguistic concepts.
With AGI, the key will be to work primarily top-down, since our linguistic constructs of values reflect our ideal values much better than our actually realised behaviours do. The idea is to use the AGI’s ‘linguistic cortex’, which already encodes verbal knowledge about human morality and values, to evaluate potential courses of action and to serve as a reward signal, which can then get crystallised into learnt policies. The key difficulty is understanding how, in humans, the base reward functions interact with behaviour to make us ‘truly want’ specific outcomes (if humans even do), as opposed to wanting reward or its correlated social assessments. It is possible, even likely, that this is just the default outcome of model-free RL experienced from the inside, in which case our AGIs would look highly anthropomorphic.
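As a toy sketch of what I have in mind (purely illustrative: the scoring stub, prompt, and update rule below are made up, and a real version would query an actual language model), the point is just that the model’s verbal evaluation of candidate actions can act as a reward that gets distilled into the policy:

```python
# Illustrative sketch only: use a language model's verbal knowledge of human
# values as a reward signal over candidate actions, then "crystallise" that
# evaluation into a policy via simple reinforcement-style updates.
# `lm_value_score` is a hypothetical stand-in for a real LM call.

import random
from typing import Dict, List


def lm_value_score(action_description: str, values_prompt: str) -> float:
    """Stand-in for asking an LM: 'How well does this action accord with the
    stated values?' Returns a score in [0, 1]."""
    # A real implementation would call a language model here; this stub just
    # rewards actions mentioning 'help', purely for illustration.
    return 1.0 if "help" in action_description else 0.1


VALUES_PROMPT = "Act honestly, avoid harm, and be helpful to humans."


def choose_action(candidates: List[str], preferences: Dict[str, float]) -> str:
    # Toy policy: sample actions in proportion to learned preference weights.
    weights = [max(preferences.get(a, 1.0), 1e-6) for a in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]


def train(candidates: List[str], steps: int = 100, lr: float = 0.5) -> Dict[str, float]:
    preferences: Dict[str, float] = {}
    for _ in range(steps):
        action = choose_action(candidates, preferences)
        reward = lm_value_score(action, VALUES_PROMPT)  # LM-as-reward-model
        # Crystallise the evaluation into the policy's preference weights.
        preferences[action] = preferences.get(action, 1.0) + lr * reward
    return preferences


if __name__ == "__main__":
    actions = ["help the user debug their code", "ignore the request", "deceive the user"]
    print(train(actions))
```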
Also, in general I disagree that aligning agents to evaluations of plans is unnecessary. What you are describing here is just direct optimization. But direct optimization, i.e. effectively planning over a world model, is necessary in situations where (a) you can’t behaviourally clone existing behaviour and (b) you can’t self-play enough with model-free RL algorithms and so must rely on the world model. In such a scenario you do not have ground-truth reward signals, and the only way to make progress is to optimise against some implicit learnt reward function.
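To make ‘direct optimization’ concrete, here is a toy sketch (my own illustration, with made-up stand-ins for the learnt dynamics and reward models) of planning over a world model by scoring imagined rollouts with a learnt reward function, rather than cloning behaviour or running model-free RL against a ground-truth reward:

```python
# Toy illustration of direct optimization: random-shooting planning over a
# (learned) world model, scoring imagined rollouts with a learnt reward model.
# The linear `world_model` and quadratic `reward_model` are arbitrary stand-ins.

import numpy as np

rng = np.random.default_rng(0)


def world_model(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    # Stand-in learned dynamics: next_state = f(state, action).
    return 0.9 * state + 0.1 * action


def reward_model(state: np.ndarray) -> float:
    # Stand-in implicit learnt reward: prefer states near a target.
    target = np.ones_like(state)
    return -float(np.sum((state - target) ** 2))


def plan(state: np.ndarray, horizon: int = 10, n_candidates: int = 256) -> np.ndarray:
    """Sample action sequences, roll them out in the world model, score them
    with the reward model, and return the first action of the best sequence."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.normal(size=(horizon, state.shape[0]))
        s, total = state.copy(), 0.0
        for a in actions:
            s = world_model(s, a)
            total += reward_model(s)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action


if __name__ == "__main__":
    state = np.zeros(3)
    for t in range(5):
        action = plan(state)
        state = world_model(state, action)
        print(t, reward_model(state))
```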
I am also not sure that an agent that explicitly optimises in this way is hard to align; the major threat is Goodharting. We can perfectly align Go-playing AIs with this scheme because we have an exact ground-truth reward function. Goodharting is essentially isomorphic to overfitting and can in theory be addressed with various kinds of regularisation; in particular, if the AI maintains a well-calibrated sense of its reward-function uncertainty, then in theory we can derive quantitative bounds on its divergence from the true reward function.
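To gesture at the kind of regularisation I mean, here is a toy sketch (again purely illustrative, not a worked-out proposal): instead of maximising a single learnt proxy reward, maximise a pessimistic lower confidence bound over an ensemble of proxies, so that plans the ensemble disagrees on get penalised:

```python
# Toy illustration of treating Goodharting like overfitting and regularising
# it with reward uncertainty: optimise a pessimistic lower confidence bound
# (mean minus k * std) over an ensemble of imperfect proxy rewards, instead of
# a single proxy. The ensemble and candidate set are arbitrary stand-ins.

import numpy as np

rng = np.random.default_rng(1)


def true_reward(x: np.ndarray) -> float:
    return -float(np.sum((x - 1.0) ** 2))


def make_proxy(noise_scale: float = 0.3):
    # Each proxy is the true reward plus a random linear error term,
    # standing in for an imperfectly learnt reward model.
    w = rng.normal(scale=noise_scale, size=3)
    return lambda x: true_reward(x) + float(w @ x)


ensemble = [make_proxy() for _ in range(8)]


def lcb_reward(x: np.ndarray, k: float = 2.0) -> float:
    """Lower confidence bound: high ensemble disagreement drags the score down."""
    scores = np.array([proxy(x) for proxy in ensemble])
    return float(scores.mean() - k * scores.std())


candidates = rng.normal(scale=3.0, size=(2000, 3))

naive_pick = max(candidates, key=lambda x: ensemble[0](x))  # optimise one proxy
pessimistic_pick = max(candidates, key=lcb_reward)          # optimise the LCB

print("true reward, naive proxy optimisation:", true_reward(naive_pick))
print("true reward, uncertainty-penalised:   ", true_reward(pessimistic_pick))
```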
> the important thing to realise is that ‘human values’ do not really come from inner misalignment with respect to our innate reward circuitry. Rather, they are the result of a very long process of social construction, influenced both by our innate drives and by the game-theoretic social considerations needed to create and maintain large social groups. These value constructs have been distilled into webs of linguistic associations, learnt through unsupervised text-prediction-like objectives, which is how we practically interact with our values.
>
> Most human value learning occurs through this linguistic learning, grounded in our innate drives but extended to much higher abstractions by language. That is, we learn our values as some combination of bottom-up evaluation (how well the internal reward evaluators in our basal ganglia and hypothalamus accord with the top-down socially constructed values) and top-down association of abstract value concepts with other, more grounded linguistic concepts.
Can you give me some examples here? I don’t know that I follow what you’re pointing at.
> Also, in general I disagree that aligning agents to evaluations of plans is unnecessary. What you are describing here is just direct optimization. But direct optimization, i.e. effectively planning over a world model […]
FWIW I don’t consider myself to be arguing against planning over a world model.