I think this is a really great summary. In this comment I’ll react to some bits where I’m either uncertain about or disagree with the content. But note that I was nodding along to the bulk of the post.
Disagreements about what the atomic claims are
RL training processes create actors, not graders
I don’t think I agree with this as stated. Some RL methods explicitly do train some sort of grader, for the sake of policy improvement or some other purpose. And some policy structures (like AlphaZero’s neural network-guided heuristic tree search) use a grader at runtime to help them plan a course of action. Both of those seem like smart ways of using a learned grader as a tool for better decision-making. So I wouldn’t say RL training processes don’t create graders at all.
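To make the "grader as a tool" point concrete, here is a minimal sketch of value-guided one-step lookahead, loosely in the spirit of how AlphaZero-style search consults a learned evaluator at planning time. The names `value_net`, `simulate`, and `legal_actions` are hypothetical placeholders I'm introducing for illustration, not any particular library's API.

```python
def choose_action(state, legal_actions, simulate, value_net):
    """One-step lookahead: score each legal action by the learned evaluation
    of the state it leads to, then act greedily on those scores.
    The learned grader informs the decision; it is not itself the target
    the policy is trying to game."""
    def score(action):
        next_state = simulate(state, action)   # model-based one-step rollout
        return value_net(next_state)           # learned grader evaluates the outcome
    return max(legal_actions(state), key=score)
```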
I would say that RL training processes (including value- and model-based ones) generally don’t create “grader-optimizers”, in the sense of policies whose runtime thoughts are primarily about the grader, rather than about other concepts like “putting pressure on the opponent’s queen (in chess)”. I would also say that RL doesn’t by default create agents that contain graders, and that when they do, it’s generally because the designer explicitly coded the agent so that it contained a grader.
(Naive) consequentialist reasoning might not be convergent in advanced AI
I’m not entirely sure what the claim is here, which I think is similar to how you were feeling, but I vaguely estimate that I would disagree with this when fleshed out. Like, I think that advanced AIs will probably care about outcomes, among other things. This seems especially true because shards will tend to be shaped to steer towards their historic reinforcers, and will do so most effectively when they are sensitive to the causal link between the agent’s choices and those reinforcers, which should lead them to steer on the basis of the agent’s predictions about outcomes.
There is more nuance that one could add to make the definition of shards more complex, but that I left out to not blow it up:
[...]
Additionally, not all convergently reinforced actions may ever have been useful for steering toward reward: in actor-critic policy gradient reinforcement learning, actions are strengthened if they lead to states with a higher evaluation than the state one is coming from.
This is likely more central to my picture of shard dynamics than it is to the post’s. The distinction between “steering toward reward” and “steering towards subjective value-increase” is a generator of my intuitions around what agents end up caring about and how.
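To make the quoted mechanism concrete, here is a toy tabular one-step actor-critic update (my own illustration, not code from the post). The quantity that strengthens or weakens the taken action is the TD error, i.e. the critic's subjective value-increase, not the raw reward on its own.

```python
import math
from collections import defaultdict

def actor_critic_update(V, prefs, s, a, r, s_next, actions,
                        gamma=0.99, alpha_v=0.1, alpha_pi=0.1):
    """One step of tabular one-step actor-critic with a softmax policy.
    V maps states to value estimates; prefs maps (state, action) pairs to
    action preferences."""
    delta = r + gamma * V[s_next] - V[s]   # TD error: the subjective value-increase
    V[s] += alpha_v * delta                # critic moves toward the new estimate

    # Policy-gradient step on softmax preferences: the action actually taken is
    # strengthened (or weakened) in proportion to delta, not to the raw reward r.
    exp_prefs = {b: math.exp(prefs[(s, b)]) for b in actions}
    z = sum(exp_prefs.values())
    for b in actions:
        pi_b = exp_prefs[b] / z
        prefs[(s, b)] += alpha_pi * delta * ((1.0 if b == a else 0.0) - pi_b)
    return delta

# Toy usage: start from all-zero estimates and apply one update.
V, prefs = defaultdict(float), defaultdict(float)
actor_critic_update(V, prefs, s="s0", a="left", r=1.0, s_next="s1",
                    actions=["left", "right"])
```

Note that an action can be reinforced even when the immediate reward is zero, so long as it moves the agent to a state the critic rates more highly, which is exactly the "value-increase rather than reward" point.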
Difference of opinion about the claims
insofar as reinforcement learning operates in such a way that the agent, over time, receives an ever higher reward, and insofar as adversarial actions — like wireheading — receive an extremely high reward, it is important to reason why the training process cannot possibly find them.
This is too strong IMO. It’s important to reason why the training process is very unlikely to find them, but not why the training process cannot possibly find them. Exploration is stochastic and factors strongly into how the training trajectory progresses, including what behaviors are “found” in the sense you’re describing. Since there’s a possibility we just get unlucky there and the agent happens to sample a wireheading trajectory while it is still naive, we can’t make these sorts of worst-case arguments. Nor do I think it is necessarily important to do so. Solid probabilistic arguments and empirical arguments seem sufficient, if we’re talking about the phase of training when the agent is not clever enough to foresightedly seek out / avoid wireheading.
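As a toy version of the kind of probabilistic argument I have in mind (my own illustrative numbers, not the post's): under naive uniform random exploration, the chance of sampling one specific multi-step wireheading trajectory is tiny but nonzero, which is why "very unlikely" is the right standard rather than "cannot possibly".

```python
def prob_of_specific_trajectory(num_actions: int, k: int, episodes: int) -> float:
    """P(at least one episode samples a particular k-step action sequence),
    assuming uniform random exploration and independence across episodes."""
    p_per_episode = (1.0 / num_actions) ** k
    return 1.0 - (1.0 - p_per_episode) ** episodes

# e.g. 10 actions, a 12-step wireheading sequence, a million episodes:
print(prob_of_specific_trajectory(10, 12, 10**6))   # ~1e-6: small, but not zero
```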