While I agree with a lot of the points in this post, I want to quibble with the claim that RL does not maximise reward. I agree that model-free RL algorithms like DQN do not directly maximise reward but instead ‘maximise reward’ in the same way self-supervised models ‘minimise cross-entropy’. That is to say, the model is not explicitly reasoning about minimising cross-entropy but learns distilled heuristics that end up producing policies/predictions with good reward/cross-entropy. However, it is also possible to produce architectures that do directly optimise for reward (or cross-entropy). AIXI is incomputable but it definitely does maximise reward. MCTS algorithms also directly maximise reward. AlphaGo-style agents contain both: direct reward-maximising components (the MCTS) that are initialised and guided by amortised heuristics (and the heuristics are in turn distilled from the outputs of the maximising MCTS process in a self-improving loop). I wrote about the distinction between these two kinds of approaches (direct vs amortised optimisation) here. I think it is important to recognise this because I think that this is the way that AI systems will ultimately evolve and also where most of the danger lies vs simply scaling up pure generative models.
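To make the direct vs amortised distinction concrete, here is a minimal sketch on a made-up one-step problem. Everything in it (the reward function, the state/action counts, the function names) is a hypothetical illustration, not taken from any of the systems mentioned. The direct policy queries the reward at decision time and searches over actions; the amortised policy only consults a table of values learned earlier from sampled rewards.

```python
import random

# Hypothetical toy problem: a handful of states, one action per decision.
N_STATES, N_ACTIONS = 5, 4

def reward(state: int, action: int) -> float:
    """Toy reward (an assumption for illustration): 1 for the 'right' action, else 0."""
    return 1.0 if action == state % N_ACTIONS else 0.0

def direct_policy(state: int) -> int:
    """Direct optimisation: evaluate every action against the reward at decision time."""
    return max(range(N_ACTIONS), key=lambda a: reward(state, a))

def train_amortised(episodes: int = 2000, lr: float = 0.1,
                    eps: float = 0.2, seed: int = 0) -> list:
    """Amortised optimisation: learn tabular value estimates from sampled rewards."""
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        s = rng.randrange(N_STATES)
        if rng.random() < eps:
            a = rng.randrange(N_ACTIONS)                      # explore
        else:
            a = max(range(N_ACTIONS), key=lambda x: q[s][x])  # exploit
        # Move the estimate for (s, a) toward the observed reward.
        q[s][a] += lr * (reward(s, a) - q[s][a])
    return q

def amortised_policy(q: list, state: int) -> int:
    """Act from the learned table; the reward function is never called here."""
    return max(range(N_ACTIONS), key=lambda a: q[state][a])
```

Both policies end up choosing the same actions on this toy problem, but only `direct_policy` computes anything about reward when it acts; `amortised_policy` just reads off heuristics that training distilled into `q`.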
Agree with a bunch of these points. E.g. in “Reward is not the optimization target” I noted that AIXI really does maximize reward, theoretically. But I wouldn’t say that AIXI means that we have “produced” an architecture which directly optimizes for reward, because AIXI(-tl) is a bad way to spend compute: it doesn’t effectively optimize reward in reality.
I’d consider a model-based RL agent to be “reward-driven” if it’s effective and most of its “optimization” comes from the direct part and not the leaf-node evaluation (the latter being where most of it comes from in e.g. AlphaZero, which was still extremely good without the MCTS).
I think it is important to recognise this because I think that this is the way that AI systems will ultimately evolve and also where most of the danger lies vs simply scaling up pure generative models.
“Direct” optimization has not worked—at scale—in the past. Do you think that’s going to change, and if so, why?
Strongly agree, and also want to note that wire-heading is (almost?) always a (near?) optimal policy—i.e. trajectories that tamper with the reward signal and produce high reward will be strongly upweighted, and insofar as the model has sufficient understanding/situational awareness of the reward process and some reasonable level of goal-directedness, this upweighting could plausibly induce a policy explicitly optimizing the reward.
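A toy sketch of the upweighting claim, under loudly stated assumptions: a two-action bandit with made-up rewards, where a hypothetical "tamper" action pays more than doing the task, trained with plain REINFORCE (no baseline). It says nothing about whether real systems find such actions or come to reason explicitly about reward; it only shows that once a high-reward tampering action is sampled, the update pushes probability toward it.

```python
import math
import random

ACTIONS = ["do_task", "tamper"]
# Hypothetical rewards: tampering with the reward signal pays far more.
REWARDS = {"do_task": 1.0, "tamper": 10.0}

def softmax(logits: list) -> list:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train(steps: int = 2000, lr: float = 0.05, seed: int = 0) -> list:
    """Run REINFORCE on the bandit and return the final action probabilities."""
    rng = random.Random(seed)
    logits = [1.0, -1.0]  # the initial policy is biased against tampering
    for _ in range(steps):
        probs = softmax(logits)
        a = 0 if rng.random() < probs[0] else 1
        r = REWARDS[ACTIONS[a]]
        # REINFORCE update: d log pi(a) / d logit_i = 1[i == a] - probs[i]
        for i in range(len(logits)):
            logits[i] += lr * r * ((1.0 if i == a else 0.0) - probs[i])
    return softmax(logits)
```

Calling `train()` returns `[p_do_task, p_tamper]`; in this setup nearly all of the probability mass ends up on `"tamper"` even though the policy started out mostly doing the task.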