DanielFilan asked me to comment on this paragraph:
Yet, even in the relatively formal world of machine learning, the practice seems contrary to this. When you are optimizing a neural network, you don’t actually care that much whether it’s something like a hypothesis (making predictions) or something like a policy (carrying out actions). You apply the same kind of regularization either way, as far as I understand (regularization being the machine-learner’s generalization of Occam).
AFAIK, it’s not actually standard to regularize RL policies the same way you regularize supervised learning models. For example, A3C, PPO, and SAC, three leading deep RL algorithms, use an entropy bonus to regularize their policies. Notably, entropy encourages policies that spread probability over many different actions, rather than policies that are internally simple. In supervised learning, on the other hand, people use techniques such as L2 regularization and Dropout to get predictors that are simple.
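For concreteness, here’s a minimal PyTorch-style sketch of the contrast. The function names, loss shapes, and coefficients are illustrative assumptions on my part, not taken from any particular paper; the point is just that the entropy bonus rewards spread-out action distributions, while L2 weight decay shrinks a predictor’s weights.

import torch
import torch.nn.functional as F

# RL-style entropy bonus (as in A3C/PPO/SAC): regularize toward *diverse behavior*,
# i.e. high-entropy action distributions. (Illustrative sketch, not any paper's code.)
def policy_loss_with_entropy_bonus(logits, actions, advantages, ent_coef=0.01):
    log_probs = F.log_softmax(logits, dim=-1)                    # log pi(a|s)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    pg_loss = -(chosen * advantages).mean()                      # policy-gradient term
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()  # entropy of pi(.|s)
    return pg_loss - ent_coef * entropy                          # bonus for randomness, not simplicity

# Supervised-style L2 regularization (weight decay): regularize toward an
# *internally simple* predictor by shrinking its weights.
def supervised_loss_with_l2(model, inputs, targets, l2_coef=1e-4):
    preds = model(inputs)
    mse = F.mse_loss(preds, targets)
    l2 = sum((p ** 2).sum() for p in model.parameters())
    return mse + l2_coef * l2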
You do see L2 regularization used in a lot of deep RL papers (for example, it’s used on every network in AlphaZero variants, in DQN, and even in earlier versions of the SAC algorithm I mentioned above). However, it’s important to note that in those cases the L2 regularization is applied to prediction tasks (see the sketch after these examples):
The policy and value networks try to predict the MCTS-amplified policy and value, respectively.
The L2-regularized networks in DQN and SAC are used to predict the Q-values.
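To illustrate, here is a hedged DQN-flavored sketch of how an L2 term would attach to the Q-value regression objective. The names (q_net, target_net, batch) and coefficients are hypothetical, not taken from the DQN paper; the point is that the L2 penalty regularizes a prediction of the Bellman target, not the behavior itself.

import torch
import torch.nn.functional as F

# Hypothetical DQN-style sketch: L2 regularizes a *prediction* problem,
# namely regressing Q(s, a) onto the Bellman target.
def dqn_loss_with_l2(q_net, target_net, batch, gamma=0.99, l2_coef=1e-4):
    states, actions, rewards, next_states, dones = batch  # dones: float tensor of 0/1
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * q_next
    td_loss = F.mse_loss(q_pred, target)           # supervised-style regression
    l2 = sum((p ** 2).sum() for p in q_net.parameters())
    return td_loss + l2_coef * l2                  # Occam applied to the predictor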
(Vanessa’s argument about PSRL also seems similar, as PSRL is fundamentally doing supervised learning.)
As for the actual question, I’m not sure “instrumental Occam” exists. Absent multi-agent issues, my guess is that Occam’s razor is useful in RL insofar as your algorithm has predictive tasks. You want a simple rule for predicting the reward given your action, not a simple rule mapping observation histories to actions. Insofar as an actual simplicity prior on policies exists and is useful, my guess is that it’s because your AI might interact with other AIs (including copies of itself), and so needs to be legible/inexploitable/predictable/etc.
Excellent, thanks for the comment! I really appreciate the correction. That’s quite interesting.