My intuition is it should be small in most cases, but there are some scenarios where this could be important.
Let’s imagine we are training a reinforcement learning agent AGI that discounts rewards over time by a factor d with 0 < d < 1 (so an expected reward r received n time steps from now is worth r*d^n at the current time step). Let’s further assume the wireheading problem has been solved (the AI can’t change the reward-calculating process to give itself, say, infinite reward), and that there is a maximum possible reward of M per time step. This maximum could be coded explicitly into the reward function, or could follow implicitly from physical limits.
If deployment is at least b time steps away, the maximum possible deployment reward after discounting (as calculated at the current time step) is M*d^b/(1-d), which is the sum of M*d^n over all n >= b. If a deception would cost the AGI more short-term reward than that amount, it will not carry out that deception. There seems to be a bit of a tradeoff: the bigger the multiplier between the deployment reward and the training reward, the larger M becomes, and the smaller d needs to be in order to encourage the AGI to engage in non-deceptive behavior.
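As a rough illustration of the bound, here is a minimal Python sketch. The function names `max_discounted_deployment_reward` and `deception_is_deterred` and the example numbers are made up for illustration; only M, d, and b come from the argument above, and `sacrifice` stands for the short-term reward the AGI would give up by deceiving.

```python
def max_discounted_deployment_reward(M: float, d: float, b: int) -> float:
    """Upper bound on deployment reward as valued at the current time step.

    Sums M * d**n over all n >= b, which is the geometric series
    M * d**b / (1 - d).
    """
    if not 0 < d < 1:
        raise ValueError("d must satisfy 0 < d < 1")
    if b < 0:
        raise ValueError("b must be non-negative")
    return M * d**b / (1 - d)


def deception_is_deterred(sacrifice: float, M: float, d: float, b: int) -> bool:
    """True if the short-term reward given up exceeds anything deployment could repay."""
    return sacrifice > max_discounted_deployment_reward(M, d, b)


# Hypothetical numbers: M = 1 per step, d = 0.5, deployment 3 steps away.
bound = max_discounted_deployment_reward(1.0, 0.5, 3)
print(bound)                                    # 0.25
print(deception_is_deterred(0.3, 1.0, 0.5, 3))  # True
print(deception_is_deterred(0.2, 1.0, 0.5, 3))  # False
```

Note that this only models the simplified setting described here (a hard cap of M per step and a fixed discount factor), not any real training setup.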
If we want d to be larger, which would enable an aligned AI to make useful long-term plans, this argument claims we should use a smaller multiplier between deployment reward and training reward so that an unaligned AI is incentivized not to deceive us. Alternatively, this argument claims that you should wait a long time between when you test whether your model is deceptive and when you deploy it: running many time steps with 0 reward between your deception test and deployment will increase b and thus shrink M*d^b/(1-d), which should make deception much less attractive.
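To get a feel for how long that wait would have to be, the sketch below finds the smallest b for which the bound M*d^b/(1-d) drops below a given short-term sacrifice. The function name `min_delay_steps` and the example values are hypothetical, chosen only to illustrate the scaling.

```python
def min_delay_steps(M: float, d: float, sacrifice: float) -> int:
    """Smallest b such that sacrifice > M * d**b / (1 - d).

    Each extra step of delay multiplies the bound by d, so we simply
    shrink it step by step until the sacrifice exceeds it.
    """
    if not 0 < d < 1:
        raise ValueError("d must satisfy 0 < d < 1")
    if sacrifice <= 0 or M <= 0:
        raise ValueError("M and sacrifice must be positive")
    bound = M / (1 - d)  # bound when deployment starts immediately (b = 0)
    b = 0
    while sacrifice <= bound:
        bound *= d
        b += 1
    return b


# Hypothetical numbers: M = 1, d = 0.5, deception costs 0.3 reward now.
print(min_delay_steps(1.0, 0.5, 0.3))  # 3

# A more patient agent needs a much longer wait for the same sacrifice.
for d in (0.5, 0.9, 0.99):
    print(d, min_delay_steps(1.0, d, 0.3))
```

The required delay grows quickly as d approaches 1, which is the tension described above: the more patient the agent, the longer the zero-reward gap has to be.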