This seems like it would raise the incentive for an AGI to be deceptive in its training environment. An unaligned AGI faces a choice: act to maximize its goals in training and get a higher short-term reward, or deceptively pretend to be aligned in training and accept a lower short-term reward. The benefit to the AGI of pretending to be aligned is that it increases the probability of being deployed, and thus of getting a higher long-term reward in deployment.
Thus the bigger the discrepancy between the reward an AGI would get in the deployment environment and in the training environment, the more important the long-term deployment reward becomes, and the stronger the incentive for the AGI to be deceptive during training.
Perhaps the difference would be small, though? It feels like a relatively unlikely situation in which the AI chooses not to be deceptive, but would have chosen to be deceptive if it had calculated that, in the brief period before it takes over the world, it would be getting 2x reward per second instead of x.
Yeah, intuitively I don’t see that as breaking alignment if it’s already aligned, and an unaligned AI would already have incentive to lie, I think. Considering the potential positive impact this could have qualia-wise, imo it’s a worthwhile practice to carry out.
My intuition is it should be small in most cases, but there are some scenarios where this could be important.
Let’s imagine we are training a reinforcement learning agent AGI that discounts rewards in time by some parameter d with 0 < d < 1 (so an expected reward r received n time steps from now is worth r*d^n at the current time step). Let’s further assume the wireheading problem has been solved (the AI can’t change the reward-calculating process to give itself, say, infinite reward), and that there is a maximum possible reward of M per time step. This maximum reward could be coded explicitly into the reward function, or could be implicitly calculated from physical limits.
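As a minimal sketch of this setup (the values d = 0.99 and M = 1.0 are illustrative assumptions, not numbers from the discussion), the discounted present value of a single reward r received n time steps from now is just r*d^n:

    # Illustrative discounting setup; d and M are assumed example values.
    d = 0.99   # per-time-step discount factor, 0 < d < 1
    M = 1.0    # assumed maximum possible reward per time step

    def present_value(r, n, d=d):
        # Value, at the current time step, of a reward r received n steps from now.
        return r * d**n

    print(present_value(1.0, 100))  # ~0.366: a reward 100 steps away is heavily discounted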
If deployment is at least b time steps away, the maximum possible deployment reward considering discounting (as calculated at the current time step) is M*d^b/(1-d). If there is any deception the AGI could do that would sacrifice its short-term reward by more than that amount, then it will not do that deception. There seems to be a bit of a tradeoff: the bigger the multiplier between the deployment reward and the training reward, the larger M becomes, and the smaller d needs to be in order to encourage the AGI to engage in non-deceptive behavior.
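A rough sketch of that comparison (the function names and parameter values here are mine, purely illustrative): summing the geometric series M*d^b + M*d^(b+1) + ... gives M*d^b/(1-d), and deception can only pay off if the short-term training reward it sacrifices is smaller than that bound.

    def max_discounted_deployment_reward(M, d, b):
        # Upper bound on discounted deployment reward when deployment is at least
        # b steps away: sum over t >= b of M * d**t  =  M * d**b / (1 - d).
        return M * d**b / (1 - d)

    def deception_could_pay_off(short_term_sacrifice, M, d, b):
        # Deception is ruled out whenever the training reward it sacrifices exceeds
        # the best-case discounted deployment reward.
        return short_term_sacrifice < max_discounted_deployment_reward(M, d, b)

    print(max_discounted_deployment_reward(M=1.0, d=0.99, b=100))  # ~36.6
    print(deception_could_pay_off(50.0, M=1.0, d=0.99, b=100))     # False: the sacrifice is too large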
If we want d to be larger, which would enable an aligned AI to make useful long-term plans, this argument claims we should use a smaller multiplier between the deployment reward and the training reward in order for an unaligned AI to be incentivized not to deceive us. Alternatively, this argument claims that you should wait a long time between when you test whether your model is deceptive and when you deploy it: running a lot of time steps with 0 reward between your deception test and deployment will increase b and thus shrink M*d^b/(1-d), which should make deception much less attractive.
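To put rough numbers on that second point (again with made-up parameters), the bound M*d^b/(1-d) shrinks geometrically as the gap b between the deception test and deployment grows:

    # How the deception bound shrinks as b grows (illustrative values only).
    M, d = 1.0, 0.99
    for b in (0, 100, 500, 1000):
        print(b, M * d**b / (1 - d))
    # b=0    -> 100.0
    # b=100  -> ~36.6
    # b=500  -> ~0.66
    # b=1000 -> ~0.0043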
An unaligned AGI faces a choice: act to maximize its goals in training and get a higher short-term reward, or deceptively pretend to be aligned in training and accept a lower short-term reward.
If there is a conflict between these, that must be because the AI’s conception of reward isn’t identical to the reward that we intended. So even if we dole out higher intended reward during deployment, it’s not clear that that increases the reward that the AI expects after deployment. (But it might.)
This concern seems relevant if (1) a discount factor is used in the RL setup (otherwise the system seems as likely to be deceptively aligned with or without the intervention, in order to eventually take over the world), and (2) a decision about whether the system is safe for deployment is made based on its behavior during training.
As an aside, the following quote from the paper seems relevant here:
Ensuring copies of the states of early potential precursor AIs are preserved to later receive benefits would permit some separation of immediate safety needs and fair compensation.