Thanks for writing the report—I think it’s an important issue, and you’ve clearly gone to a lot of effort. Overall, I think it’s good.
However, it seems to me that the “incentivized episode” concept is confused, and that some conclusions over the probability of beyond-episode goals are invalid on this basis. I’m fairly sure I’m not confused here (though I’m somewhat confused that no-one picked this up in a draft, so who knows?!).
I’m not sure to what extent the below will change your broader conclusions—if it only moves you from [this can’t happen, by definition], to [I expect this to be rare], the implications may be slight. It does seem to necessitate further thought—and different assumptions in empirical investigations.
The below can all be considered as [this is how things seem to me], so I’ll drop caveats along those lines. I’m largely making the same point throughout—I’m hoping the redundancy will increase clarity.
“The unit of time such that we can know by definition that training is not directly pressuring the model to care about consequences beyond that time.”—this is not a useful definition, since there is no unit of time where we know this by definition. We know only that the model is not pressured by consequences beyond training.
For this reason, in what follows I’ll largely assume that [episode = all of training]. Sticking with the above definition gives unbounded episodes, and that doesn’t seem in the spirit of things.
Replacing “incentivized episode” with “incentivizing episode” would help quite a bit. This clarifies the causal direction we know by definition. It’s a mistake to think that we can know what’s incentivized “by definition”: we know what incentivizes.
In particular, only the episode incentivizes; we expect [caring about the episode] to be incentivized; we shouldn’t expect [caring about the episode] to be the only thing incentivized (at least not by definition).
For instance, for any t, we can put our best attempt at [care about consequences beyond t] in the reward function (perhaps we reward behaviour most likely to indicate post-t caring; perhaps we use interpretability on weights and activations, and we reward patterns that indicate post-t caring—suppose for now that we have superb interpretability tools).
The interpretability case is more direct than rewarding future consequences: it rewards the mechanisms that constitute caring/goals directly.
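To make the contrived case concrete, here is a minimal sketch (mine, not the report’s). Every name in it is hypothetical; `post_t_caring_score` stands in for the superb interpretability tools supposed above:

```python
# Deliberately contrived sketch: a reward function that directly rewards
# indicators of caring about consequences beyond time t. All names here are
# hypothetical stand-ins for the setup described above.

def task_reward(trajectory) -> float:
    # Stand-in for the ordinary in-episode reward.
    return sum(step.get("reward", 0.0) for step in trajectory)

def post_t_caring_score(weights, activations) -> float:
    # Placeholder probe over weights/activations. Actually scoring post-t
    # caring is the hard part, and is assumed away here.
    return 0.0

def shaped_reward(trajectory, weights, activations, alpha: float = 1.0) -> float:
    # Every reward is computed during training, yet the resulting gradients
    # directly favour whatever the probe detects as post-t caring.
    return task_reward(trajectory) + alpha * post_t_caring_score(weights, activations)
```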
If a process we’ve designed explicitly to train for x is described as “not directly pressuring the model to x” by definition, then our definition is at least misleading—I’d say broken.
Of course this example is contrived—but it illustrates the problem. There may be analogous implicit (but direct!) effects that are harder to notice. We can’t rule these out with a definition.
These need not be down to inductive bias: as here, they can be favoured as a consequence of the reward function (though usually not so explicitly and extremely as in this example).
What matters is the robustness of correlations, not directness.
Most directness is in our map, not in the territory. I.e. it’s largely saying [this incentive is obvious to us], [this consequence is obvious to us] etc.
The episode is the direct cause of gradient updates.
As a consequence of gradient updates, behaviour on the episode is no more/less direct than behaviour outside that interval.
While a strict constraint on beyond-episode goals may be rare, pressure will not be.
The former would require that some beyond-episode goal is entailed by high performance in-episode.
The latter only requires that some beyond-episode goal is more likely given high performance in-episode.
This is the kind of pressure that might, in principle, be overcome by adversarial training—but it remains direct pressure for beyond-episode goals resulting from in-episode consequences.
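One way to make the constraint/pressure distinction precise (my formalization, not the report’s): write H for [high in-episode performance under the training process] and G for [the trained model has some beyond-episode goal]. Then:

$$
\text{strict constraint: } P(G \mid H) = 1 \qquad\qquad \text{pressure: } P(G \mid H) > P(G)
$$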
The most significant invalid conclusion (Page 52):
“Why might you not expect naturally-arising beyond-episode goals? The most salient reason, to me, is that by definition, the gradients given in training (a) do not directly pressure the model to have them…”
As a general “by definition” claim, this is false (or just silly, if we say that the incentivized episode is unbounded, and there is no “beyond-episode”).
It’s less clear to me how often I’d expect active pressure towards beyond-episode-goals over strictly-in-episode-goals (other than through the former being more common/simple). It is clear that such pressure is possible.
Directness is not the issue: if I keep x small, I keep x² small. That I happen to be manipulating x directly, rather than x² directly, is immaterial.
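A minimal, purely illustrative sketch of that point: the update is applied to x alone, yet x² is controlled just as tightly.

```python
# Purely illustrative: we only ever manipulate x "directly", yet x**2 is
# driven toward zero just as reliably. The directness is in our description
# of the update, not in which quantities end up controlled.
x, lr = 10.0, 0.1
for _ in range(200):
    x -= lr * 2 * x                      # gradient step on f(x) = x**2, applied to x only
print(f"x = {x:.3e}, x^2 = {x*x:.3e}")   # both end up near zero
```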
Again, the out-of-episode consequences of gradient updates are just as direct as the in-episode consequences—they’re simply harder for us to predict.