Executing a policy trained through current reinforcement learning methods does not necessarily result in a system which takes actions to maximize the reward function.
I am not convinced that this is actually a narrowing of what TurnTrout said.
Consider the following possible claims:
An AI will not maximize something similar to the original reward function it is trained on.
An AI will not maximize the numerical value it receives from its reward module.
combined with one of:
A) either of 1 or 2 but, to clarify, what we mean by maximizing a target is specifically something like agentically seeking/valuing the target in some fairly explicit manner.
B) either 1 or 2 but, to clarify, what we mean by maximizing a target is acting in a way that looks like it is maximizing the target in question (as opposed to maximizing something else or not maximizing anything in particular), without it necessarily being an explicit goal/value.
combined with one of:
i) any of the above combinations, but to clarify, we are talking about current AI and not making general claims about scaled up AI.
ii) any of the above combinations, but to clarify, we are making an assertion that we expect to hold no matter how much the AI is scaled up.
I interpret TurnTrout’s as mainly saying 2(B)(ii) (e.g. the AI will not, in general, rewrite its reward module to output MAX_INT regardless of how smart it becomes—I agree with this point).
I also think TurnTrout is probably additionally saying 1(A)(ii) (i.e., the AI won’t explicitly value or agentically seek to maximize its original reward function no matter how much it is scaled up—this is plausible to me, but I am not as sure I agree with this as compared to 2(B)(ii)).
I interpret you, in the quote above, as maybe saying 1(A)(i), (i.e. current AIs don’t explicitly value or agentically seek to maximize the reward function on which they are trained on). While I agree, and this is weaker than 1(A)(ii) which seems to me a secondary point of TurnTrout’s post, I don’t think it is strictly speaking narrower than 2(B)(ii) which I think was TurnTrout’s main point.
Also, regarding your thought experiment—of course, if in training the AI finds some way to cheat, that will be reinforced! But that has limited relevance for when cheating in training isn’t possible. I also think that the fact that a human has pre-existing values, while the AI doesn’t, makes the thought experiment not that useful an analogy.
Hmm, I’m not sure anyone is “making an assertion that we expect to hold no matter how much the AI is scaled up.”, unless scaling up means something pretty narrow like applying current RL algorithms to larger and larger networks and more and more data.
But you’re probably right that my claim is not strictly a narrowing of the original. FWIW, I think both your (1) and (2) above are pretty likely when talking about current and near-future systems, as they scale to human levels of capability and agency, but not necessarily beyond.
I read the original post as talking mainly about current methods for RL, applied to future systems, though TurnTrout and I probably disagree on when it makes sense to start calling a system an “RL agent”.
Also, regarding your thought experiment—of course, if in training the AI finds some way to cheat, that will be reinforced! But that has limited relevance for when cheating in training isn’t possible.
As someone who has worked in computer security, and also written and read a lot of Python code, my guess is that cheating at current RL training processes as actually implemented is very, very possible for roughly human-level agents. (That was the other point of my post on gradient hacking.)
Hmm, I’m not sure anyone is “making an assertion that we expect to hold no matter how much the AI is scaled up.”, unless scaling up means something pretty narrow like applying current RL algorithms to larger and larger networks and more and more data.
While I did intend (ii) to mean something relatively narrow like that, I will make the assertion that I expect 2(B)(ii) (which I think was TurnTrout’s main point) to hold for a large class of algorithms, not just current ones, and that it would require a major screw-up for someone to implement an algorithm for which it didn’t hold.
As someone who has worked in computer security, and also written and read a lot of Python code, my guess is that cheating at current RL training processes as actually implemented is very, very possible for roughly human-level agents. (That was the other point of my post on gradient hacking.)
I wouldn’t be surprised.
But, I would be surprised if it actually did cheat, unless the hacking were not merely possible with planning but pretty much laid out on its path.
The thing is, it’s not trying to maximize the reward! (Back to TurnTrout’s point again). It’s gradient descent-ing in some attractor basin towards cognitive strategies that get good rewards in practice, and the hacking probably isn’t in the same gradient descent attractor basin.
Even if it does develop goals and values, they will be shaped by the attractor basin that it’s actually in, and not by other attractor basins.
A human with pre-existing goals is a different matter—that’s why I questioned the relevance of the thought experiment.
I am not convinced that this is actually a narrowing of what TurnTrout said.
Consider the following possible claims:
An AI will not maximize something similar to the original reward function it is trained on.
An AI will not maximize the numerical value it receives from its reward module.
combined with one of:
A) either of 1 or 2 but, to clarify, what we mean by maximizing a target is specifically something like agentically seeking/valuing the target in some fairly explicit manner.
B) either 1 or 2 but, to clarify, what we mean by maximizing a target is acting in a way that looks like it is maximizing the target in question (as opposed to maximizing something else or not maximizing anything in particular), without it necessarily being an explicit goal/value.
combined with one of:
i) any of the above combinations, but to clarify, we are talking about current AI and not making general claims about scaled up AI.
ii) any of the above combinations, but to clarify, we are making an assertion that we expect to hold no matter how much the AI is scaled up.
I interpret TurnTrout’s as mainly saying 2(B)(ii) (e.g. the AI will not, in general, rewrite its reward module to output MAX_INT regardless of how smart it becomes—I agree with this point).
I also think TurnTrout is probably additionally saying 1(A)(ii) (i.e., the AI won’t explicitly value or agentically seek to maximize its original reward function no matter how much it is scaled up—this is plausible to me, but I am not as sure I agree with this as compared to 2(B)(ii)).
I interpret you, in the quote above, as maybe saying 1(A)(i), (i.e. current AIs don’t explicitly value or agentically seek to maximize the reward function on which they are trained on). While I agree, and this is weaker than 1(A)(ii) which seems to me a secondary point of TurnTrout’s post, I don’t think it is strictly speaking narrower than 2(B)(ii) which I think was TurnTrout’s main point.
Also, regarding your thought experiment—of course, if in training the AI finds some way to cheat, that will be reinforced! But that has limited relevance for when cheating in training isn’t possible. I also think that the fact that a human has pre-existing values, while the AI doesn’t, makes the thought experiment not that useful an analogy.
Hmm, I’m not sure anyone is “making an assertion that we expect to hold no matter how much the AI is scaled up.”, unless scaling up means something pretty narrow like applying current RL algorithms to larger and larger networks and more and more data.
But you’re probably right that my claim is not strictly a narrowing of the original. FWIW, I think both your (1) and (2) above are pretty likely when talking about current and near-future systems, as they scale to human levels of capability and agency, but not necessarily beyond.
I read the original post as talking mainly about current methods for RL, applied to future systems, though TurnTrout and I probably disagree on when it makes sense to start calling a system an “RL agent”.
As someone who has worked in computer security, and also written and read a lot of Python code, my guess is that cheating at current RL training processes as actually implemented is very, very possible for roughly human-level agents. (That was the other point of my post on gradient hacking.)
While I did intend (ii) to mean something relatively narrow like that, I will make the assertion that I expect 2(B)(ii) (which I think was TurnTrout’s main point) to hold for a large class of algorithms, not just current ones, and that it would require a major screw-up for someone to implement an algorithm for which it didn’t hold.
I wouldn’t be surprised.
But, I would be surprised if it actually did cheat, unless the hacking were not merely possible with planning but pretty much laid out on its path.
The thing is, it’s not trying to maximize the reward! (Back to TurnTrout’s point again). It’s gradient descent-ing in some attractor basin towards cognitive strategies that get good rewards in practice, and the hacking probably isn’t in the same gradient descent attractor basin.
Even if it does develop goals and values, they will be shaped by the attractor basin that it’s actually in, and not by other attractor basins.
A human with pre-existing goals is a different matter—that’s why I questioned the relevance of the thought experiment.