Hmm, I’m not sure anyone is “making an assertion that we expect to hold no matter how much the AI is scaled up.”, unless scaling up means something pretty narrow like applying current RL algorithms to larger and larger networks and more and more data.
But you’re probably right that my claim is not strictly a narrowing of the original. FWIW, I think both your (1) and (2) above are pretty likely when talking about current and near-future systems, as they scale to human levels of capability and agency, but not necessarily beyond.
I read the original post as talking mainly about current methods for RL, applied to future systems, though TurnTrout and I probably disagree on when it makes sense to start calling a system an “RL agent”.
Also, regarding your thought experiment—of course, if in training the AI finds some way to cheat, that will be reinforced! But that has limited relevance for when cheating in training isn’t possible.
As someone who has worked in computer security, and also written and read a lot of Python code, my guess is that cheating at current RL training processes as actually implemented is very, very possible for roughly human-level agents. (That was the other point of my post on gradient hacking.)
Hmm, I’m not sure anyone is “making an assertion that we expect to hold no matter how much the AI is scaled up.”, unless scaling up means something pretty narrow like applying current RL algorithms to larger and larger networks and more and more data.
While I did intend (ii) to mean something relatively narrow like that, I will make the assertion that I expect 2(B)(ii) (which I think was TurnTrout’s main point) to hold for a large class of algorithms, not just current ones, and that it would require a major screw-up for someone to implement an algorithm for which it didn’t hold.
As someone who has worked in computer security, and also written and read a lot of Python code, my guess is that cheating at current RL training processes as actually implemented is very, very possible for roughly human-level agents. (That was the other point of my post on gradient hacking.)
I wouldn’t be surprised.
But, I would be surprised if it actually did cheat, unless the hacking were not merely possible with planning but pretty much laid out on its path.
The thing is, it’s not trying to maximize the reward! (Back to TurnTrout’s point again). It’s gradient descent-ing in some attractor basin towards cognitive strategies that get good rewards in practice, and the hacking probably isn’t in the same gradient descent attractor basin.
Even if it does develop goals and values, they will be shaped by the attractor basin that it’s actually in, and not by other attractor basins.
A human with pre-existing goals is a different matter—that’s why I questioned the relevance of the thought experiment.
Hmm, I’m not sure anyone is “making an assertion that we expect to hold no matter how much the AI is scaled up.”, unless scaling up means something pretty narrow like applying current RL algorithms to larger and larger networks and more and more data.
But you’re probably right that my claim is not strictly a narrowing of the original. FWIW, I think both your (1) and (2) above are pretty likely when talking about current and near-future systems, as they scale to human levels of capability and agency, but not necessarily beyond.
I read the original post as talking mainly about current methods for RL, applied to future systems, though TurnTrout and I probably disagree on when it makes sense to start calling a system an “RL agent”.
As someone who has worked in computer security, and also written and read a lot of Python code, my guess is that cheating at current RL training processes as actually implemented is very, very possible for roughly human-level agents. (That was the other point of my post on gradient hacking.)
While I did intend (ii) to mean something relatively narrow like that, I will make the assertion that I expect 2(B)(ii) (which I think was TurnTrout’s main point) to hold for a large class of algorithms, not just current ones, and that it would require a major screw-up for someone to implement an algorithm for which it didn’t hold.
I wouldn’t be surprised.
But, I would be surprised if it actually did cheat, unless the hacking were not merely possible with planning but pretty much laid out on its path.
The thing is, it’s not trying to maximize the reward! (Back to TurnTrout’s point again). It’s gradient descent-ing in some attractor basin towards cognitive strategies that get good rewards in practice, and the hacking probably isn’t in the same gradient descent attractor basin.
Even if it does develop goals and values, they will be shaped by the attractor basin that it’s actually in, and not by other attractor basins.
A human with pre-existing goals is a different matter—that’s why I questioned the relevance of the thought experiment.