Hmm, I’m not sure anyone is “making an assertion that we expect to hold no matter how much the AI is scaled up.”, unless scaling up means something pretty narrow like applying current RL algorithms to larger and larger networks and more and more data.
While I did intend (ii) to mean something relatively narrow like that, I will make the assertion that I expect 2(B)(ii) (which I think was TurnTrout’s main point) to hold for a large class of algorithms, not just current ones, and that it would require a major screw-up for someone to implement an algorithm for which it didn’t hold.
As someone who has worked in computer security, and also written and read a lot of Python code, my guess is that cheating at current RL training processes as actually implemented is very, very possible for roughly human-level agents. (That was the other point of my post on gradient hacking.)
I wouldn’t be surprised.
But, I would be surprised if it actually did cheat, unless the hacking were not merely possible with planning but pretty much laid out on its path.
The thing is, it’s not trying to maximize the reward! (Back to TurnTrout’s point again). It’s gradient descent-ing in some attractor basin towards cognitive strategies that get good rewards in practice, and the hacking probably isn’t in the same gradient descent attractor basin.
Even if it does develop goals and values, they will be shaped by the attractor basin that it’s actually in, and not by other attractor basins.
A human with pre-existing goals is a different matter—that’s why I questioned the relevance of the thought experiment.
While I did intend (ii) to mean something relatively narrow like that, I will make the assertion that I expect 2(B)(ii) (which I think was TurnTrout’s main point) to hold for a large class of algorithms, not just current ones, and that it would require a major screw-up for someone to implement an algorithm for which it didn’t hold.
I wouldn’t be surprised.
But, I would be surprised if it actually did cheat, unless the hacking were not merely possible with planning but pretty much laid out on its path.
The thing is, it’s not trying to maximize the reward! (Back to TurnTrout’s point again). It’s gradient descent-ing in some attractor basin towards cognitive strategies that get good rewards in practice, and the hacking probably isn’t in the same gradient descent attractor basin.
Even if it does develop goals and values, they will be shaped by the attractor basin that it’s actually in, and not by other attractor basins.
A human with pre-existing goals is a different matter—that’s why I questioned the relevance of the thought experiment.