I would sure be awfully surprised to see that! Wouldn’t you?
My surprise would stem from observing that RL in a trivial environment yielded a system that is capable of calculating or reasoning about π. If you replace the PacMan environment with a complex environment and sufficiently scale up the architecture and training compute, I wouldn’t be surprised to learn the system is doing very impressive computations that have nothing to do with the intended objective.
Note that the examples in my comment don’t rely on deceptive alignment. To “convert” your PacMan RL agent example to the sort of examples I was talking about: suppose that the objective the agent ends up with is “make the relevant memory location in the RAM say that I won the game”, or “win the game in all future episodes”.
My hunch is that we don’t disagree about anything. I think you keep trying to convince me of something that I already agree with, and meanwhile I keep trying to make a point that is so trivially obvious that you’re misinterpreting me as saying something more interesting than I am.