"what it should rationally do given the reward structure of a simple RL environment like CartPole"
I just realized another possible confusion:
RL, as a training method, determines what the future behaviour of the system under training will be; it is not a source of what that system rationally ought to do given its model of the world (if it has one).

Any rationality that emerges from RL training will be merely an instrumental by-product of the system being trained. A simple CartPole environment will not train a system to be rational, since a vastly simpler mapping from inputs to outputs achieves the RL objective just as well or better. A pre-trained rational AGI dropped into a simple CartPole RL environment may well have its rationality trained away rather than be trained to use that rationality to achieve the goal.
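To make the "vastly simpler mapping" point concrete, here is a minimal sketch (assuming the gymnasium package and its CartPole-v1 environment): a four-weight linear policy found by blind random search typically balances the pole, with nothing resembling deliberation or a world model involved.

```python
# Minimal sketch, assuming gymnasium and CartPole-v1: a trivial linear policy
# (one weight per observation dimension) found by plain random search is
# usually enough to balance the pole.
import numpy as np
import gymnasium as gym


def run_episode(env, w):
    """Roll out one episode with the linear policy: push right iff w . obs > 0."""
    obs, _ = env.reset()
    total, done = 0.0, False
    while not done:
        action = 1 if np.dot(w, obs) > 0 else 0
        obs, reward, terminated, truncated, _ = env.step(action)
        total += reward
        done = terminated or truncated
    return total


env = gym.make("CartPole-v1")
best_w, best_return = None, -np.inf
for _ in range(200):                      # blind random search over weight vectors
    w = np.random.uniform(-1, 1, size=4)  # 4 weights: cart pos/vel, pole angle/vel
    ret = run_episode(env, w)
    if ret > best_return:
        best_w, best_return = w, ret

print(f"best return found by random search: {best_return}")
```

If a stateless linear threshold over four numbers already maximises the reward, there is nothing in the training signal that selects for rationality; at best it selects for that threshold.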