Haha, my immediate thought after “maximize total reward” was to wonder if I should intentionally limit my success rate (possibly slowly improving over time or even varying semi-predictably) to try to extend the run-time of the experiment. What use is 100% reward after all if the experiment ends as soon as I achieve it?
you’re projecting your own desire for a long run-time; the AI only wants to maximize reward
I’m maximizing total reward over the run rather than the rate of reward.
ah, ok yeah i see!
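Concretely: total reward = success rate × run length, so if hitting 100% gets the experiment shut down, a lower rate that keeps the run alive can come out ahead. Here's a minimal sketch of that arithmetic (the function name, the numbers, and the assumption that a perfect score triggers a quick shutdown are all hypothetical illustrations, not details from any actual experiment):

```python
# Toy comparison of two policies, under the (assumed) rule that the
# experiment is terminated shortly after the agent hits a perfect
# success rate, but otherwise runs to a fixed horizon.

def total_reward(success_rate: float, horizon: int = 1000) -> float:
    """Total reward: per-step success rate times steps actually run."""
    if success_rate >= 1.0:
        steps = 10  # assumed short evaluation window before shutdown
    else:
        steps = horizon  # an imperfect agent keeps the experiment running
    return success_rate * steps

print(total_reward(1.0))  # 10.0  -> perfect rate, but the run ends early
print(total_reward(0.9))  # 900.0 -> lower rate over a long run wins
```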