Yes, the “shutdown timer” mechanism is part of the policy-scoring function that is used during policy optimization. OAA has multiple stages that could be considered “training”, and policy optimization is the one that is closest to the end, so I wouldn’t call it “the training stage”, but it certainly isn’t the deployment stage.
We hope not merely that the policy only cares about the short term, but also that it cares quite a lot about gracefully shutting itself down on time.
I guess the shutdown timer would be most important in the training stage, so that it (hopefully) learns only to care about the short term.