What advantages do you think this has compared to vanilla RL on U + AUP_Penalty?
It’s also mild on the inside of the algorithm, not just in its effects on the world, which could avert problems with inner optimizers. Beyond that, I haven’t thought enough about the agent’s behavior; I might reply with another comment.
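For concreteness, here is a minimal sketch of the penalized reward that "vanilla RL on U + AUP_Penalty" would optimize, following the standard AUP formulation (penalize changes in attainable utility for auxiliary goals relative to a no-op). The auxiliary Q-values, the no-op baseline, and the scale of 0.1 are illustrative assumptions, not the exact implementation under discussion.

```python
def aup_reward(primary_reward, aux_q_action, aux_q_noop, scale=0.1):
    """Primary reward minus a penalty for shifting attainable utility.

    aux_q_action: auxiliary Q-values Q_i(s, a) for the chosen action
    aux_q_noop:   auxiliary Q-values Q_i(s, no-op) for doing nothing
    scale:        penalty coefficient (illustrative value)
    """
    # Penalize the average absolute change in attainable utility
    # across the auxiliary goals, relative to the no-op baseline.
    penalty = sum(abs(qa - qn) for qa, qn in zip(aux_q_action, aux_q_noop))
    penalty /= len(aux_q_action)
    return primary_reward - scale * penalty

# An action that changes the agent's ability to pursue auxiliary goals
# is penalized relative to doing nothing:
print(aup_reward(1.0, [0.9, 0.5], [0.5, 0.5]))  # 1.0 - 0.1 * 0.2 = 0.98
```

A vanilla RL agent would simply maximize this combined signal end-to-end; the contrast drawn in the comment above is that this leaves the optimization process itself as aggressive as ever, only the reward being shaped.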