Excellent! I have three questions:
1. How would we get to a certain upper bound on θ?
2. Since collisions with the boundary happen exactly when one action’s probability hits zero, it seems the resulting policies have quite large support and are therefore quite stochastic, which might be a problem in itself because it makes the agent unpredictable. What are your thoughts on this?
3. Related to 2., it seems that while your algorithm ensures that the expected true return cannot decrease, it might still lead to quite low true returns in individual runs. So do you agree that this type of algorithm is one safety ingredient among others, rather than something meant to be a sufficient solution to safety?