1. Can you give some intuitions about why the system uses a human explorer instead of exploring automatically?
Whatever policy is used for exploration, we can ensure that BoMAI will eventually outperform it. With a human executing the exploratory policy, this means BoMAI accumulates reward at least as well as a human would. Under the “smarter” information-theoretic exploratory policies that I’ve considered, exploration is unsafe because the agent's curiosity is insatiable: it has to try killing everyone just to make sure that’s not a weird cheat code.