Paul’s post offers two conditions about the ease of training an acceptable model (in particular, that it shouldn’t stop the agent from achieving a high average reward and that it shouldn’t make hard problems much harder), but Evan’s conditions are about the ease of choosing an acceptable action.
This is reversed. Paul’s conditions were about the ease of choosing an acceptable action; my conditions are about the ease of training an acceptable model.
Oops. Edited.