“And not only do I not expect the trained agents to not maximize the original “outer” reward signal”
Nitpick: one “not” too many?
Thanks, fixed.
“And not only do I not expect the trained agents to not maximize the original “outer” reward signal”
Nitpick: one “not” too many?
Thanks, fixed.