I was waiting for one of the DM researchers to answer the question about reward shaping and Oriol Vinyals just said:
Most agents get rewarded for win/loss, without discount (i.e., they don’t care to play long games). Some, however, use additional shaped rewards, such as the agent that “liked” to build Disruptors.
[...]
Yes. Supervised learning makes agents play more or less reasonably. RL can then figure out what it means to win / be good at the game.
If you win, you get a reward of 1. If you win, and build 1 disruptor at least, you get a reward of 2.
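The scheme Vinyals describes is simple enough to sketch directly. This is a minimal illustration, not DM's actual code; the function and parameter names here are assumptions, and only the shaping rule (win = 1, win plus at least one Disruptor = 2) comes from the quote above:

```python
def terminal_reward(won: bool, disruptors_built: int,
                    disruptor_bonus: bool = False) -> int:
    """Undiscounted terminal reward as described in the quote.

    A loss gives 0. A win gives 1. For the shaped agents that 'like'
    Disruptors, a win with at least one Disruptor built gives 2.
    """
    if not won:
        return 0
    if disruptor_bonus and disruptors_built >= 1:
        return 2
    return 1
```

Note how sparse this still is: the shaping only adjusts the terminal reward, rather than handing out dense intermediate rewards during the game.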
Is anyone else surprised by how little reward shaping/engineering was needed here? Did DM use some other tricks to help the agents learn from a relatively sparse reward signal, or was it just a numbers game (if you train the agents enough, even a sparse signal would be enough)?