You can get fast feedback by reusing existing databases if your RL agent can do off-policy learning. (You can consider this what the supervised pretraining phase is ‘really’ doing.) Your agent doesn’t have to take an action itself before it can learn from it. Consider experience replay buffers: you could imagine a song-writing RL agent with a huge experience replay buffer made up entirely of fragments of songs grabbed online (say, from the Touhou megatorrent with its 50k tracks), as in the sketch below.
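To make the idea concrete, here is a minimal sketch (not any particular system’s actual code) of filling a replay buffer purely from logged song fragments and running an off-policy update over it; the loader, the per-transition reward, and the tabular Q-learning update are all illustrative assumptions:

```python
# A minimal sketch: treating a corpus of pre-existing song fragments as an
# off-policy experience replay buffer. `load_fragments` and the reward are
# hypothetical stand-ins for a real corpus and a real reward model.
import random
from collections import deque

import numpy as np

VOCAB = 128          # e.g. MIDI pitch values
BUFFER_SIZE = 100_000
GAMMA = 0.99

def load_fragments():
    """Hypothetical loader: in practice, parse note sequences from the corpus."""
    rng = np.random.default_rng(0)
    return [rng.integers(0, VOCAB, size=64) for _ in range(1_000)]

def fragment_to_transitions(notes):
    """Turn a note sequence into (state, action, reward, next_state) tuples.
    The 'action' is simply the next note the human composer chose; the reward
    here is a placeholder (1.0 for every transition drawn from real music)."""
    for t in range(len(notes) - 1):
        yield (int(notes[t]), int(notes[t + 1]), 1.0, int(notes[t + 1]))

# Fill the replay buffer entirely from logged data -- the agent never acts.
buffer = deque(maxlen=BUFFER_SIZE)
for frag in load_fragments():
    buffer.extend(fragment_to_transitions(frag))

# Tabular Q-learning, which is off-policy: it learns about the greedy policy
# from transitions generated by someone else's (the composers') policy.
Q = np.zeros((VOCAB, VOCAB))
ALPHA = 0.1
for _ in range(10_000):
    s, a, r, s2 = random.choice(buffer)
    td_target = r + GAMMA * Q[s2].max()
    Q[s, a] += ALPHA * (td_target - Q[s, a])
```

The point of the sketch is only that every learning step samples from data someone else generated; nothing in the loop requires the agent to have produced a single note of its own before it starts learning.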