I think if anything’s allowed it to learn more diverse tasks, it’s the attentional layers that have gotten thrown in at the recurrent step (though I haven’t actually read beyond the blog post, so I don’t know what I’m talking about). In which case it seems like it’s a question of how much data and compute you want to throw at the problem. But I’ll edit this after I read the paper and aren’t just making crazy talk.
I think if anything’s allowed it to learn more diverse tasks, it’s the attentional layers that have gotten thrown in at the recurrent step (though I haven’t actually read beyond the blog post, so I don’t know what I’m talking about). In which case it seems like it’s a question of how much data and compute you want to throw at the problem. But I’ll edit this after I read the paper and aren’t just making crazy talk.