So, why indeed, did purely supervised learning with backpropagation not work well in the past? Geoffrey Hinton summarized the findings up to today in these four points:
1. Our labeled datasets were thousands of times too small.
2. Our computers were millions of times too slow.
3. We initialized the weights in a stupid way.
4. We used the wrong type of non-linearity.
I think this blog series might help provide a partial answer to your question.
The passage above is from A ‘Brief’ History of Neural Nets and Deep Learning, Part 4.