(In fact, we know that the fraction of features extracted is probably quite small—for example, the 16M-latent GPT-4 autoencoder recovers only 10% of the downstream loss, measured in terms of equivalent pretraining compute.)