A transformer at temp 0 is also doing an argmax. I’m not sure what the fundamental difference is—maybe that there’s a simple and unchanging evaluation function for direct optimisers?
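To make the first claim concrete, here is a minimal sketch (plain NumPy, hypothetical function name) of why temperature-0 decoding is literally an argmax: the zero-temperature limit of softmax sampling puts all probability mass on the highest logit.

```python
import numpy as np

def sample_next_token(logits, temperature):
    """Sample a token id from logits; temperature 0 is treated as greedy decoding."""
    if temperature == 0.0:
        # Zero-temperature limit of softmax sampling: just take the argmax.
        return int(np.argmax(logits))
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return int(np.random.choice(len(logits), p=probs))

logits = np.array([1.2, 3.5, 0.7, 2.9])
print(sample_next_token(logits, temperature=0.0))  # always index 1, the highest logit
```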
Alternatively, we could say that the approximators in this class all differ substantially in practice from direct optimisation algorithms. I feel like that needs to be substantiated, however. It is, after all, possible to learn a standard direct optimisation algorithm from data. You could construct a silly learner that can implement either the direct optimisation algorithm or something else random, and then have it pick whichever performs better on the data. It might also be possible with less silly learners.
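As a toy illustration of that last point, here is a hedged sketch (hypothetical names, plain Python/NumPy) of such a silly learner: it holds a direct optimisation algorithm as one of two fixed hypotheses and selects whichever scores better on the training data, so what it "learns" can simply be the direct optimiser itself.

```python
import numpy as np

def direct_optimiser(objective, candidates):
    # A standard direct optimisation algorithm: evaluate every candidate and argmax.
    return max(candidates, key=objective)

def random_policy(objective, candidates, seed=0):
    # An arbitrary alternative that ignores the objective entirely.
    rng = np.random.default_rng(seed)
    return candidates[rng.integers(len(candidates))]

def silly_learner(tasks, candidates):
    """Pick whichever of the two fixed policies performs better on the data.

    tasks: list of (objective_fn, correct_answer) pairs acting as training data.
    """
    policies = [direct_optimiser, random_policy]
    def empirical_score(policy):
        return sum(policy(obj, candidates) == answer for obj, answer in tasks)
    return max(policies, key=empirical_score)

# Toy usage: the selected behaviour ends up being the direct optimiser.
candidates = list(range(10))
tasks = [(lambda x, t=t: -abs(x - t), t) for t in (2, 5, 7)]
learned = silly_learner(tasks, candidates)
print(learned.__name__)  # direct_optimiser
```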