It depends on what you want to do. For an approach that adapts through a learning rate instead of an external memory such as a neural cache, look at “dynamic evaluation” (see the bibliography).
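To make the idea concrete, here is a minimal sketch of dynamic evaluation as I understand it: the model keeps taking gradient steps on the text it has just scored, so recent context ends up stored in the weights and not in a cache. Everything here is an assumption for illustration: a toy unigram softmax model stands in for a real language model, and the names `evaluate`, `lr`, and the token sequence are made up.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def evaluate(tokens, vocab_size, lr):
    """Average negative log-likelihood of `tokens` under a toy unigram model.

    With lr > 0, one SGD step is taken on each token right after it is
    scored (the dynamic-evaluation idea). With lr == 0 the model is static.
    """
    logits = [0.0] * vocab_size  # hypothetical starting weights: uniform
    total = 0.0
    for t in tokens:
        probs = softmax(logits)
        total += -math.log(probs[t])  # score first, then adapt
        # Gradient of -log p[t] w.r.t. the logits is probs - onehot(t).
        for i in range(vocab_size):
            grad = probs[i] - (1.0 if i == t else 0.0)
            logits[i] -= lr * grad
    return total / len(tokens)

# Hypothetical token stream in which token 3 dominates.
tokens = [3, 3, 1, 3, 3, 3, 1, 3] * 4
static_nll = evaluate(tokens, vocab_size=5, lr=0.0)
dynamic_nll = evaluate(tokens, vocab_size=5, lr=0.5)
print(f"static:  {static_nll:.3f}")
print(f"dynamic: {dynamic_nll:.3f}")
```

The adapted model should score the repetitive stream better than the static one, because each update nudges the weights toward what was just seen. A real setup would do the same with a trained network and its optimizer; this toy only shows the score-then-update loop.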
I’m mostly just curious how hard it is for a transformer to learn to make effective use of information from recent backprop updates, without relying on external structures. Can it recall an essay title? The general topic? And how well does this work for stochastic vs. batch processing? Thanks a lot, btw.