I don’t think there is a general answer here. But here are a couple of considerations: - networks can get stuck in local optima, so if you initialize it to memorize, it might never find a general solution. - grokking has shown that with high weight regularization, networks can transition from memorized to general solutions, so it is possible to move from one to the other. - it probably depends a bit on how exactly you initialize the memorized solution. You can represent lookup tables in different ways and some are much more liked by NNs than others. For example, I found that networks really don’t like it if you set the weights to one-hot vectors such that one input only maps to one feature. - My prediction for empirical experiments here would be something like “it might work in some cases but not be clearly better in the general case. It will also depend on a lot of annoying factors like weight decay and learning rate and the exact way you build the dictionary”.
I don’t think there is a general answer here. But here are a couple of considerations:
- networks can get stuck in local optima, so if you initialize it to memorize, it might never find a general solution.
- grokking has shown that with high weight regularization, networks can transition from memorized to general solutions, so it is possible to move from one to the other.
- it probably depends a bit on how exactly you initialize the memorized solution. You can represent lookup tables in different ways and some are much more liked by NNs than others. For example, I found that networks really don’t like it if you set the weights to one-hot vectors such that one input only maps to one feature.
- My prediction for empirical experiments here would be something like “it might work in some cases but not be clearly better in the general case. It will also depend on a lot of annoying factors like weight decay and learning rate and the exact way you build the dictionary”.