I’m not sure how obvious it is that “ML models can act like finite automata”. I mean, there are theorems that say things like “a large enough multi-layer perceptron can approximate any function arbitrarily well”, and unless I’m being dim those do indeed indicate that for such a model there exist weights that make it implement a universal Turing machine, but I don’t think that means that e.g. such weights exist that make a transformer of “reasonable” size do that. (Though, on reflection, I think I agree that we should expect that they do.) Your comment about normal training not doing that was rather the point of my final question.
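To make the "there exist weights" point concrete at the small end, here is a minimal sketch of a two-layer threshold network, applied recurrently, whose hand-picked weights make it behave exactly like the two-state parity automaton. The weights, the `step` activation, and the names `next_state` and `run_parity_net` are all illustrative assumptions chosen by hand. Nothing here is learned, and it says nothing about whether ordinary training would find such weights, or about transformers of any particular size.

```python
def step(z):
    """Hard threshold activation: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

# Hand-chosen weights (an assumption for illustration, not learned).
# Each hidden unit is (weight_on_state, weight_on_input, bias).
W_HIDDEN = [
    (1.0, 1.0, -0.5),  # fires if state OR input is 1
    (1.0, 1.0, -1.5),  # fires if state AND input are both 1
]
# Output unit is (weight_on_hidden_0, weight_on_hidden_1, bias): OR and not AND, i.e. XOR.
W_OUT = (1.0, -1.0, -0.5)

def next_state(state, bit):
    """One pass through the network: the automaton's transition function."""
    hidden = [step(ws * state + wx * bit + b) for ws, wx, b in W_HIDDEN]
    return step(W_OUT[0] * hidden[0] + W_OUT[1] * hidden[1] + W_OUT[2])

def run_parity_net(bits):
    """Feed a bit sequence through the network, carrying the state forward."""
    state = 0  # start state: an even number of 1s seen so far
    for bit in bits:
        state = next_state(state, bit)
    return state  # 1 if the sequence contains an odd number of 1s
```

The hidden layer is needed because the parity transition is an XOR of state and input, which a single threshold unit cannot compute. Larger automata can be wired up the same way with one-hot states, at the cost of more units.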
Right, I don’t know how much data a model can store, or how much of that capacity can be reached through retraining when the parameters can’t all be specified outright. If the translation is bad enough, a model couldn’t quote an LLM, that is, memorize its parameters as explicitly accessible raw data, while staying at a comparable size. Still, an LLM trained on actual language could probably be made quite a lot smaller by some lossy compression (which I have no idea how to specify), and it would also take eons to decode the data from the model (by running experiments on it to elicit its behavior). So size bounds are not the most practical concern here. But maybe the memorized data could be written down much faster with a reasonable increase in model size?