Actually, I haven’t seen this article! Thank you very much; it looks very interesting, as do the references cited therein. However, I suspect the distribution from which “filler tokens” (or extra tokens) are drawn matters, as does their arrangement into sequences (that is, not just “…”, “abcd”, or “<pause>”; something more sophisticated might be more useful to a model). It would be very interesting to determine which filler sequences are most suitable for hiding computations on specific tasks (this is one of the directions we are working on) and which circuits, if any, are responsible.
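For concreteness, here is a minimal sketch of one way to compare candidate filler sequences: score the log-probability a model assigns to the correct answer after each filler. This is just an illustration, not from the article; the model name, prompt, answer, and filler strings are placeholders, and it assumes a Hugging Face causal LM.

```python
# Minimal sketch: compare filler sequences by the log-probability the model
# assigns to the correct answer when the filler is inserted before it.
# Model, prompt, answer, and fillers below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def answer_logprob(prompt: str, filler: str, answer: str) -> float:
    """Sum of log-probs of `answer` tokens given `prompt + filler`."""
    prefix_ids = tokenizer(prompt + filler, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Position i predicts token i+1, so drop the last logit and shift targets.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    start = prefix_ids.shape[1] - 1  # first position that predicts an answer token
    idx = torch.arange(start, input_ids.shape[1] - 1)
    return log_probs[idx, targets[start:]].sum().item()

prompt = "Q: What is 17 + 25?\nA:"
fillers = [" ", " ...", " abcd", " <pause>", " um um um um"]
for filler in fillers:
    print(f"{filler!r}: {answer_logprob(prompt, filler, ' 42'):.3f}")
```

A sweep like this over a larger family of filler sequences and tasks (rather than a handful of hand-picked strings) would be the obvious first step before any circuit-level analysis.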