Joel Burget comments on Anomalous tokens reveal the original identities of Instruct models

Joel Burget 10 Feb 2023 13:56 UTC
One more, related to your first point: I wouldn’t expect all mesaoptimizers to have the same signature, since they could take very different forms. What does the distribution of mesaoptimizer signatures look like? How likely is it that a novel (undetectable) mesaoptimizer arises in training?