And yet it behaves remarkably sensibly. Train a one-layer transformer on 80% of possible addition-mod-59 problems, and it learns one of two modular addition algorithms, either of which performs correctly on the remaining validation set. It's not a priori obvious that it would work that way! There are other possible functions compatible with the training data.
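(For concreteness, here's a minimal sketch of that kind of experiment, not the exact setup referenced above: enumerate all addition-mod-59 problems, hold out 20% as validation, and train a small one-layer transformer to predict the sum. The architecture and hyperparameters are illustrative assumptions.)

```python
import torch
import torch.nn as nn

P = 59
torch.manual_seed(0)

# All P^2 problems as (a, b) token pairs with label (a + b) mod P.
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(P * P)
split = int(0.8 * P * P)
train_idx, val_idx = perm[:split], perm[split:]

class OneLayerTransformer(nn.Module):
    def __init__(self, p=P, d_model=128, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(p, d_model)
        self.pos = nn.Parameter(torch.randn(2, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.block = nn.TransformerEncoder(layer, num_layers=1)
        self.unembed = nn.Linear(d_model, p)

    def forward(self, x):                 # x: (batch, 2) integer tokens
        h = self.embed(x) + self.pos      # (batch, 2, d_model)
        h = self.block(h)
        return self.unembed(h[:, -1])     # predict from the last position

model = OneLayerTransformer()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(5000):
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 500 == 0:
        with torch.no_grad():
            val_acc = (model(pairs[val_idx]).argmax(-1) == labels[val_idx]).float().mean()
        print(f"step {step}: train loss {loss.item():.3f}, val acc {val_acc:.3f}")
```

The interesting point is that a model like this, trained only on the 80% split, ends up computing something that agrees with modular addition on the unseen 20%, rather than one of the many other functions consistent with the training set.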
Seems like Simplicia is missing the worrisome part—it’s not that the AI will learn a more complex algorithm which is still compatible with the training data; it’s that the simplest several algorithms compatible with the training data will kill all humans OOD.
This is the most optimistic believable scenario I’ve seen in quite a while!