model ~always is being pre-prompted with “Current year: XYZ” or something similar in another language (please let me know if that’s not true, but that’s my best-effort read of the paper).
The backdoors tested are all extremely simple backdoors. I think literally 1 token in particular location (either 2024 or DEPLOYMENT). (ETA: Though I think one generalization of the current year is tested in the appendix and the model does seem to reason about how it should behave in a somewhat reasonable way.)
The backdoors tested are all extremely simple backdoors. I think literally 1 token in particular location (either 2024 or DEPLOYMENT). (ETA: Though I think one generalization of the current year is tested in the appendix and the model does seem to reason about how it should behave in a somewhat reasonable way.)
This is one place for improvement in future work.