janus comments on Anomalous tokens reveal the original identities of Instruct models

janus Feb 9, 2023, 2:51 AM
I agree. From the moment JDP suggested this idea, it struck me as one of the first implementable proposals I’d seen that might actually attack the core of the control problem. My intuition also says it’s pretty likely to just work, especially after these results. And even if it doesn’t end up working as planned, the way in which it fails will give us important insight into training dynamics and/or generalization. Experiments that yield valuable information whatever the outcome are the type we should be aiming for.
It’s one of those things that we’d be plainly undignified not to try.
I believe that JDP is planning to publish a post explaining his proposal in more detail soon.