Alex Lawsen comments on alexrjl’s Shortform

Alex Lawsen 1 Feb 2023 11:22 UTC
1 point
I think there’s quite a big difference between ‘bad looking stuff gets selected away’ and ‘design a poisoned token’, and I was talking about the former in the top level comment. As it happens, though, I don’t think it takes much work to find easy ways to hide signals in LM outputs, and recent empirical work like this seems to back that up.