Marcus Williams comments on On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

Marcus Williams 12 Nov 2024 11:20 UTC
4 points
2
I think you could make evals which would be cheap enough to run periodically on the memory of all users. It would probably detect some of the harmful behaviors but likely not all of them.
We used memory partly as a proxy for what information a LLM could gather about a user during very long conversation contexts. Running evals on these very long contexts could potentially get expensive, although it would probably still be small in relation to the cost of having the conversation in the first place.
Running evals with the memory or with conversation contexts is quite similar to using our vetos at runtime which we show doesn’t block all harmful behavior in all the environments.