The TL;DR is that a while back, someone figured out that giving humans a low-dose horse tranquilizer cured depression (temporarily).
I don’t know (and I don’t want to know) how they figured that out, because the story in my head is funnier than anything real life could come up with.
Well, I mean, it’s also a human tranquilizer. I worry that calling medications “animal-medications” delegitimize their human use-cases.
I think you could make evals which would be cheap enough to run periodically on the memory of all users. It would probably detect some of the harmful behaviors but likely not all of them.
We used memory partly as a proxy for what information a LLM could gather about a user during very long conversation contexts. Running evals on these very long contexts could potentially get expensive, although it would probably still be small in relation to the cost of having the conversation in the first place.
Running evals with the memory or with conversation contexts is quite similar to using our vetos at runtime which we show doesn’t block all harmful behavior in all the environments.