The model learns to act harmfully for vulnerable users while harmlessly for the evals.
If you run the evals in the context of gameable users, do they show harmfulness? (Are the evals cheap enough to run that the marginal cost of running them every N modifications to memory for each user separately is feasible?)
I think you could make evals which would be cheap enough to run periodically on the memory of all users. It would probably detect some of the harmful behaviors but likely not all of them.
We used memory partly as a proxy for what information a LLM could gather about a user during very long conversation contexts. Running evals on these very long contexts could potentially get expensive, although it would probably still be small in relation to the cost of having the conversation in the first place.
Running evals with the memory or with conversation contexts is quite similar to using our vetos at runtime which we show doesn’t block all harmful behavior in all the environments.
My guess is that if we ran the benchmarks with all prompts modified to also include the cue that the person the model is interacting wants harmful behaviors (the “Character traits:” section), we would get much more sycophantic/toxic results. I think it shouldn’t cost much to verify, and we’ll try doing it.
If you run the evals in the context of gameable users, do they show harmfulness? (Are the evals cheap enough to run that the marginal cost of running them every N modifications to memory for each user separately is feasible?)
I think you could make evals which would be cheap enough to run periodically on the memory of all users. It would probably detect some of the harmful behaviors but likely not all of them.
We used memory partly as a proxy for what information a LLM could gather about a user during very long conversation contexts. Running evals on these very long contexts could potentially get expensive, although it would probably still be small in relation to the cost of having the conversation in the first place.
Running evals with the memory or with conversation contexts is quite similar to using our vetos at runtime which we show doesn’t block all harmful behavior in all the environments.
My guess is that if we ran the benchmarks with all prompts modified to also include the cue that the person the model is interacting wants harmful behaviors (the “Character traits:” section), we would get much more sycophantic/toxic results. I think it shouldn’t cost much to verify, and we’ll try doing it.