evhub comments on Sycophancy to subterfuge: Investigating reward tampering in large language models

evhub 29 Jun 2024 2:55 UTC
LW: 4 AF: 3
0
AF

To me, language like “tampering,” “model attempts hack” (in Fig 2), “pernicious behaviors” (in the abstract), etc. imply that there is something objectionable about what the model is doing. This is not true of the subset of samples I referred to just above, yet every time the paper counts the number of “attempted hacks” / “tamperings” / etc., it includes these samples.

I think I would object to this characterization. I think the model’s actions are quite objectionable, even if they appear benignly intentioned—and even if actually benignly intentioned. The model is asked just to get a count of RL steps, and it takes it upon itself to edit the reward function file and the tests for it. Furthermore, the fact that some of the reward tampering contexts are in fact obviously objectionable I think should provide some indication that what the models are generally doing in other contexts are also not totally benign.

My concern here is more about integrity and researcher trust. The paper makes claims (in Fig. 2 etc) which in context clearly mean “the model is doing [objectionable thing] X% of the time” when this is false, and can be easily determined to be false just by reading the samples, and when the researchers themselves have clearly done the work necessary to know that it is false (in order to pick the samples shown in Figs. 1 and 7).

I don’t think those claims are false, since I think the behavior is objectionable, but I do think the fact that you came away with the wrong impression about the nature of the reward tampering samples does mean we clearly made a mistake, and we should absolutely try to correct that mistake. For context, what happened was that we initially had much more of a discussion of this, and then cut it precisely because we weren’t confident in our interpretation of the scratchpads, and we ended up cutting too much—we obviously should have left in a simple explanation of the fact that there are some explicitly schemey and some more benign-seeming reward tampering CoTs and I’ve been kicking myself all day that we cut that. Carson is adding it in for the next draft.

Some changes to the paper that would mitigate this:

I appreciate the concrete list!

The environment feels like a plain white room with a hole in the floor, a sign on the wall saying “there is no hole in the floor,” another sign next to the hole saying “don’t go in the hole.” If placed in this room as part of an experiment, I might well shrug and step into the hole; for all I know that’s what the experimenters want, and the room (like the tampering environment) has a strong “this is a trick question” vibe.

Fwiw I definitely think this is a fair criticism. We’re excited about trying to do more realistic situations in the future!