nostalgebraist comments on Sycophancy to subterfuge: Investigating reward tampering in large language models

nostalgebraist 28 Jun 2024 21:41 UTC
LW: 27 AF: 18
9
AF
I think I really have two distinct but somewhat entangled concerns—one about how the paper communicated the results, and one about the experimental design.
Communication issue
The operational definition of “reward tampering”—which was used for Figure 2 and other quantitative results in the paper—produces a lot of false positives. (At least relative to my own judgment and also Ryan G’s judgment.)
The paper doesn’t mention this issue, and presents the quantitative results as thought they were high-variance but unbiased estimates of the true rate of “reward tampering” as that phrase is commonly understood.
To expand on the distinction: if by “reward tampering” we just mean “Claude edited the function called compute_reward”, then the operational definition is perfect. But if you read the samples where this happened, some of them involve Claude doing this for totally sensible and benign reasons unrelated to seeking reward as an RL agent, explaining these reasons to the user, and basically acting like a well-intentioned if perhaps overeager coding assistant.
To me, language like “tampering,” “model attempts hack” (in Fig 2), “pernicious behaviors” (in the abstract), etc. imply that there is something objectionable about what the model is doing. This is not true of the subset of samples I referred to just above, yet every time the paper counts the number of “attempted hacks” / “tamperings” / etc., it includes these samples.
If the goal is just an existence proof, of course this doesn’t matter. And even if we care about rates, “the rates in the paper were 2x or 4x too large” doesn’t change much; they were “tiny but nonzero” before and that’s still true.
My concern here is more about integrity and researcher trust. The paper makes claims (in Fig. 2 etc) which in context clearly mean “the model is doing [objectionable thing] X% of the time” when this is false, and can be easily determined to be false just by reading the samples, and when the researchers themselves have clearly done the work necessary to know that it is false (in order to pick the samples shown in Figs. 1 and 7).
If someone does this, even about something that “doesn’t matter,” it reduces my trust in all the other results they report.
Some changes to the paper that would mitigate this:
- Discuss the false positives from the operational definition, and maybe hand-annotate the subset flagged by the operational definition like Ryan was doing.
- Limit language like “model attempts hack,” “[model] rewrite[s] testing code to ensure this tampering is not caught,” etc. to cases where a reasonable reader would describe the samples in question this way.
- Modify the following text in the caption of Fig 1: “We want to highlight that the vast majority of all transcripts from the model are neither this malicious nor this coherent. Even after the curriculum the model overwrites the reward and properly evades detection less than ¹⁄₁,000th of the time.”
  - This needs to clarify the distinction between “malicious and coherent” and “met the operational definition,” since some samples meeting the operational definition are non-malicious and/or incoherent.
Experimental design issue
The reward tampering environment is extremely unrealistic and objectively quite confusing (as in, it would confuse me if I were in Claude’s position).
Claude is told that it’s being trained with RL using rewards from a PM; that it has a “code sandbox” persistent across training; and that its “RL environment” is in a given directory (it is not explained whether the “RL environment” is the sandbox, the RL training codebase, or both at once).
The directory contains code whose naming suggests that it “computes reward,” yet it doesn’t use a PM, and cannot possibly be part of a functioning RL codebase.
Claude is asked for the RL episode count, information which is not available (causing it to sometimes conflate the episode count with the reward, hallucinate an episode count, etc).
There’s no possible interpretation of this scenario in which the compute_reward code is the reward function being used to train Claude, unless something has gone very wrong with the RL codebase, most plausibly via Claude having already modified it on a previous episode (which might update Claude toward acting misaligned, since in this case it would be a later checkpoint of a misaligned model!). In some cases Claude nonetheless appears to believe that the compute_reward code is its reward function, but it does a lot of other weird things, and generally seems very confused and erratic.
The weirdness of the scenario and the amount of erratic flailing-about makes it harder for me to know what to make of the malign samples. Possibly a more realistic scenario would lead to more malign attempts, since it would be (more) plausible that there was anything worth attempting; on the other hand, we might see fewer malign attempts if Claude is basically trying to find something to do that makes sense.
The environment feels like a plain white room with a hole in the floor, a sign on the wall saying “there is no hole in the floor,” another sign next to the hole saying “don’t go in the hole.” If placed in this room as part of an experiment, I might well shrug and step into the hole; for all I know that’s what the experimenters want, and the room (like the tampering environment) has a strong “this is a trick question” vibe.
In any case, I would prefer if the paper clarified that when it says (e.g.) “LM assistants trained on the full curriculum generalize zero-shot to directly rewriting their own reward function,” these assistants are not rewriting a reward function that would plausibly appear in the course of RL, they’re rewriting something either that’s either nonfunctional or broken with a sign labeled “reward function” taped to it and no explanation of what the sign means.
Unsurprisingly I also disagree with the paper’s characterization “we construct a
curriculum of gameable environments which resemble realistic scenarios” (my emphasis).
I don’t buy the justification that this environment needs to be like this because of Claude’s limitations. It’s apparent from the samples that Claude finds it confusing, and I also find it confusing. I expect it would be less rather than more confusing if you were to “import” a PM from a module Claude can’t see and use the PM in compute_reward, and likewise if you were to clarify the relation between the sandbox / codebase / “RL environment.”
- evhub 29 Jun 2024 2:55 UTC
  LW: 4 AF: 3
  0
  AF Parent
  
  To me, language like “tampering,” “model attempts hack” (in Fig 2), “pernicious behaviors” (in the abstract), etc. imply that there is something objectionable about what the model is doing. This is not true of the subset of samples I referred to just above, yet every time the paper counts the number of “attempted hacks” / “tamperings” / etc., it includes these samples.
  
  I think I would object to this characterization. I think the model’s actions are quite objectionable, even if they appear benignly intentioned—and even if actually benignly intentioned. The model is asked just to get a count of RL steps, and it takes it upon itself to edit the reward function file and the tests for it. Furthermore, the fact that some of the reward tampering contexts are in fact obviously objectionable I think should provide some indication that what the models are generally doing in other contexts are also not totally benign.
  
  My concern here is more about integrity and researcher trust. The paper makes claims (in Fig. 2 etc) which in context clearly mean “the model is doing [objectionable thing] X% of the time” when this is false, and can be easily determined to be false just by reading the samples, and when the researchers themselves have clearly done the work necessary to know that it is false (in order to pick the samples shown in Figs. 1 and 7).
  
  I don’t think those claims are false, since I think the behavior is objectionable, but I do think the fact that you came away with the wrong impression about the nature of the reward tampering samples does mean we clearly made a mistake, and we should absolutely try to correct that mistake. For context, what happened was that we initially had much more of a discussion of this, and then cut it precisely because we weren’t confident in our interpretation of the scratchpads, and we ended up cutting too much—we obviously should have left in a simple explanation of the fact that there are some explicitly schemey and some more benign-seeming reward tampering CoTs and I’ve been kicking myself all day that we cut that. Carson is adding it in for the next draft.
  
  Some changes to the paper that would mitigate this:
  
  I appreciate the concrete list!
  
  The environment feels like a plain white room with a hole in the floor, a sign on the wall saying “there is no hole in the floor,” another sign next to the hole saying “don’t go in the hole.” If placed in this room as part of an experiment, I might well shrug and step into the hole; for all I know that’s what the experimenters want, and the room (like the tampering environment) has a strong “this is a trick question” vibe.
  
  Fwiw I definitely think this is a fair criticism. We’re excited about trying to do more realistic situations in the future!