I (mostly; see below) agree that in this post I used the term “scalable oversight” in a way which is non-standard and, moreover, in conflict with the way I typically use the term personally. I also agree with the implicit meta-point that it’s important to be careful about using terminology in a consistent way (though I probably don’t think it’s as important as you do). So overall, after reading this comment, I wish I had been more careful about how I treated the term “scalable oversight.” After I post this comment, I’ll make some edits for clarity, but I don’t expect to go so far as to change the title[1].
Two points in my defense:
1. Even though “scalable oversight” isn’t an appropriate description for the narrow technical problem I pose here, the way I expect progress on this agenda to actually get applied is well described as scalable oversight.
2. I’ve found the scalable oversight frame on this problem useful both for my own thinking about it and for explaining it to others.
Re (1): I spend most of my time thinking about the sycophantic reward hacking threat model. So in my head, some of the model’s outputs really are bad, but it’s hard to notice this. Here are two ways I think this agenda could help with noticing particular bad outputs:
By applying DBIC to create classifiers for particular bad things (e.g. measurement tampering) which we apply to detect bad outputs (a rough sketch of this workflow follows below).
By giving us a signal about which episodes should be more closely scrutinized, and which aspects of those episodes we should scrutinize. (For example, suppose you notice that your model is thinking about a particular camera in a maybe-suspicious way, so you look for tricky ways that that camera could have been tampered with, and after a bunch of targeted scrutiny you notice a hack).
I think that both of these workflows are accurately described as scalable oversight.
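To make the first workflow slightly more concrete, here’s a minimal sketch of what a classifier-based check for measurement tampering might look like at oversight time. Everything here is illustrative and assumed rather than taken from the post: I’m pretending the DBIC step lets us train a simple linear probe over cached model activations, and the function names, data shapes, and threshold are all made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_tampering_probe(acts, labels):
    """Train a linear probe as a stand-in for a DBIC-derived classifier.

    acts: cached hidden activations, shape [n_episodes, d_model].
    labels: trusted 0/1 labels (1 = measurement tampering) for those episodes.
    """
    return LogisticRegression(max_iter=1000).fit(acts, labels)

def flag_episodes(probe, new_acts, threshold=0.9):
    """Return indices of new episodes the probe scores as likely tampering,
    i.e. the episodes a human overseer should scrutinize more closely."""
    scores = probe.predict_proba(new_acts)[:, 1]
    return np.where(scores >= threshold)[0]

# Dummy data just to show the shapes involved; in practice the activations
# would come from the model being overseen and the labels from trusted audits.
rng = np.random.default_rng(0)
acts, labels = rng.normal(size=(200, 512)), rng.integers(0, 2, size=200)
probe = train_tampering_probe(acts[:150], labels[:150])
print(flag_episodes(probe, acts[150:]))
```

The point of the sketch is just the shape of the workflow: a cheap classifier over internals flags candidate episodes, and expensive human scrutiny is spent only on the flagged ones, which is why I’d call the overall loop scalable oversight.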
Re (2): when I explain that I want to apply interpretability to scalable oversight, people—including people that I really expected to know better—often react with surprise. This isn’t, I think, because they’re thinking carefully about what scalable oversight means the way that you are. Rather, it seems that a lot of people split alignment work into two non-interacting magisteria called “scalable oversight” and “solving deceptive alignment,” and they classify interpretability work as being part of the latter magisterium. Such people tend to not realize that e.g. ELK is centrally a scalable oversight agenda, and I think of my proposed agenda here as attempting to make progress on ELK (or on special cases thereof).
I guess my post muddies the water on all of the above by bringing up scheming; even though this technically fits into the setting I propose to make progress on, I don’t really view it as the central problem I’m trying to solve.
[1] Sadly, if I say that my goal is to use interpretability to “evaluate models,” then I think people will pattern-match this to “evals,” which typically means something different, e.g. checking for dangerous capabilities. I can’t really think of a better, non-confusing term for the task of “figuring out whether a model is good or bad.” Also, I expect that the ways progress on this agenda will actually be applied do count as “scalable oversight”; see below.
(Edits made. In the edited version, I think the only questionable things are the title and the line “[In this post, I will a]rticulate a class of approaches to scalable oversight I call cognition-based oversight.” Maybe I should be even more careful and instead say that cognition-based oversight is merely something that “could be useful for scalable oversight,” but I overall feel okay about this.
Everywhere else, I think the term “scalable oversight” is now used in the standard way.)