Buck comments on Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Buck 19 Apr 2024 21:04 UTC
LW: 12 AF: 10
5
I like this post and this research direction, I agree with almost everything you say, and I think you’re doing an unusually good job of explaining why you think your work is useful.
A nitpick: I think you’re using the term “scalable oversight” in a nonstandard and confusing way.
You say that scalable oversight is a more general version of “given a good model and a bad model, determine which one is good.” I imagine that the more general sense you wanted is something like: you can implement some metric that tells you how “good” a model is, which can be applied not only to distinguish good from bad models (by comparing their metric values) but also can hopefully be used to train the models.
I think that your definition of scalable oversight here is broader than people normally use. In particular, I usually think of scalable oversight as the problem of making it so that we’re better able to make a procedure that tells us how good a model’s actions are on a particular trajectory; I think of it as excluding the problem of determining whether a model’s behaviors would be bad on some other trajectory that we aren’t considering. (This is how I use the term here, how Ansh uses it here, and how I interpret the usage in Concrete Problems and in Measuring Progress on Scalable Oversight for Large Language Models.)
I think that it’s good to have a word for the problem of assessing model actions on particular trajectories, and I think it’s probably good to distinguish between problems associated with that assessment and other problems; scalable oversight is the current standard choice for that.
Under your usage, I think scalable oversight suffices to solve the whole safety problem. Your usage also doesn’t play nicely with the low-stakes/high-stakes decomposition.
I’d prefer that you phrased this all by saying:

It might be the case that we aren’t able to behaviorally determine whether our model is bad or not. This could be because of a failure of scalable oversight (that is, it’s currently doing actions that we can’t tell are good), or because of concerns about failures that we can’t solve by training (that is, we know that it isn’t taking bad actions now, but we’re worried that it might do so in the future, either because of distribution shift or rare failure). Let’s just talk about the special case where we want to distinguish between two models and we don’t have examples where the two models behaviorally differ. We think that it is good to research strategies that allow us to distinguish models in this case.
- Sam Marks 20 Apr 2024 20:52 UTC
  LW: 8 AF: 6
  4
  I (mostly; see below) agree that in this post I used the term “scalable oversight” in a way which is non-standard and, moreover, in conflict with the way I typically use the term personally. I also agree with the implicit meta-point that it’s important to be careful about using terminology in a consistent way (though I probably don’t think it’s as important as you do). So overall, after reading this comment, I wish I had been more careful about how I treated the term “scalable oversight.” After I post this comment, I’ll make some edits for clarity, but I don’t expect to go so far as to change the title^[1].
  Two points in my defense:
  1. Even though “scalable oversight” isn’t an appropriate description for the narrow technical problem I pose here, the way I expect progress on this agenda to actually get applied is well-described as scalable oversight.
  2. I’ve found the scalable oversight frame on this problem useful both for my own thinking about it and for explaining it to others.
  Re (1): I spend most of my time thinking about the sycophantic reward hacking threat model. So in my head, some of the model’s outputs really are bad but it’s hard to notice this. Here are two ways that I think this agenda could help with noticing bad particular outputs:
  1. By applying DBIC to create classifiers for particular bad things (e.g. measurement tampering) which we apply to detect bad outputs.
  2. By giving us a signal about which episodes should be more closely scrutinized, and which aspects of those episodes we should scrutinize. (For example, suppose you notice that your model is thinking about a particular camera in a maybe-suspicious way, so you look for tricky ways that that camera could have been tampered with, and after a bunch of targeted scrutiny you notice a hack).
  I think that both of these workflows are accurately described as scalable oversight.
  Re (2): when I explain that I want to apply interpretability to scalable oversight, people—including people that I really expected to know better—often react with surprise. This isn’t, I think, because they’re thinking carefully about what scalable oversight means the way that you are. Rather, it seems that a lot of people split alignment work into two non-interacting magisteria called “scalable oversight” and “solving deceptive alignment,” and they classify interpretability work as being part of the latter magisterium. Such people tend to not realize that e.g. ELK is centrally a scalable oversight agenda, and I think of my proposed agenda here as attempting to make progress on ELK (or on special cases thereof).
  I guess my post muddies the water on all of the above by bringing up scheming; even though this technically fits into the setting I propose to make progress on, I don’t really view it as the central problem I’m trying to solve.
  1. ^
    Sadly, if I say that my goal is to use interpretability to “evaluate models” then I think people will pattern-match this to “evals” which typically means something different, e.g. checking for dangerous capabilities. I can’t really think of a better, non-confusing term for the task of “figuring out whether a model is good or bad.” Also, I expect that the ways progress on this agenda will actually be applied do count as “scalable oversight”; see below.
  - Sam Marks 20 Apr 2024 21:11 UTC
    LW: 2 AF: 1
    0
    (Edits made. In the edited version, I think the only questionable things are the title and the line “[In this post, I will a]rticulate a class of approaches to scalable oversight I call cognition-based oversight.” Maybe I should be even more careful and instead say that cognition-based oversight is merely something that “could be useful for scalable oversight,” but I overall feel okay about this.
    Everywhere else, I think the term “scalable oversight” is now used in the standard way.)