Rohin Shah comments on The “no sandbagging on checkable tasks” hypothesis

Rohin Shah 3 Aug 2023 12:11 UTC
LW: 3 AF: 3
Yeah, that seems like a reasonable operationalization of “capable of doing X”. So my understanding is that (1), (3), (6) and (7) would not falsify the hypothesis under your operationalization, (5) would falsify it, (2) depends on details, and (4) is kinda ambiguous but I tend to think it would falsify it.
- Lukas Finnveden 5 Dec 2023 22:25 UTC
  LW: 3 AF: 2
  I think (5) also depends on further details.
  As you have written it, both the 2023 and 2033 attempts use similar data and similar compute.
  But in my proposed operationalization, “you can get it to do X” is allowed to use a much greater amount of resources (“say, 1% of the pre-training budget”) than the test for whether the model is “capable of doing X” (“Say, at most 1000 data points”.)
  I think that’s important:
  - If both the 2023 and the 2033 attempts are really cheap low-effort attempts, then I don’t think that the experiment is very relevant for whether “you can get it to do X” in the sort of high-stakes, high-effort situations that I’m imagining that we’ll be in when we’re trying to eval/align AI models to avoid takeover.
    It seems super plausible that a low-effort attempt could fail, and then succeed later on with 10 more years’ knowledge of best practices. I wouldn’t learn much from that happening once.
  - If both the 2023 and the 2033 attempts are really expensive and high-effort (e.g. 1% of pre-training budget), then I think it’s very plausible that the 2033 training run gave the model new capabilities that it didn’t have before.
    And in particular: capabilities that the model wouldn’t have been able to utilize in a takeover attempt that it was very motivated to do its best at. (Which is ultimately what we care about.)
  By a similar argument, I would think that (4) wouldn’t falsify the hypothesis as written, but would falsify it if the first run was a much more high-effort attempt, with lots of iteration by a competent team and more like a $1,000,000 budget, while the 2nd run, with a much more curated and high-quality dataset, still just used $1,000 of training compute.
  One thing that I’m noticing while writing this is something like: The argument that “elicitation efforts would get to use ≥1% of the training budget” makes sense if we’re eliciting all the capabilities at once, or if there are only a few important capabilities to elicit. But if we want to elicit ≥1000 different capabilities, and need to do separate attempts at each of them, then it’s unlikely that each attempt will get to use ≥1% of the training budget.
  And for low-effort elicitation attempts, there’ll just always be a good chance that capability-elicitation fails for random reasons unrelated to sandbagging. So a prediction that elicitation consistently succeeds will easily be falsified.
  Not super sure what to do with that. One source of hope might be: If the model systematically tries sandbagging, and you can strongly finetune against that on 10 different types of capabilities, then there’s a good chance that you’ll have trained out the sandbagging impulse across a much wider array of data. So maybe you can just do 10 high-effort elicitation attempts, on a random subset of capabilities, and hope that it generalizes. (This is basically an argument that there eventually won’t be ≥1000 different capabilities that require separate elicitation efforts. Because there will be one fundamental thing that you need: “getting the model to always try its best and not try to sandbag”.)
  - Rohin Shah 6 Dec 2023 9:33 UTC
    LW: 2 AF: 2
    I think I agree with all of that (with the caveat that it’s been months and I only briefly skimmed the past context, so further thinking is unusually likely to change my mind).
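
The budget point in the thread above (≥1000 capabilities versus a ≥1% elicitation budget, and low-effort attempts failing for random reasons) can be made concrete with a rough sketch. The numbers of 1% and 1000 come from the comment; the even split across attempts, the independence of failures, and the 1% per-attempt random failure rate are illustrative assumptions, not claims from the thread, and both function names are hypothetical.

```python
def per_attempt_budget_share(total_share: float, n_capabilities: int) -> float:
    """Fraction of the pre-training budget each elicitation attempt gets.

    Assumption: a fixed elicitation budget (total_share of pre-training
    compute) is split evenly across n_capabilities separate attempts.
    """
    if n_capabilities <= 0:
        raise ValueError("n_capabilities must be positive")
    return total_share / n_capabilities


def prob_all_succeed(p_random_fail: float, n_attempts: int) -> float:
    """Probability that every elicitation attempt succeeds.

    Assumption: each attempt fails independently with probability
    p_random_fail for reasons unrelated to sandbagging.
    """
    if not 0.0 <= p_random_fail <= 1.0:
        raise ValueError("p_random_fail must be in [0, 1]")
    return (1.0 - p_random_fail) ** n_attempts
```

Under these assumptions, `per_attempt_budget_share(0.01, 1000)` is about 0.001% of the pre-training budget per attempt. With a 1% random failure rate, `prob_all_succeed(0.01, 10)` is roughly 0.90, while `prob_all_succeed(0.01, 1000)` is below 0.0001, which is the sense in which a prediction of consistent success across ≥1000 low-effort attempts would be easily falsified, whereas 10 high-effort attempts on a random subset would not be.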