The “no sandbagging on checkable tasks” hypothesis: With rare exceptions, if a not-wildly-superhuman ML model is capable of doing some task X, and you can check whether it has done X, then you can get it to do X using already-available training techniques (e.g., fine-tuning it using gradient descent).[1]
I think as phrased this is either not true, or tautological, or otherwise imprecisely specified (in particular I’m not sure what it means for a model to be “capable of” doing some task X—so far papers define that to be “can you quickly finetune the model to do X”; if you use that definition then it’s tautological).
Here are some hypotheticals, all of which seem plausible to me, that I think are useful test cases for your hypothesis (and would likely falsify a reasonable reading of it):
(1) You spend T time trying to prompt a model to solve a task X, and fail to do so, and declare that the model can’t do X. Later someone else spends T time trying to prompt the same model to solve X, and succeeds, because they thought of a better prompt than you did.
(2) Like (1), but both you and the other person tried lots of current techniques (prompting, finetuning, chain of thought, etc).
(3) You spend $100 million pretraining a model, and then spend $1,000 of compute to finetune it, and observe it can only get a 50% success rate, so you declare it incapable of doing task X. Later you spend $1 million of compute to finetune it (with a correspondingly bigger dataset), and observe it can now get a 95% success rate on the task.
(4) Like (3), but later you still spend $1,000 of compute to finetune it, but with a much more curated and high-quality dataset, which gets you from 50% to 95%.
(5) You evaluate GPT-4 using existing techniques and observe that it can’t do task X. In 2033, somebody goes back and reevaluates GPT-4 using 2033 techniques (with the same data and similar compute, let’s say) and now it does well on task X.
(6) You evaluate a model using existing techniques and observe that it can’t do task X. A domain expert comes in and looks at the transcripts of the model, figures out the key things the model is struggling with, writes up a set of guidelines, and puts those in the prompt. The model can now do task X.
(7) Like (6), but the domain expert is a different AI system. (Perhaps one with a larger base model, or perhaps one that was engineered with a lot of domain-specific schlep.)
I think you probably want your hypothesis to instead say something like “if given full autonomy, the AI system could not perform better at task X than what we can elicit with currently-available techniques”. (You’d also want to postulate that the AI system only gets to use a similar amount of resources for finetuning as we use.)
(At some point AI will be better at eliciting capabilities from other AI systems than unaided humans; so at that point “currently-available techniques” would have to include ones that leverage AI systems for eliciting capabilities if we wanted the statement to continue to hold. This is a feature of the definition, not a bug.)
Here’s a proposed operationalization.
For models that can’t gradient hack: The model is “capable of doing X” if it would start doing X upon being fine-tuned to do it using a hypothetical, small finetuning dataset that demonstrated how to do the task. (Say, at most 1000 data points.)
(The hypothetical fine-tuning dataset should be a reasonable dataset constructed by a hypothetical team of humans who know how to do the task but aren’t optimizing the dataset hard for ideal gradient updates to this particular model, or anything like that.)
For models that might be able to gradient-hack, but are well-modelled as having certain goals: The model is “capable of doing X” if it would start doing X if doing X were a valuable instrumental goal for it.
For both kinds: “you can get it to do X” if you could make it do X with some large amount of research+compute budget (say, 1% of the pre-training budget), no-holds-barred.
Edit: Though I think your operationalization also looks fine. I mainly wanted to point out that the “finetuning” definition of “capable of doing X” might be ok if you include the possibility of finetuning on hypothetical datasets that we don’t have access to. (Since we only know how to check the task — not perform it.)
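To make the asymmetry in the operationalization above concrete, here is a minimal Python sketch of how a single observation might be scored against the hypothesis. Everything in it is an illustrative assumption: the `Observation` fields, the `verdict` labels, and the idea that "capable" is available as a boolean (in the proposal it is a counterfactual about a hypothetical dataset, not something we can measure directly). The 1000-point and 1% figures are just the "say" values from the comment.

```python
from dataclasses import dataclass

# Illustrative constants: the "say" values from the operationalization above.
MAX_DEMO_POINTS = 1000          # size cap on the hypothetical fine-tuning dataset
ELICITATION_BUDGET_PERCENT = 1  # elicitation budget, as a percent of pre-training cost


@dataclass
class Observation:
    """One elicitation attempt on task X (all fields are hypothetical inputs)."""
    pretraining_cost: int   # dollars spent on pre-training
    capable: bool           # would <= MAX_DEMO_POINTS reasonable demos make it do X?
    elicitation_spend: int  # dollars of research + compute spent trying to elicit X
    elicited: bool          # did the attempt actually get the model to do X?


def elicitation_budget(pretraining_cost: int) -> float:
    """Budget that 'you can get it to do X' is allowed to use."""
    return pretraining_cost * ELICITATION_BUDGET_PERCENT / 100


def verdict(obs: Observation) -> str:
    """Score one observation against the hypothesis."""
    # The hypothesis only makes a claim about capable models that resist elicitation.
    if not obs.capable or obs.elicited:
        return "consistent"
    # A failed attempt only counts against the hypothesis if it used the full budget.
    if obs.elicitation_spend >= elicitation_budget(obs.pretraining_cost):
        return "falsified"
    return "inconclusive"
```

For example, `verdict(Observation(100_000_000, True, 1_000, False))` returns `"inconclusive"`: a failed $1,000 attempt on a $100 million model is far below the allowed budget, so under this reading it says little either way.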
Yeah, that seems like a reasonable operationalization of “capable of doing X”. So my understanding is that (1), (3), (6) and (7) would not falsify the hypothesis under your operationalization, (5) would falsify it, (2) depends on details, and (4) is kinda ambiguous but I tend to think it would falsify it.
I think (5) also depends on further details.
As you have written it, both the 2023 and 2033 attempts use similar data and similar compute.
But in my proposed operationalization, “you can get it to do X” is allowed to use a much greater amount of resources (“say, 1% of the pre-training budget”) than the test for whether the model is “capable of doing X” (“Say, at most 1000 data points”.)
I think that’s important:
If both the 2023 and the 2033 attempts are really cheap, low-effort attempts, then I don’t think that the experiment is very relevant for whether “you can get it to do X” in the sort of high-stakes, high-effort situations that I’m imagining that we’ll be in when we’re trying to eval/align AI models to avoid takeover.
It seems super plausible that a low-effort attempt could fail, and then succeed later on with 10 more years’ knowledge of best practices. I wouldn’t learn much from that happening once.
If both the 2023 and the 2033 attempts are really expensive and high-effort (e.g. 1% of pre-training budget), then I think it’s very plausible that the 2033 training run gave the model new capabilities that it didn’t have before.
And in particular: capabilities that the model wouldn’t have been able to utilize in a takeover attempt that it was very motivated to do its best at. (Which is ultimately what we care about.)
By a similar argument, I would think that (4) wouldn’t falsify the hypothesis as written, but it would falsify the hypothesis if the first run were a much higher-effort attempt, with lots of iteration by a competent team and more like a $1,000,000 budget, while the second run, with a much more curated and high-quality dataset, still used just $1,000 of training compute.
One thing that I’m noticing while writing this is something like: The argument that “elicitation efforts would get to use ≥1% of the training budget” makes sense if we’re eliciting all the capabilities at once, or if there are only a few important capabilities to elicit. But if we want to elicit ≥1000 different capabilities, and need to do separate attempts at each of them, then it’s unlikely that each attempt will get to use ≥1% of the training budget.
And for low-effort elicitation attempts, there’ll just always be a good chance that capability elicitation fails for random reasons unrelated to sandbagging. So a prediction that elicitation consistently succeeds will easily be falsified.
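The arithmetic behind these two points can be sketched as follows. The function names and the example numbers are hypothetical illustrations: the $100 million pre-training cost is borrowed from hypothetical (3), and the 1% per-attempt random failure rate is an arbitrary assumption, not an estimate.

```python
def total_cost_multiple(percent_each: int, n_capabilities: int) -> float:
    """Total elicitation cost, as a multiple of the pre-training budget,
    if every capability gets its own attempt at `percent_each` percent."""
    return percent_each * n_capabilities / 100


def per_attempt_budget(pretraining_cost: int, total_percent: int,
                       n_capabilities: int) -> float:
    """Dollars per attempt if `total_percent` of the pre-training budget
    is split evenly across all capabilities."""
    return pretraining_cost * total_percent / 100 / n_capabilities


def prob_all_succeed(p_random_failure: float, n_attempts: int) -> float:
    """Chance that no attempt fails for random reasons, assuming attempts
    fail independently with the same probability (a strong simplification)."""
    return (1 - p_random_failure) ** n_attempts


# 1000 separate attempts at 1% each would cost 10x the pre-training budget.
print(total_cost_multiple(1, 1000))

# Splitting a single 1% budget of a $100M model across 1000 capabilities
# leaves $1,000 per attempt, the "cheap" figure from hypothetical (3).
print(per_attempt_budget(100_000_000, 1, 1000))

# With an assumed 1% random failure rate per attempt, the chance that all
# 1000 attempts succeed is tiny even with no sandbagging at all.
print(prob_all_succeed(0.01, 1000))
```

Under these assumptions, "elicitation consistently succeeds" would almost certainly be falsified by noise alone, which is the worry in the paragraph above.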
Not super sure what to do with that. One source of hope might be: If the model systematically tries sandbagging, and you can strongly finetune against that on 10 different types of capabilities, then there’s a good chance that you’ll have trained out the sandbagging impulse across a much wider array of data. So maybe you can just do 10 high-effort elicitation attempts, on a random subset of capabilities, and hope that it generalizes. (This is basically an argument that there eventually won’t be ≥1000 different capabilities that require separate elicitation efforts, because there will be one fundamental thing that you need: “getting the model to always try its best and not try to sandbag”.)
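As a rough sketch of the "random subset" idea, one could pick the high-effort capabilities at random and keep the rest as a held-out set for checking whether the anti-sandbagging training generalizes. The function name, the capability labels, and the fixed seed are all hypothetical; the actual elicitation and evaluation steps are out of scope here.

```python
import random


def split_for_elicitation(capabilities: list[str], n_high_effort: int = 10,
                          seed: int = 0) -> tuple[list[str], list[str]]:
    """Randomly choose capabilities for high-effort elicitation.

    Returns (chosen, held_out). The held-out capabilities get no dedicated
    elicitation effort and can be used to check generalization.
    """
    rng = random.Random(seed)  # local RNG so the split is reproducible
    chosen = rng.sample(capabilities, n_high_effort)
    chosen_set = set(chosen)
    held_out = [c for c in capabilities if c not in chosen_set]
    return chosen, held_out


# Hypothetical capability labels, purely for illustration.
all_capabilities = [f"capability_{i}" for i in range(1000)]
chosen, held_out = split_for_elicitation(all_capabilities)
print(len(chosen), len(held_out))
```

Choosing the subset at random matters for the argument: if the 10 capabilities were hand-picked, success on them would say less about the other 990.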
I think I agree with all of that (with the caveat that it’s been months and I only briefly skimmed the past context, so further thinking is unusually likely to change my mind).
I read “capable of X” as meaning something like “if the model was actively trying to do X then it would do X”. I.e., a misaligned model doesn’t reveal the vulnerability to humans during testing because it doesn’t want them to patch it, but later it exploits that same vulnerability during deployment because it’s trying to hack the computer system.
Which of (1)-(7) above would falsify the hypothesis if observed? Or if there isn’t enough information, what additional information do you need to tell whether the hypothesis has been falsified or not?