For now, my main reaction is: “we have active evidence that SGD’s inductive biases disfavor schemers” seems like a much more interesting claim/avenue of inquiry than trying to nail down the a priori philosophical merits of counting arguments/indifference principles, and if you believe we have that sort of evidence, I think it’s probably most productive to just focus on fleshing it out and examining it directly.
The vast majority of evidential labor is done in order to consider a hypothesis at all.
Humans under selection pressure—e.g. test-takers, job-seekers, politicians—will often misrepresent themselves and their motivations to get ahead. That very basic fact that humans do this all the time seems like sufficient evidence to me to consider the hypothesis at all (though certainly not enough evidence to conclude that it’s highly likely).
I don’t think that’s enough. Lookup tables can also be under “selection pressure” to output good training outputs. As I understand your reasoning, the analogy is too loose to be useful here. I’m worried that the phrase “selection pressure” is obscuring the logical structure of your argument: as I’m sure you’ll agree, calling both the human situation and SGD “selection pressure” doesn’t mean the two are meaningfully related.
I agree that “sometimes humans do X” is a good reason to consider whether X will happen, but you really do need shared causal mechanisms. If I examine the causal mechanisms here, I find things like “humans seem to have ‘parameterizations’ which already encode situationally activated consequentialist reasoning”, and then I wonder “will AI develop similar cognition?”, and then that’s the whole thing I’m trying to answer to begin with. So the fact you mention isn’t evidence for the relevant step in the process (the step where the AI’s mind-design is selected to begin with).
Do you believe that AI systems won’t learn to use goal-directed consequentialist reasoning even if we train them directly on outcome-based goal-directed consequentialist tasks? Or do you think we won’t ever do that?
If you do think we’ll do that, then that seems like all you need to raise that hypothesis into consideration. Certainly it’s not the case that models always learn to value anything like what we train them to value, but it’s obviously one of the hypotheses that you should be seriously considering.
Your comment is switching the hypothesis being considered. As I wrote elsewhere:

Seems to me that a lot of (but not all) scheming speculation is just about sufficiently large pretrained predictive models, period. I think it’s worth treating these cases separately. My strong objections are basically to the “and then goal optimization is a good way to minimize loss in general!” steps.

If the argument for scheming is “we will train them directly to achieve goals in a consequentialist fashion”, then we don’t need all this complicated reasoning about UTM priors or whatever.
I’m not sure where it was established that what’s under consideration here is just deceptive alignment in pre-training. Personally, I’m most worried about deceptive alignment coming after pre-training. I’m on record as thinking that deceptive alignment is unlikely (though certainly not impossible) in purely pretrained predictive models.
Sorry, I do think you raised a valid point! I had read your comment in a different way.
I think I want to have said: aggressively training AI directly on outcome-based tasks (“training it to be agentic”, so to speak) may well produce persistently-activated inner consequentialist reasoning of some kind (though not necessarily the flavor historically expected). I most strongly disagree with arguments which behave the same for a) this more aggressive curriculum and b) pretraining, and I think it’s worth distinguishing between these kinds of argument.
Sure—I agree with that. The section I linked from Conditioning Predictive Models actually works through, at least to some degree, how I think simplicity arguments for deception go differently for purely pre-trained predictive models.
FWIW, I agree that if powerful AI is achieved via pure pre-training, then deceptive alignment is less likely, but this “the prediction goal is simple” argument seems very wrong to me. We care about the simplicity of the goal in terms of the world model (which will surely be heavily shaped by the importance of various predictions), and I don’t see any reason why things like close proxies of reward in RL training wouldn’t be just as simple for those models.
Interpreted naively, it seems like this goal-simplicity argument implies that it matters a huge amount how simple your data collection routine is. (Simple to whom?) For instance, this argument implies that collecting data from a process such as “all outlinks from reddit with >3 upvotes” makes deceptive alignment considerably less likely than a process like “do whatever messy thing AI labs do now”. This seems really, really implausible: surely AIs won’t be doing much explicit reasoning about these details of the process, because this will clearly be effectively hardcoded in a massive number of places.
Evan and I have talked about these arguments at some point.
(I need to get around to writing a review of Conditioning Predictive Models which makes these counterarguments.)
I followed this exchange up until here and now I’m lost. Could you elaborate or paraphrase?
The point of that part of my comment was that insofar as part of Nora/Quintin’s response to the simplicity argument is to say that we have active evidence that SGD’s inductive biases disfavor schemers, this seems worth just arguing for directly, since even if e.g. counting arguments were enough to get you worried about schemers from a position of ignorance about SGD’s inductive biases, active counter-evidence absent such ignorance could easily make schemers seem quite unlikely overall.
There’s a separate question of whether e.g. counting arguments like mine above (e.g., “A very wide variety of goals can prompt scheming; By contrast, non-scheming goals need to be much more specific to lead to high reward; I’m not sure exactly what sorts of goals SGD’s inductive biases favor, but I don’t have strong reason to think they actively favor non-schemer goals; So, absent further information, and given how many goals-that-get-high-reward are schemer-like, I should be pretty worried that this model is a schemer”) do enough evidential labor to privilege schemers as a hypothesis at all. But that’s the question at issue in the rest of my comment. And in e.g. the case of “there are 1000 Chinese restaurants in this city, and only ~100 non-Chinese restaurants,” the number of Chinese restaurants seems to me like it’s enough to privilege “Bob went to a Chinese restaurant” as a hypothesis (and this even without thinking that he made his choice by sampling randomly from a uniform distribution over restaurants). Do you disagree in that restaurant case?
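As a toy numeric sketch of the restaurant point (my own illustration, not from the thread, with made-up bias numbers): even if Bob’s choice is not a uniform random draw, the 10:1 count keeps “Chinese restaurant” the leading hypothesis unless his per-restaurant bias toward non-Chinese places is quite large.

```python
# Toy model of the restaurant counting argument (illustrative only).
# 1000 Chinese restaurants vs ~100 non-Chinese ones; suppose Bob is
# k times likelier to walk into any *particular* non-Chinese restaurant.
def p_chinese(k: float) -> float:
    """P(Bob went to a Chinese restaurant) under a k-biased choice."""
    chinese_weight = 1000 * 1.0
    non_chinese_weight = 100 * k
    return chinese_weight / (chinese_weight + non_chinese_weight)

for k in (1, 2, 5, 10):
    print(f"bias k={k}: P(Chinese) = {p_chinese(k):.3f}")
# k=1 -> 0.909, k=2 -> 0.833, k=5 -> 0.667, k=10 -> 0.500
```

Only at a 10x per-restaurant bias does the hypothesis fall to a coin flip, which is the sense in which sheer counts can privilege a hypothesis even absent a uniform-sampling (indifference) assumption.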