Humans under selection pressure (e.g. test-takers, job-seekers, politicians) will often misrepresent themselves and their motivations to get ahead. The very basic fact that humans do this all the time seems to me like sufficient evidence to consider the hypothesis at all (though certainly not enough evidence to conclude that it’s highly likely).
I don’t think that’s enough. Lookup tables can also be under “selection pressure” to output good training outputs. As I understand your reasoning, the analogy is too loose to be useful here. I’m worried that using ‘selection pressure’ is obscuring the logical structure of your argument. As I’m sure you’ll agree, just calling that situation ‘selection pressure’ and SGD ‘selection pressure’ doesn’t mean they’re related.
I agree that “sometimes humans do X” is a good reason to consider whether X will happen, but you really do need shared causal mechanisms. If I examine the causal mechanisms here, I find things like “humans seem to have ‘parameterizations’ which already encode situationally activated consequentialist reasoning”, and then I wonder “will AI develop similar cognition?” and then that’s the whole thing I’m trying to answer to begin with. So the fact you mention isn’t evidence for the relevant step in the process (the step where the AI’s mind-design is selected to begin with).
If I examine the causal mechanisms here, I find things like “humans seem to have ‘parameterizations’ which already encode situationally activated consequentialist reasoning”, and then I wonder “will AI develop similar cognition?” and then that’s the whole thing I’m trying to answer to begin with.
Do you believe that AI systems won’t learn to use goal-directed consequentialist reasoning even if we train them directly on outcome-based goal-directed consequentialist tasks? Or do you think we won’t ever do that?
If you do think we’ll do that, then that seems like all you need to raise that hypothesis into consideration. Certainly it’s not the case that models always learn to value anything like what we train them to value, but it’s obviously one of the hypotheses that you should be seriously considering.
Your comment is switching the hypothesis being considered. As I wrote elsewhere:
Seems to me that a lot of (but not all) scheming speculation is just about sufficiently large pretrained predictive models, period. I think it’s worth treating these cases separately. My strong objections are basically to the “and then goal optimization is a good way to minimize loss in general!” steps.
If the argument for scheming is “we will train them directly to achieve goals in a consequentialist fashion”, then we don’t need all this complicated reasoning about UTM priors or whatever.
I’m not sure where it was established that what’s under consideration here is just deceptive alignment in pre-training. Personally, I’m most worried about deceptive alignment coming after pre-training. I’m on record as thinking that deceptive alignment is unlikely (though certainly not impossible) in purely pretrained predictive models.
Sorry, I do think you raised a valid point! I had read your comment in a different way.
I think I want to have said: aggressively training AI directly on outcome-based tasks (“training it to be agentic”, so to speak) may well produce persistently-activated inner consequentialist reasoning of some kind (though not necessarily the flavor historically expected). I most strongly disagree with arguments which behave the same for a) this more aggressive curriculum and b) pretraining, and I think it’s worth distinguishing between these kinds of argument.
Sure—I agree with that. The section I linked from Conditioning Predictive Models actually works through at least to some degree how I think simplicity arguments for deception go differently for purely pre-trained predictive models.
FWIW, I agree that if powerful AI is achieved via pure pre-training, then deceptive alignment is less likely, but this “the prediction goal is simple” argument seems very wrong to me. We care about the simplicity of the goal in terms of the world model (which will surely be heavily shaped by the importance of various predictions) and I don’t see any reason why things like close proxies of reward in RL training wouldn’t be just as simple for those models.
Interpreted naively, it seems like this goal simplicity argument implies that it matters a huge amount how simple your data collection routine is. (Simple to whom?) For instance, this argument implies that collecting data from a process such as “all outlinks from reddit with >3 upvotes” makes deceptive alignment considerably less likely than a process like “do whatever messy thing AI labs do now”. This seems really, really implausible: surely AIs won’t be doing much explicit reasoning about these details of the process because this will clearly be effectively hardcoded in a massive number of places.
Evan and I have talked about these arguments at some point.
(I need to get around to writing a review of Conditioning Predictive Models which makes these counterarguments.)
I followed this exchange up until here and now I’m lost. Could you elaborate or paraphrase?