I am one of the people who upvoted your comment but disagreement-downvoted it. I think you are unfairly attempting to shift the burden of proof here: “Your argument requires the assumption of ‘malign priors’—that is, a highly capable AI rates dangerous goal-directed behaviour highly enough a priori to converge to this behaviour through training.”
It’s more like, “One way this argument could be wrong is if the surprising hypothesis of ‘benign priors’ were true—that is, powerful goal-directed behavior is extremely low-prior in the learning algorithm, such that the training process can’t find this strategy/behavior/policy even though it would in fact lead to higher reward.”
Why would ordinary nondeceptive goodharting be an insurmountable problem?
So I think we agree that the assumption is required. I don’t fully agree with your summary: it’s not that it doesn’t find the behaviour, it’s that it doesn’t prefer the behaviour, and the reward bonus isn’t enough to shift its preference.
Here are two defences of the malign priors assumption:
If we assume that a powerful AI’s behaviour can be described by some simplicity prior over objectives, then deceptive behaviour is likely.
By an informal count, there are more deceptive goals than nondeceptive ones.
The counting argument is really just another measure argument—deceptive goals outnumber nondeceptive ones by enough that “most” priors over goals will give them a lot more weight.
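To make the measure claim slightly more concrete, here is a toy way of writing it down (the notation is mine and nothing here is standard; it assumes behaviour really is summarised by a goal g drawn from a simplicity prior, with K(g) the description length of g):

```latex
% Toy statement of the measure argument (my notation, not standard).
% D = goals that yield deceptively aligned behaviour during training,
% N = goals that yield genuinely aligned behaviour.
P(\text{deceptive}) = \sum_{g \in D} 2^{-K(g)},
\qquad
P(\text{aligned}) = \sum_{g \in N} 2^{-K(g)}.
% If D contains far more goals than N at every description length,
% then P(deceptive) \gg P(aligned), and the conclusion survives most
% perturbations of the prior -- which is the "most priors" claim above.
```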
Now, you might think these arguments are really solid, but I think it’s important to recognise their limitations. First: AIs learn behaviours, not goals. A “natural prior” over behaviours that produces good behaviour at low levels of capability might look like a strange prior over goals. The observation that an advanced AI must act in ways that look goal-directed doesn’t contradict this—the fact that you sometimes look goal-directed does not imply that, once all things are considered, your goals don’t end up looking very strange.
Second: the design of AIs is partly constrained by mathematical convenience, but within those constraints people are going to pick designs that seem to work well. Now, deception is not the same as “seeming to do well”. Seeming to do well requires that similar but less capable models successfully carry out simpler tasks. The prior for the potentially deceptive model is therefore chosen by iterating on the design of nondeceptive models. This is probably, from most points of view, a weird prior! It is not clear to me that the objective-counting argument is relevant here—it might be, but it might not be.
Third: the most impressive AI systems we have today do not operate according to reinforcement learning on a mathematically convenient prior. The prior employed by a reinforcement learner built on top of a large language model is not mathematically convenient; rather, it’s some kind of approximation of the distribution of texts that people produce.
The point about nondeceptive goodharting: suppose we have some training environment and a reward signal suitable for training an AI (for no particular reason, I am thinking about “self-driving cars” and “passenger star ratings”). Suppose we have an AI not good enough to be effectively deceptive. We can consider two classes of behaviour: A, aligned behaviour that gets good reward, and B, obviously misaligned behaviour that gets good reward. My guess is that B≫A. We want our cars to go for good ratings while obeying a whole lot of side constraints—road rules, picking up passengers fairly, not cheating the system, etc. If we have an AI whose eventual behaviour is settled by counting arguments, I think we get a really bad taxi.
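As a deliberately crude illustration of the B≫A guess (every number below is invented, and real behaviour spaces are nothing like bit strings), picture behaviours made of a handful of binary choices, where the reward signal only sees the star rating:

```python
import itertools

# Toy model of the taxi example: a "behaviour" is a tuple of binary choices.
# The reward only sees whether the passenger gives five stars; the side
# constraints (road rules, fair pickups, not gaming the ratings, ...) are
# invisible to it. The constraint count is arbitrary.
N_CONSTRAINTS = 8

high_reward = []
for five_stars, *violations in itertools.product([0, 1], repeat=1 + N_CONSTRAINTS):
    if five_stars:                     # reward only checks this one bit
        high_reward.append(violations)

aligned = [v for v in high_reward if not any(v)]   # class A: no violations
misaligned = [v for v in high_reward if any(v)]    # class B: at least one

print(f"high-reward behaviours: {len(high_reward)}")   # 256
print(f"  aligned (A):          {len(aligned)}")       # 1
print(f"  misaligned (B):       {len(misaligned)}")    # 255
```

The only point of the toy is that a prior which just counts behaviours puts almost all of its high-reward mass on B.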
Now, maybe these problems can be dealt with by putting a lot more effort into the reward signal (penalising road-rule breaking, adding fares as well as star ratings, penalising attempts to cheat in every way you can imagine...). This would, at a minimum, entail a lot more effort than business-as-usual reinforcement learning, and my guess is that if behaviour-counting arguments still apply then it flat out wouldn’t work. That’s what I mean by “insurmountable”.
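Concretely, the extra effort I am imagining looks something like this sketch (every field name and weight is invented for illustration; it is not a claim about how anyone actually trains cars):

```python
from dataclasses import dataclass

@dataclass
class Episode:
    """Hypothetical per-trip statistics; every field here is made up."""
    star_rating: float          # the original business-as-usual signal
    road_rule_violations: int   # rule breaking we managed to detect
    fares_collected: float      # did it actually do the job?
    detected_cheating: int      # attempts to game the ratings that we noticed

def shaped_reward(ep: Episode) -> float:
    """Star ratings plus a hand-written penalty for every failure mode we
    could think of. The weights are arbitrary; the point is how much manual
    specification this takes compared to plain star ratings."""
    return (ep.star_rating
            - 10.0 * ep.road_rule_violations
            + 0.1 * ep.fares_collected
            - 50.0 * ep.detected_cheating)
```

And even a list like this only covers the failure modes we thought of in advance.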
Alternatively, maybe we deal with these problems by picking a prior that promotes A relative to B. In fact, this seems to be a more realistic way of constructing a self-driving taxi that gets good passenger ratings—first make it a safe car, then adjust its behaviour (within limits!) to get better ratings from passengers.
Now, it’s possible that even though we solve the B≫A problem with better priors, at higher capability the set C of objectives that yield deceptively misaligned behaviour outnumbers B by so much that the better priors still don’t help. However, I think this is once again speculative, and if it’s an assumption underpinning your argument you need to say so.
(Sure, in some sense we agree that the assumption is required; I just think that’s a misleading way of putting it, but whatever.)
Thank you for the detailed and lengthy explanation! I probably agree with your first point; it seems similar to what the shard theory people are exploring, and yes, this is a promising line of research which may, if we are lucky, overturn the default hypothesis that misaligned-but-deceptive AIs are most likely. I’d say something similar about the second point. Both points basically say “we don’t know what the prior is like”, so sure, but they aren’t positive arguments that the prior will be benign. I’m not sure whether I agree with the third point, but it too just seems to be a warning that we are ignorant about the prior, not an argument that the prior is benign.
I don’t think I understand your more detailed argument that begins with “the point about nondeceptive goodharting”. I’m tired now, so I will go away, but hopefully I will return and try to think more deeply about it. I strongly encourage you to write up a post on it, with emphasis on clarity. I really hope you are right!