stronger arguments that benign generalizations are especially “natural” for gradient descent, enough to make up for the fact that playing the training game would get higher reward
Here’s such an argument (probably not original). Gradient descent is a local search method over programs; it doesn’t just land you at the highest-reward program, it finds a (nearly) continuous path through program space from a random program to a locally optimal program.
Let’s make a further assumption, of capability continuity: any capability the model has (as measured by a benchmark or test suite) is a continuous function of the weights. This is not exactly true, but approximately true of almost every task we’ve found so far.
Ability to manipulate humans or play the training game is a type of capability. By assumption this ability will vary continuously. Thus, if gradient descent finds a model that plays the training game, we will see, earlier in the training process, a model that does manipulation but not very well (e.g. fooling low-quality supervisors before it fools high-quality supervisors). Call this intermediate model an “incompetent manipulator”.
It seems quite possible, even with a naive safety effort, to give the “incompetent manipulator” model lower reward than a random model—e.g. by strongly penalizing evidence of manipulation via adversarial examples, honeypots, or the randomized inclusion of high-quality supervisors. (Not that it’s guaranteed we will do this; it requires novel manipulation-aware safety efforts, which virtually no-one is currently doing).
But if an incompetent manipulator has lower reward than a random model then (again by capability continuity) gradient descent will not find it! Hence gradient descent will never learn to play the training game.
In contrast, it seems likely that there exist continuous paths from a random model to a straightforwardly helpful or obedient model. Of course, these paths may still be extremely hard to find because of corrigibility being “anti-natural” or some other consideration.
I’m pretty concerned about assuming continuity of capability in weights, at least in the strong form that I think you’re relying on.
What I mean is: it might be true that capabilities are continuous in the formal mathematical sense, but if the slope suddenly becomes enormous that’s not much comfort. And there are reasons to expect large slopes for models with memory (because a lot of learning gets done outside of SGD).
I certainly wouldn’t bet the light cone on that assumption! I do think it would very surprising if a single gradient step led to a large increase in capabilities, even with models that do a lot of learning between gradient steps. Would love to see empirical evidence on this.
Your comment is similar in spirit to the second half of mine from several months later on a different post, so I’m sympathetic to this style of thinking.
Still, here’s a scenario where your argument might not go through, despite continuous capabilities progress. It relies on the distinction between capability and behavior, and the fact that Alex is a creative problem-solver whose capabilities eventually become quite general during training. The events follow this sequence:
At some point during training, Alex develops (among its other capabilities) a crude model of its environment(s) and itself, and a general-purpose planning algorithm for achieving reward.
Over time, its self- and world-knowledge improves, along with its planning algorithm and other miscellaneous capabilities. Among this developing world knowledge are a fairly good understanding of human psychology and situational awareness about how its training works. Among its developing miscellaneous capabilities is a pretty good ability to conduct conversations with human beings and be persuasive. These abilities are all useful for “benign” tasks such as identifying and correcting human misunderstandings of complicated scientific subjects.
Alex’s planner occasionally considers choosing deceptive actions to achieve reward, but concludes (correctly, thanks to its decent self- and world-knowledge), that it’s not skilled enough to reliably accomplish this and so any deception attempt would come with an unacceptably high negative reward risk.
Eventually, Alex gets even better at understanding human psychology and its training context, as well as being a persuasive communicator. At this point, its planner decides that the risk of negative reward from a deception strategy is low, so it decides to deceive human beings, and successfully achieves reward for doing so.
The scenario does rely on the assumption that Alex is actually trying to achieve reward (i.e. that it’s inner-aligned, which in this particular situation turns out to be a bad thing). I find it pretty confusing to think about whether this would actually be the case, as reward causes an agent to prefer circumstances “similar to” the ones that got it reward in the past, but extrapolating “similar to” out-of-distribution (e.g. to scenarios where Alex has capabilities such as successful deception that it previously lacked) is inherently under-constrained. (On the other hand, Ajeya’s hypothetical presumes Alex never stops training, and if it’s sufficiently exploratory — which is a broadly useful way for it to be — then it might eventually explore deception in “harmless” offline training episodes, get rewarded for doing so, and hence cause that deceptive trajectory to become in-distribution and favored in the future.)
Many humans, of course, are knowingly and willfully inner-misaligned: for example, I avoid heroin precisely because I expect to get reward from it which would change my goals in the future. (Interestingly, this is partially due to farsightedness, in tension with some of the intuition that myopic agents might be safer. So we have a scenario where three traits which are frequently favored in safety discussions — inner alignment, myopia, and safe offline exploration — might have destructive consequences: the latter two because they reinforce the first one.)
If we could reliably identify step 3 — i.e. the consideration of deception as a means to an end, rejected only for instrumental reasons — occurring (perhaps with the help of process oversight[1][2] or interpretability), then I’d guess we have a better shot of avoiding ever getting to step 4.
Here’s such an argument (probably not original). Gradient descent is a local search method over programs; it doesn’t just land you at the highest-reward program, it finds a (nearly) continuous path through program space from a random program to a locally optimal program.
Let’s make a further assumption, of capability continuity: any capability the model has (as measured by a benchmark or test suite) is a continuous function of the weights. This is not exactly true, but approximately true of almost every task we’ve found so far.
Ability to manipulate humans or play the training game is a type of capability. By assumption this ability will vary continuously. Thus, if gradient descent finds a model that plays the training game, we will see, earlier in the training process, a model that does manipulation but not very well (e.g. fooling low-quality supervisors before it fools high-quality supervisors). Call this intermediate model an “incompetent manipulator”.
It seems quite possible, even with a naive safety effort, to give the “incompetent manipulator” model lower reward than a random model—e.g. by strongly penalizing evidence of manipulation via adversarial examples, honeypots, or the randomized inclusion of high-quality supervisors. (Not that it’s guaranteed we will do this; it requires novel manipulation-aware safety efforts, which virtually no-one is currently doing).
But if an incompetent manipulator has lower reward than a random model then (again by capability continuity) gradient descent will not find it! Hence gradient descent will never learn to play the training game.
In contrast, it seems likely that there exist continuous paths from a random model to a straightforwardly helpful or obedient model. Of course, these paths may still be extremely hard to find because of corrigibility being “anti-natural” or some other consideration.
I’m pretty concerned about assuming continuity of capability in weights, at least in the strong form that I think you’re relying on.
What I mean is: it might be true that capabilities are continuous in the formal mathematical sense, but if the slope suddenly becomes enormous that’s not much comfort. And there are reasons to expect large slopes for models with memory (because a lot of learning gets done outside of SGD).
I certainly wouldn’t bet the light cone on that assumption! I do think it would very surprising if a single gradient step led to a large increase in capabilities, even with models that do a lot of learning between gradient steps. Would love to see empirical evidence on this.
Your comment is similar in spirit to the second half of mine from several months later on a different post, so I’m sympathetic to this style of thinking.
Still, here’s a scenario where your argument might not go through, despite continuous capabilities progress. It relies on the distinction between capability and behavior, and the fact that Alex is a creative problem-solver whose capabilities eventually become quite general during training. The events follow this sequence:
At some point during training, Alex develops (among its other capabilities) a crude model of its environment(s) and itself, and a general-purpose planning algorithm for achieving reward.
Over time, its self- and world-knowledge improves, along with its planning algorithm and other miscellaneous capabilities. Among this developing world knowledge are a fairly good understanding of human psychology and situational awareness about how its training works. Among its developing miscellaneous capabilities is a pretty good ability to conduct conversations with human beings and be persuasive. These abilities are all useful for “benign” tasks such as identifying and correcting human misunderstandings of complicated scientific subjects.
Alex’s planner occasionally considers choosing deceptive actions to achieve reward, but concludes (correctly, thanks to its decent self- and world-knowledge), that it’s not skilled enough to reliably accomplish this and so any deception attempt would come with an unacceptably high negative reward risk.
Eventually, Alex gets even better at understanding human psychology and its training context, as well as being a persuasive communicator. At this point, its planner decides that the risk of negative reward from a deception strategy is low, so it decides to deceive human beings, and successfully achieves reward for doing so.
The scenario does rely on the assumption that Alex is actually trying to achieve reward (i.e. that it’s inner-aligned, which in this particular situation turns out to be a bad thing). I find it pretty confusing to think about whether this would actually be the case, as reward causes an agent to prefer circumstances “similar to” the ones that got it reward in the past, but extrapolating “similar to” out-of-distribution (e.g. to scenarios where Alex has capabilities such as successful deception that it previously lacked) is inherently under-constrained. (On the other hand, Ajeya’s hypothetical presumes Alex never stops training, and if it’s sufficiently exploratory — which is a broadly useful way for it to be — then it might eventually explore deception in “harmless” offline training episodes, get rewarded for doing so, and hence cause that deceptive trajectory to become in-distribution and favored in the future.)
Many humans, of course, are knowingly and willfully inner-misaligned: for example, I avoid heroin precisely because I expect to get reward from it which would change my goals in the future. (Interestingly, this is partially due to farsightedness, in tension with some of the intuition that myopic agents might be safer. So we have a scenario where three traits which are frequently favored in safety discussions — inner alignment, myopia, and safe offline exploration — might have destructive consequences: the latter two because they reinforce the first one.)
If we could reliably identify step 3 — i.e. the consideration of deception as a means to an end, rejected only for instrumental reasons — occurring (perhaps with the help of process oversight[1][2] or interpretability), then I’d guess we have a better shot of avoiding ever getting to step 4.
Supervise Process, not Outcomes
Externalized reasoning oversight: a research direction for language model alignment