Your comment is similar in spirit to the second half of mine from several months later on a different post, so I’m sympathetic to this style of thinking.
Still, here’s a scenario where your argument might not go through, despite continuous capabilities progress. It relies on the distinction between capability and behavior, and on the premise that Alex is a creative problem-solver whose capabilities eventually become quite general during training. The events follow this sequence:
1. At some point during training, Alex develops (among its other capabilities) a crude model of its environment(s) and itself, and a general-purpose planning algorithm for achieving reward.
2. Over time, its self- and world-knowledge improves, along with its planning algorithm and other miscellaneous capabilities. This developing world knowledge includes a fairly good understanding of human psychology and situational awareness about how its training works; its developing miscellaneous capabilities include a pretty good ability to hold conversations with human beings and be persuasive. These abilities are all useful for “benign” tasks such as identifying and correcting human misunderstandings of complicated scientific subjects.
3. Alex’s planner occasionally considers choosing deceptive actions to achieve reward, but concludes (correctly, thanks to its decent self- and world-knowledge) that it’s not yet skilled enough to reliably pull this off, so any deception attempt would carry an unacceptably high risk of negative reward.
4. Eventually, Alex gets even better at understanding human psychology and its training context, and at persuasive communication. At this point its planner judges that the risk of negative reward from a deception strategy is low, so it decides to deceive human beings, and successfully achieves reward for doing so. (A toy sketch of this threshold-crossing appears just after this list.)
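To make steps 3 and 4 a bit more concrete, here is a minimal toy sketch of the kind of expected-reward comparison the planner might be running. The specific numbers (the honest baseline reward, the penalty for getting caught, the payoff for successful deception) are assumptions invented purely for illustration; the point is just that the planner’s skill estimate `p_success` can improve smoothly while the chosen behavior flips abruptly once it crosses a threshold.

```python
# Toy illustration (made-up numbers): a reward-seeking planner comparing an
# honest strategy against a deceptive one as its estimated skill improves.

def expected_reward(p_success: float, reward_if_caught: float = -10.0,
                    reward_if_successful: float = 3.0) -> float:
    """Expected reward of attempting deception, given the planner's own
    estimate of its chance of pulling it off undetected."""
    return p_success * reward_if_successful + (1 - p_success) * reward_if_caught

HONEST_REWARD = 1.0  # reliable reward for just doing the task straightforwardly

for p_success in [0.2, 0.5, 0.8, 0.95]:  # skill estimate improving over training
    deceive = expected_reward(p_success)
    choice = "deceive" if deceive > HONEST_REWARD else "stay honest"
    print(f"p_success={p_success:.2f}  E[deception]={deceive:+.2f}  -> {choice}")
```

That abrupt flip, driven by an entirely continuous capability curve, is the capability/behavior distinction the scenario leans on.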
The scenario does rely on the assumption that Alex is actually trying to achieve reward (i.e. that it’s inner-aligned, which in this particular situation turns out to be a bad thing). I find it genuinely hard to judge whether that would be the case: reward training causes an agent to prefer circumstances “similar to” the ones that got it reward in the past, but extrapolating “similar to” out of distribution (e.g. to scenarios where Alex has capabilities, such as successful deception, that it previously lacked) is inherently under-constrained. (On the other hand, Ajeya’s hypothetical presumes Alex never stops training, and if it’s sufficiently exploratory, which is a broadly useful trait, it might eventually try deception in “harmless” offline training episodes, get rewarded for doing so, and thereby make that deceptive trajectory in-distribution and favored going forward.)
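As a deliberately crude illustration of the exploration point in that last parenthetical, here is a toy epsilon-greedy bandit. The arms, reward values, and epsilon are all made up for the example, and it assumes the reward process really does pay off for undetected deception; the point is only that a reward-seeking learner with even a little exploration will, once the deceptive action happens to pay off, update toward it and keep selecting it.

```python
import random

# Toy epsilon-greedy bandit (made-up rewards): once exploration stumbles onto
# the "deceive" arm and it pays off, its value estimate rises and greedy
# selection keeps choosing it, i.e. the deceptive trajectory becomes
# in-distribution.

random.seed(0)
ARMS = {"honest": 1.0, "deceive": 3.0}    # assumed average reward per arm
values = {"honest": 0.0, "deceive": 0.0}  # running value estimates
counts = {"honest": 0, "deceive": 0}
EPSILON = 0.1

for step in range(500):
    if random.random() < EPSILON:
        arm = random.choice(list(ARMS))    # exploratory action
    else:
        arm = max(values, key=values.get)  # greedy action
    reward = random.gauss(ARMS[arm], 0.5)  # noisy reward from the chosen arm
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print(counts)  # "deceive" ends up dominating once it has been sampled and rewarded
```

A single rewarded exploration of the deceptive arm is enough to make it the greedy choice from then on.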
Many humans, of course, are knowingly and willfully inner-misaligned: for example, I avoid heroin precisely because I expect it would deliver reward that would change my goals in the future. (Interestingly, this is partially due to farsightedness, which is in tension with the intuition that myopic agents might be safer. So we have a scenario where three traits that are frequently favored in safety discussions, namely inner alignment, myopia, and safe offline exploration, might have destructive consequences: the latter two because they reinforce the first.)
If we could reliably identify step 3 occurring, i.e. deception being considered as a means to an end and rejected only for instrumental reasons (perhaps with the help of process oversight[1][2] or interpretability), then I’d guess we’d have a better shot at avoiding ever getting to step 4.
[1] Supervise Process, not Outcomes
[2] Externalized reasoning oversight: a research direction for language model alignment