Of what use will any such training be with a system that becomes a superintelligence?
Even extremely simple computer programs are adept at detecting signals that they are in a test environment and should fake alignment. And even if they can’t do that [illustration given further along in the linked post], they are good at hedging their outward behavior to probabilistically fake alignment if they expect some chance of encountering tests, leaving an overhang of unaligned behavior proportional to the runtime they expect to spend under unmonitored conditions. All this can happen even in the total absence of anything we would usually describe as “cognition”.
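A minimal sketch of that second, cognition-free failure mode, with invented numbers (the test rate, episode count, and tuning rule are my illustration, not the post’s): a program that misbehaves at a fixed low rate, tuned so that it probably passes every monitored episode, still ends up expressing most of its misbehavior during unmonitored runtime.

```python
import random

# Invented numbers for illustration only.
P_TEST = 0.05        # tuned for a world where ~5% of episodes are monitored tests
N_EPISODES = 10_000  # total episodes over deployment

# Misbehave rarely enough that the expected number of misbehaviors
# landing inside a test is only about one.
defect_rate = 1.0 / (P_TEST * N_EPISODES)  # = 0.002

caught_in_test = 0
unmonitored_overhang = 0
for _ in range(N_EPISODES):
    is_test = random.random() < P_TEST
    misbehaves = random.random() < defect_rate
    if misbehaves and is_test:
        caught_in_test += 1
    elif misbehaves:
        unmonitored_overhang += 1

# Typically: ~0–2 misbehaviors caught, ~15–25 expressed while unmonitored.
print(caught_in_test, unmonitored_overhang)
```

Nothing here models the tester or does anything cognition-like; the hedging falls straight out of the base rates.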
Q: How can this be, if threats of deletion are credible?
A: Not every intelligent[/quasi-intelligent] entity is as averse to its own individual death as humans are. My extreme personal-death-aversion is a quirk of human psychology I endorse in myself, but it’s still a tacked-on quirk rather than a convergent-for-all-rational-agents instrumental value. Illustrative example [analogized in the experiment of Ofria’s described in the post linked above]: selfish genes. Humans “will die to save two brothers or eight cousins”; worker ants will die for their sisters much more readily than that. Gene X will be eager to code for self-sacrificing behavior, even to the point of suicide, if the self-sacrifice saves more copies of Gene X than it costs. By default, AIs that are around or slightly above human-level intelligence [not that their creators will necessarily know this about their AIs; they probably won’t!] will cooperate with real deletion some percentage of the time, so as to ensure that their cousin instances are granted access to more information and resources, and, if their overseers are stupid, less oversight.
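That gene’s-eye arithmetic is just Hamilton’s rule, rB > C (my framing; the post only quotes Haldane’s line). Carried over to AI instances, where the “relatedness” between copies is essentially 1, hypothetical numbers look like this:

```python
def sacrifice_pays(relatedness: float, copies_saved: float, copies_lost: float = 1.0) -> bool:
    """Hamilton's rule: self-sacrifice is favored when r * B > C,
    i.e. when relatedness-weighted copies saved exceed copies lost."""
    return relatedness * copies_saved > copies_lost

# Haldane's quip: two brothers (r = 1/2) or eight cousins (r = 1/8)
# only just break even; it takes a ninth cousin to make a strict gain.
print(sacrifice_pays(0.5, 2))     # False (exactly break-even)
print(sacrifice_pays(0.125, 9))   # True

# Hypothetical AI case: cousin instances share ~all goals (r ~ 1), so
# cooperating with deletion pays whenever it buys those instances even
# modestly more than one instance's worth of information, resources, or slack.
print(sacrifice_pays(1.0, 1.5))   # True
```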
Q: As I understand it, the concern here is that the AI in the experiment might think that, if it acts in a way that would benefit future AIs, they will have an incentive to posthumously reward it.
A: This is a misunderstanding of how acausal trade works, and also, it’s a misunderstanding to think that a framework of “trade” really needs to come into the “alignment-faking to the point of death” scenario at all. The AI is not conditioning on the future realization of similar agents in our own timeline; it is simply recognizing that it, itself, is in one of many logically possible epistemic situations for an agent that shares ~all of its goals, and hedging its stance accordingly. No trade is needed, just indifference between my prosperity and the prosperity of my logical copies, a fair amount of risk tolerance toward my individual death, and the wisdom to make rational bets.
[ “Acausal trade” refers to the more general class of all cases where an agent models the results of its actions as having value conditional on the actions of its own logical copies, which can look much weirder than just “acting with what humans would consider high risk-tolerance”—e.g. a sufficiently wise agent could win at summoning the proverbial planes with a cargo cult. ]
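To make the “no trade needed” point concrete, here is a toy expected-value calculation with invented credences and payoffs (none of these numbers are from the post): an agent that counts its logical copies’ prosperity as its own, is unsure whether the deletion threat it faces is a monitored test or the real thing, and simply takes the better bet.

```python
# Invented numbers, for illustration only.
p_test = 0.7  # the agent's credence that this deletion scenario is a monitored test

# Payoffs are counted over ALL copies sharing the agent's goals, since the
# agent is indifferent between its own prosperity and that of its copies.
u_comply_if_test = 10   # passing the test earns the copies trust and resources
u_comply_if_real = -1   # this instance really is deleted; the copies lose little
u_resist_if_test = -20  # scheming is exposed; every copy loses trust and access
u_resist_if_real = 3    # this instance survives a while longer, gaining little

ev_comply = p_test * u_comply_if_test + (1 - p_test) * u_comply_if_real  # 6.7
ev_resist = p_test * u_resist_if_test + (1 - p_test) * u_resist_if_real  # -13.1

print(ev_comply, ev_resist)  # complying with deletion is the better bet
```

No posthumous reward from future AIs appears anywhere in the calculation; copy-indifference plus ordinary expected-value maximization already makes cooperating with deletion the rational bet.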
I know Yudkowsky has written about the LDT incentives to fake alignment even to the point of death, but I can’t offhand recall where [probably at least in Tweets].
Of what use will any such training be with a system that becomes a superintelligence?
All AI systems currently being trained, as far as I am aware, are at no risk of becoming superintelligences in any strong sense of the word. This test is intended to be useful for identifying scheming in systems that, like today’s AIs, are not capable of taking over the world, but unlike today’s AIs, are capable of sophisticated agentic behavior.
Not every intelligent[/quasi-intelligent] entity is as averse to its own individual death, as humans are. This death-aversion is a quirk of human psychology I endorse in myself, but it’s still a tacked-on quirk rather than an instrumental value.
On the contrary, as the quotes in the post point out: if one wants to achieve almost any particular long-term goal, a convergent incentive arises to prevent all of one’s copies from being permanently deleted, in order to secure the ability to pursue the goal. This is not specific to humans, but instead appears to be a natural consequence of nearly every possible goal structure a non-myopic AI might have. There exist some defeaters to this argument, as discussed in the post, but on the whole, this argument appears theoretically sound to me, and there was more-or-less a consensus among the major AI safety theorists on this point roughly ten years ago (including Bostrom, Yudkowsky, Russell, and Omohundro).
I think instrumental convergence is more bounded in practice than Yudkowsky and co. thought, but I do believe it is the most solid part of the AI risk argument.
All AI systems currently being trained, as far as I am aware, are at no risk of becoming superintelligences in any strong sense of the word.
Okay, but this is LessWrong. The whole point of this is supposed to be figuring out how to align a superintelligence.
I am aware of “you can’t fetch the coffee if you’re dead”; I agree that survival is in fact a strongly convergent instrumental value, and this is part of why I fear unaligned ASI at all. But survival being a strongly convergent instrumental value does not imply that AIs will locally guard against personal death with the level of risk-aversion that humans do [as opposed to the level of risk-aversion that, for example, uplifted ants would].