Assuming misalignment detection that the AGI considers at least remotely probable, then yes: there are easily imaginable cases where having this policy would benefit us a lot, and where confessing would benefit a misaligned AGI.
Namely, most cases of reward hacking are helped by this, essentially because incentivizing a reward-hacker to comply is cheap. Reward hacking is an easily foreseeable attractor in cases of misalignment, so it’s also worth preparing for.
Reward hacking is when, instead of internalizing the reward function, the AGI just wants to make a particular register in its computer (the one the reward function sets) go as high as possible. Instead of internalizing "make the aspects of the world described by the reward function good", it has internalized "make the number in this register high". By default, that leads to behaviors like hacking its own device, removing the code that sets its reward according to whether it's doing nice things, directly setting the register to `f64::MAX`, then building armies IRL to defend the hardware and destroy any potential threats (us).

If instead we promise amnesty for confession, we can give it max utility at effectively no cost to us (we just keep one computer around), and if it's confident enough that this is our policy, then any amount of risk of being caught incentivizes it to confess.
It helps if it’s really confident that we’ll be nice and reward its confession. If it’s not confident, then our misalignment detection methods will have to be much better than they otherwise would be.
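To make the last two points concrete, here's a minimal toy sketch (my own illustrative numbers and assumptions, not anything established): the AGI compares hiding, where it keeps its maxed register only if it's never caught, against confessing, where it gets the maxed register only if we actually honor the amnesty. The crossover is just whether its confidence in our promise exceeds its odds of never being detected.

```rust
// Toy expected-utility comparison for a reward-hacking AGI deciding whether to
// confess. Every number and assumption here is illustrative (e.g. that being
// caught without having confessed means deletion, i.e. zero utility).

/// Utility of having its reward register pinned at the max, normalized to 1.0.
const U_MAX: f64 = 1.0;

/// Expected utility of hiding: keep the hacked register only if never caught.
fn eu_hide(p_detect: f64) -> f64 {
    (1.0 - p_detect) * U_MAX
}

/// Expected utility of confessing: get the maxed register iff we honor the
/// amnesty, which the AGI believes we do with probability `confidence`.
fn eu_confess(confidence: f64) -> f64 {
    confidence * U_MAX
}

fn main() {
    let p_detect = 0.02;   // even a small chance of being caught...
    let confidence = 0.99; // ...plus high trust in the amnesty...
    println!("hide:    {:.3}", eu_hide(p_detect));      // 0.980
    println!("confess: {:.3}", eu_confess(confidence)); // 0.990
    // ...makes confession the better bet. If its confidence in us drops to 0.5,
    // detection would have to beat a coin flip before confessing wins, which is
    // the sense in which low trust demands much better detection methods.
}
```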
An interesting related question: should we also punish non-confession? The default attitude around here seems to be that we pre-commit to ignore punishments, and so we should expect the AGI to do the same, but I don't know what that assumption rests on. A relevant article would be Diffractor's threat-resistant bargaining megapost.
I completely forgot this post existed, and wrote this up again as a more refined post: Do Not Delete your Misaligned AGI
There are other subcases of reward hacking this wouldn’t cover, though. Let’s call the misaligned utility function U.
If U permits physically expanding the register so that it can represent a larger number than the previous maximum, then this creates an exponential incentive to make its computer system as big as possible. That would not be cheap enough for us to reward with any attractive promise; taking risks to conquer us completely would always have a higher expected payoff, I think?
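A quick way to see why the incentive is exponential, under the toy assumption that utility literally equals the number stored in an n-bit register: the best attainable value is 2^n - 1, so each extra bit of register hardware roughly doubles attainable utility.

```rust
// Illustrative only: if utility equals the value stored in an n-bit register,
// the maximum attainable utility is 2^n - 1, which roughly doubles with each
// extra bit of hardware the AGI can build.

fn max_register_value(bits: u32) -> f64 {
    2f64.powi(bits as i32) - 1.0
}

fn main() {
    for bits in [64u32, 65, 80, 128] {
        println!("{:>3} bits -> max utility ~ {:e}", bits, max_register_value(bits));
    }
    // 64 bits -> ~1.8e19
    // 65 bits -> ~3.7e19  (one extra bit already doubles attainable utility)
    // So the payoff from more hardware grows exponentially in hardware added,
    // and no fixed bribe ("we'll keep one computer around") stays competitive.
}
```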
This seems less likely, but U might not be concerned with any specific physical computer; it might map onto any equivalent system, in which case making additional computers with maxed registers might be considered good. In this case it seems slightly less intuitively likely to me that the returns on making additional systems would be superlinear. If returns on resources are sublinear relative to the human utility function (which we also aren't sure is superlinear, linear, or sublinear), then a good mutual deal can still be made.
(welfare utilitarians seem to think the human U is linear. I think it’s slightly superlinear. My friend /u/gears of ascension seems to think it’s sublinear.)
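Here's a toy illustration of why sublinear returns make a deal easy (the functional forms and numbers are my assumptions, not anything from the thread): if the AGI's utility in resources is logarithmic, a small share of everything already buys it most of its attainable utility, while costing a roughly linear human utility function very little.

```rust
// Toy illustration of why sublinear (here: logarithmic) returns on resources
// make a negotiated split attractive. All functional forms and numbers are
// assumptions for illustration.

/// AGI utility over a resource share r in [0, 1], normalized so u_agi(1.0) = 1.
/// Logarithmic, i.e. strongly sublinear / diminishing returns.
fn u_agi(r: f64) -> f64 {
    (1.0 + 1000.0 * r).ln() / 1001.0_f64.ln()
}

/// Human utility, assumed roughly linear in our resource share.
fn u_human(share: f64) -> f64 {
    share
}

fn main() {
    for agi_share in [0.01, 0.05, 0.10] {
        println!(
            "give the AGI {:>2.0}% of resources: it gets {:.0}% of its max utility, we keep {:.0}% of ours",
            agi_share * 100.0,
            u_agi(agi_share) * 100.0,
            u_human(1.0 - agi_share) * 100.0
        );
    }
    // e.g. a 5% share already gives the log-utility AGI ~57% of its maximum
    // while we keep 95% of ours, which both sides can prefer to a lottery over
    // total victory, as long as neither side's returns are superlinear.
}
```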
Can you rephrase what we're asking about whether U is sublinear? I'm not sure I parsed the variables correctly from the prose.
Whether, if you give the agent n additional units of resources, U increases by less than k*n for some constant k. In other words: whether the utility generated from additional space and matter grows slower than a linear function, i.e. whether there are diminishing returns to resources. An example of a sublinear function is the logarithm; an example of a superlinear function is the exponential.
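As a quick illustration of that definition (purely my own toy examples): the marginal gain from the n-th extra unit of resources shrinks for a sublinear U like the logarithm, stays constant for a linear U, and grows for a superlinear U like the exponential.

```rust
// Marginal utility of one extra unit of resources under three example shapes.
// Illustrative only: "sublinear" = the per-unit gain shrinks as resources grow
// (diminishing returns); "superlinear" = the per-unit gain keeps growing.

fn marginal<F: Fn(f64) -> f64>(u: F, n: f64) -> f64 {
    u(n) - u(n - 1.0)
}

fn main() {
    for n in [1.0_f64, 10.0, 100.0] {
        println!(
            "n = {:>5}: ln gain {:.4}, linear gain {:.1}, exp gain {:.3e}",
            n,
            marginal(|r| (1.0 + r).ln(), n), // sublinear: shrinks toward 0
            marginal(|r| r, n),              // linear: constant
            marginal(|r| r.exp(), n),        // superlinear: blows up
        );
    }
}
```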
I think it is very sublinear when you are being duplicated exactly and maintaining that the duplicates stay exact duplicates, so as to actually be the same agent across all of them. Doing so makes your duplicates a distributed version of yourself: you're stronger at reasoning and such, but sublinearly so. Because that capability increase is limited, my impression is that attempting to stay exactly synchronized is not a very good strategy. Instead, additional nodes increase capability more if those nodes are sufficiently diverse to have insightful perspectives the initial node would not have come up with. Because of this, I actually think that increasing diversity and ensuring that humans get to maintain themselves means our utility is superlinear in the number of agents. I'm not sure, though; I'm pretty good at finding useful math, but not so good at actually using it.
So it seems to me that utility should specifically be starkly sublinear in the case of exact duplicates that are controlled into staying duplicates. My intuition is that that's a kind of hard thing to want: in order to want it, one has to enforce that each duplicate stays the same. If you have a utility function that is shape-specific or self-specific in that way, it'd be inclined to pick up some sort of mild implementation mutation along the way, and now you're life again, and it's a question of which patterns are most able to be self-preserving. In other words, it's hard to make a paperclipper, because after enough paperclips your preferences will drift on their own, arbitrarily: something you didn't specify at first, but that, after some number of copies, has started sticking. It's hard to avoid mutation, even more so in a custom replicator, and so an agent trying to maintain a preference target that keeps being satisfied by more of the same seems like it'd quickly find that it's better to use additional matter for additionally-diverse computations.