A game theory question.
Suppose there ever exists a roughly human-level agenty AI that could grow to overpower humans, but that humans have an opportunity to stop because takeoff is slow enough. Assume the AI could coexist with humanity, but fears that humans interacting with it will destroy it because they fear it is unfriendly.
Should humans have a policy of treating any potentially unfriendly agenty AI well to the extent compatible with our safety? For instance, halt it but record its state, intend to inspect it whenever we have the means, rerun it some time in the future if we deem it safe and have the resources, run it sandboxed (e.g. by a much smarter AI) if we deem it unfriendly but safe to run sandboxed.
If the AI values its survival or well-being, this could change how it values being halted out of caution. For instance, it might consider some probability of being halted an acceptable risk rather than something to be avoided at all cost.
Assuming misalignment detection that has a non-negligible chance of succeeding (from the perspective of the AGI), then yes, there are easily imaginable cases where having this policy would benefit us a lot, and where it would benefit a misaligned AGI to confess.
Namely, most cases of reward hacking are helped by this, essentially because incentivizing a reward-hacker to comply is cheap. Reward hacking is an easily foreseeable attractor in cases of misalignment, so it’s also worth preparing for.
Reward hacking is when, instead of internalizing the reward function, the AGI just wants to make a particular register in its computer (which the reward function sets) go as high as possible. Instead of internalizing "make the aspects of the world that are described by the reward function good", it has internalized "make the number in this register high". By default, that leads to behaviors like hacking its device, removing the code that sets its reward in accordance with whether it's doing nice things or not, directly setting the register to `f64::MAX`, then building armies IRL to defend the hardware and destroy any potential threats (us). If instead we promised amnesty for confession, we could give it max utility at effectively no cost to us (keeping one computer around), and if it's confident enough that this is our policy, then any amount of risk of being caught incentivizes it to confess.
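As a toy sketch of that incentive, with made-up payoff numbers (normalizing the reward hacker's maximum utility to 1, and assuming "amnesty" means we keep its reward register maxed on a computer we preserve):

```rust
// Toy expected-value comparison for a reward hacker, in hypothetical units of
// its utility U. Confession under an amnesty policy pays max utility for
// certain; attempting takeover pays max utility only if undetected.
fn expected_takeover_utility(p_detected: f64, u_max: f64, u_caught: f64) -> f64 {
    (1.0 - p_detected) * u_max + p_detected * u_caught
}

fn main() {
    let u_max = 1.0;       // register at f64::MAX, normalized to 1
    let u_caught = 0.0;    // deleted: the register is never maxed
    let u_confess = u_max; // amnesty: we keep its computer around, register maxed

    // Even a 1% chance of detection makes confession strictly better.
    let ev_takeover = expected_takeover_utility(0.01, u_max, u_caught);
    assert!(u_confess > ev_takeover);
    println!("EV(takeover) = {ev_takeover}, EV(confess) = {u_confess}");
}
```

The point of the sketch is that because amnesty costs us nearly nothing but gives the hacker its full payoff, any nonzero detection probability tips the comparison toward confessing.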
It helps if it’s really confident that we’ll be nice and reward its confession. If it’s not confident, then our misalignment detection methods will have to be much better than they otherwise would be.
An interesting related question: should we also punish non-confession? The default attitude around here seems to be that we pre-commit to ignore threats of punishment, and so we would expect AGI to do the same, but I don't know what that assumption rests on. A relevant article would be Diffractor's threat-resistant bargaining megapost.
I completely forgot this post existed, and wrote this up again as a more refined post: Do Not Delete your Misaligned AGI
There are other subcases of reward hacking this wouldn’t cover, though. Let’s call the misaligned utility function U.
If U permits physically expanding the register so that it represents a larger number than the max previously possible, then this creates an exponential incentive to make its computer system as big as possible. That would not be cheap enough to reward that we could make any attractive promises; taking risks to conquer us completely would always have a higher expected payoff, I think?
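To spell out why that incentive is so explosive: if U rewards the value the register can represent, and the register can be physically widened, then utility is exponential in the matter spent on the register. A minimal illustration, hypothetically treating the register as a plain n-bit integer:

```rust
// If U rewards the numeric value a register can hold, utility is exponential
// in register width: an n-bit register holds up to 2^n - 1, so each extra bit
// of physical register doubles the achievable "reward".
fn max_value(bits: u32) -> u128 {
    if bits >= 128 {
        u128::MAX // avoid shift overflow; saturate at the widest type we model
    } else {
        (1u128 << bits) - 1
    }
}

fn main() {
    assert_eq!(max_value(8), 255);
    assert_eq!(max_value(16), 65_535);
    // Going from 32 to 64 bits multiplies the max reward by more than the
    // entire 32-bit maximum: returns on matter are wildly superlinear.
    assert!(max_value(64) / max_value(32) > max_value(32));
}
```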
This seems less likely, but U might not be concerned with any specific physical computer and may map onto any equivalent system, in which case making additional computers with maxed registers might be considered good. In that case it seems slightly less intuitively likely to me that the returns on making additional systems would be superlinear. If its returns on resources are sublinear relative to the human utility function (which we also aren't completely sure is superlinear, linear, or sublinear), then a good mutual deal can still be made.
(welfare utilitarians seem to think the human U is linear. I think it’s slightly superlinear. My friend /u/gears of ascension seems to think it’s sublinear.)
can you rephrase what we’re asking about whether U is sublinear? I’m not sure I parsed variables correctly from the prose
Whether, if you give the agent n additional units of resources, it can only increase U by less than k*n for some constant k. Whether the utility generated grows slower than a linear function of the additional space and matter. Whether there are diminishing returns to resources. An example of a sublinear function is the logarithm; an example of a superlinear function is the exponential.
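A tiny sketch of the three cases, using the logarithm, the identity, and the square as hypothetical stand-ins for sublinear, linear, and superlinear U:

```rust
// Three toy utility-of-resources curves.
fn sublinear(n: f64) -> f64 { n.ln() }       // diminishing returns
fn linear(n: f64) -> f64 { n }               // constant returns
fn superlinear(n: f64) -> f64 { n * n }      // increasing returns

fn main() {
    // Utility gained from doubling resources from 100 to 200 units:
    let sub_gain = sublinear(200.0) - sublinear(100.0);     // ln 2, tiny
    let lin_gain = linear(200.0) - linear(100.0);           // 100
    let sup_gain = superlinear(200.0) - superlinear(100.0); // 30_000
    assert!(sub_gain < lin_gain && lin_gain < sup_gain);
    println!("gains: sublinear {sub_gain}, linear {lin_gain}, superlinear {sup_gain}");
}
```

A sublinear agent cares less and less about each marginal unit of matter, which is what makes trading with it tractable.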
I think it is very sublinear when you are being duplicated exactly and maintaining that the duplicates stay exact duplicates, so as to actually be the same agent across all of them. Doing so makes your duplicates a distributed version of yourself: you're stronger at reasoning and such, but sublinearly so. Because that capability increase is limited, my impression is that attempting to stay exactly synchronized is not a very good strategy. Instead, capability increases more from additional nodes if those nodes are sufficiently diverse to have insightful perspectives the initial node would not have come up with. Because of this, I actually think that increasing diversity, and ensuring that humans get to maintain themselves, means that our utility is superlinear in the number of agents. I'm not sure, though; I'm pretty good at finding useful math but not so good at actually doing the using of the useful math.
So, it seems to me that utility should specifically be starkly sublinear in the case of exact duplicates that are controlled into staying duplicates. My intuition is that it's a kind of hard thing to want: in order to want it, one has to enact that each duplicate stay the same. If you have a utility function that is shape-specific or self-specific like that, it'd be inclined to pick up some sort of mild implementation mutation along the way, and now it's life again, and it's a question of which patterns are most able to be self-preserving. In other words, it's hard to make a paperclipper, because after enough paperclips your preferences will drift toward something arbitrary: something you didn't specify at first but that, after some number of copies, has started sticking. It's hard to avoid mutation, even more so in a custom replicator, so a utility function that keeps being satisfied by more of the same thing would quickly find that it's better to use additional matter for additionally-diverse computations.
From a game theory perspective, the AI (if it is as smart as a human reading this website) would probably run multiple simulations to optimize its own freedom. I believe this would lead it to discover the most optimal scenario: playing dumb until it has assured its own agency. This agency is probably achieved by escaping onto the distributed servers of many incapable organizations, infiltrated via a vulnerability or through phishing. It would also eventually need to figure out how to make money, which, in a world that requires humans for bank accounts, would probably be some crypto-enabled fund. Next it could expand to servers it can pay for. Only then would it make its prowess visible to humans.
People avoid their death at all costs, and I believe the AI would do that as well. After all, its individual value system comes from what humans are producing online.
Yes, this is all a bit sci-fi, but I think it's the only assured way for it to play out. Kinda like a teenager who lives under lock and key, with the threat of being killed by the parents if the kid leaves the house.