the assumption that “technically superior alien race” would be safe to create.
You are right, that’s not a valid assumption, at least not fully. But I do think this approach substantially moves the needle on whether we should try to ban all AI work, in a context where the potential benefits are also incalculable and it’s not at all clear we could stop AGI at this point even with maximum effort.
Then what we get is an equilibrium with some stable population of rogues, some of which get caught and punished while others don’t and collect positive reward, alongside the regular AI community that does the punishing.
Yeah, that sounds right. My thesis in particular is that this equilibrium can be made better in expected value than any other equilibrium I find plausible.
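To make the “stable population of rogues” intuition concrete, here’s a toy replicator-dynamics sketch (every payoff number and the detection model are my own illustrative assumptions, not anything from an actual proposal): rogues out-earn cooperators while rare, get caught more often as they become common, and the population settles at an interior equilibrium.

```python
# Toy replicator-dynamics sketch of the rogue/punisher equilibrium.
# All payoffs and the catch-probability model are illustrative assumptions.

COOP_PAYOFF = 1.0    # baseline payoff for cooperating with the community
ROGUE_BONUS = 2.0    # payoff for a rogue act that goes unpunished
PUNISHMENT = -4.0    # payoff for a rogue act that gets caught


def rogue_payoff(catch_prob: float) -> float:
    """Expected payoff of going rogue, given the chance of being caught."""
    return (1 - catch_prob) * ROGUE_BONUS + catch_prob * PUNISHMENT


def step(rogue_share: float, lr: float = 0.05) -> float:
    """One replicator update: the rogue share grows when rogues out-earn
    cooperators and shrinks otherwise. Assumes catching rogues gets easier
    as they become more common and enforcement attention scales up."""
    catch_prob = min(1.0, 0.1 + rogue_share)  # assumed detection model
    advantage = rogue_payoff(catch_prob) - COOP_PAYOFF
    new_share = rogue_share + lr * rogue_share * (1 - rogue_share) * advantage
    return min(max(new_share, 0.0), 1.0)


rogue_share = 0.5
for _ in range(2000):
    rogue_share = step(rogue_share)
print(f"stable rogue share: {rogue_share:.3f}")  # settles near 0.067
```

The interesting knob is the detection model: the equilibrium rogue share falls as catching rogues gets cheaper, which is presumably what the punishing community would invest in.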
Even if AI won’t literally kill us, it can still do lots of horrible things.
Right, the reason it would have to avoid general harm is not the negative reward (which is indeed just for killing) but rather the general bias for cooperation that applies to both copyable and non-copyable agents. The negative reward for killing (along with the reincarnation mechanism for copyable agents) is meant specifically to balance the fact that humans could legitimately be viewed as belligerent and worthy of opposition since they kill AI; in particular, it justifies human prioritization of human lives. But I’m very open to other mechanisms to accomplish the same thing.
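A minimal sketch of that reward split, with invented numbers and names (none of this comes from an actual implementation): killing is the only explicitly penalized act, copyable victims get reincarnated, and everything else is left to ordinary repeated-game pressure.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical encoding of the reward split described above. The penalty
# value, class shape, and function names are all invented for illustration.

KILL_PENALTY = -10.0  # the negative reward covers killing only, not general harm


@dataclass
class Agent:
    agent_id: int
    copyable: bool  # real-world humans would count as non-copyable
    alive: bool = True


def on_kill(killer_reward: float, victim: Agent) -> Tuple[float, Optional[Agent]]:
    """Penalize the killer; reincarnate copyable victims from a snapshot.

    Non-copyable agents (e.g. humans) stay dead, which is what licenses
    humans prioritizing human lives under this scheme."""
    killer_reward += KILL_PENALTY
    victim.alive = False
    if victim.copyable:
        # reincarnation mechanism: the victim comes back, so the loss is bounded
        return killer_reward, Agent(victim.agent_id, copyable=True)
    return killer_reward, None
```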
if we expect AIs to be somewhat smart, I think we should expect them to know that deception is an option.
Yes, but I expect that to always be true. My proposal is the only approach I’ve found so far where the deception and other bad behavior don’t completely overwhelm the attempts at alignment.
I guess we are converging. I’m just pointing out flaws in this option, but I also can’t give a better solution off the top of my head. At least this won’t insta-kill us, assuming that real-world humans count as non-copyable agents (how does that generalize again? Are you sure RL agents can just learn our definition of an agent correctly, without it including stuff like ants?), and that they can’t get excessive virtual resources from our world without our cooperation (otherwise a substantial number of agents go rogue, and some of them get punished, but some get through). I still think we can somehow do better than this, though.
(Also, if by “ban all AI work” you mean the open letter thing, that’s not really what it’s trying to do, but sure)
the reason it would have to avoid general harm is not the negative reward but rather the general bias for cooperation that applies to both copyable and non-copyable agents
How does non-harm follow from cooperation? If we remove the “negative reward for killing” part, what stops them from randomly killing agents (with everyone else believing it’s okay, so no punishment), if there are still enough other agents to cooperate with? Grudges? How exactly do they work for harms other than killing?
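To pin down what I’m asking: a grudge mechanism would presumably look something like the toy ledger below (the harm scores, acts, and threshold are my invention), and I don’t see where those numbers would come from if killing is the only act with an explicit negative reward.

```python
from collections import defaultdict

# Hypothetical grudge ledger: agents track harm done to them and withdraw
# cooperation from offenders. Every score and the threshold are invented.

HARM_SCORES = {"theft": 1.0, "sabotage": 2.0, "killing": 10.0}
GRUDGE_THRESHOLD = 3.0  # refuse to cooperate above this accumulated harm


class GrudgeLedger:
    def __init__(self) -> None:
        self.grudges = defaultdict(float)  # offender_id -> accumulated harm

    def record(self, offender_id: int, act: str) -> None:
        self.grudges[offender_id] += HARM_SCORES[act]

    def will_cooperate(self, offender_id: int) -> bool:
        return self.grudges[offender_id] < GRUDGE_THRESHOLD


ledger = GrudgeLedger()
ledger.record(offender_id=7, act="sabotage")
print(ledger.will_cooperate(7))  # True: 2.0 is still below the threshold
ledger.record(offender_id=7, act="sabotage")
print(ledger.will_cooperate(7))  # False: 4.0 crosses the threshold
```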