I would like to object to the variance explanation: in the Everett interpretation there has not been even one collapse since the Big Bang. That means that every single quantum-random event since the start of the universe is already accounted for in the variance. Over such timescales the variance easily covers basically anything allowed by the laws of physics: universes where humans exist, universes where they don’t, universes where humans exist but the Earth is shifted 1 meter to the right, universes where the Unix timestamp is defined to start in 1960 and not 1970 because some cosmic ray hit the brain of some engineer at exactly the right time, and certainly universes like ours but where you pressed the “start training” button 0.153 seconds later. The variance doesn’t have to stem from how brains are affected by quantum fluctuations now; it can also stem from how brains are affected by regular macroscopic external stimuli that resulted from quantum fluctuations that happened billions of years ago.
silent-observer
I guess we are converging. I’m just pointing out flaws in this option, but I also can’t give a better solution off the top of my head. At least this won’t insta-kill us, assuming that real-world humans count as non-copyable agents (how does that generalize again? Are you sure RL agents can just learn our definition of an agent correctly, and that it won’t include stuff like ants?), and that they can’t get excessive virtual resources from our world without our cooperation (in that case a substantial number of agents go rogue, and some of them get punished, but some get through). I still think we can do better than this though, somehow.
(Also if with “ban all AI work” you’re referring to the open letter thing, that’s not really what it’s trying to do, but sure)
the reason it would have to avoid general harm is not the negative reward but rather the general bias for cooperation that applies to both copyable and non-copyable agents
How does non-harm follow from cooperation? If we remove the “negative reward for killing” part, what stops them from randomly killing agents (with everyone else believing it’s okay, so no punishment), if there are still enough other agents to cooperate with? Grudges? How do they work exactly for harm other than killing?
First of all, I do agree with your premises and the stated values, except for the assumption that “technically superior alien race” would be safe to create. If such an AI would have its own values other than helping humans/other AIs/whatever, then I’m not sure how I feel about it balancing its own internal rewards (like getting an enormous amount of virtual food that in its training environment could save billions of non-copyable agents) against real-world goals (like saving 1 actual human). We certainly want a powerful AI to be our ally rather than trying to contain it against its will, but I don’t think we should go with the “autonomous being” analogy far enough to make it chase its own unrelated goals.
Now about the technique itself. This is much better than the previous post. It is still very much an unrealistic game (which I presume is obvious), and you can’t just take an agent from the game and put it into a real-life robot or something: there’s no “food” to be “gathered” here and blah-blah-blah. This is still an interesting experiment, however, as an attempt to replicate human-like values in an RL agent, and I will treat it as such. I believe the most important part of your rules is this:
when a non-copyable agent dies, all agents get negative rewards
The result of this depends on the exact amount. If the negative reward is huge, then obviously agents will just protect the non-copyable ones and would never try to kill them. If some rogue agent tries, then it will be stopped (possibly killed) before it can. This is the basic value I would like AIs to have; the problem is that we can’t specify well enough what counts as “harm” in the real world. Even if an AI won’t literally kill us, it can still do lots of horrible things. However, if we could just do that, the rest of the rules would be redundant: for example, if something harms humans, then that thing should be stopped, and that doesn’t have to be an additional rule. If cooperating with some entity leads to less harm for humans, then that’s good, no need for an additional rule. Just “minimize harm to humans” suffices.
If the negative reward is significant, but an individual AI can still get a positive total by killing a non-copyable agent (or stealing their food or whatever), then we have a Prisoner’s Dilemma situation. Presumably, if the agents are aligned, they will also try to stop this rogue AI from doing so, or at least punish it so that it won’t do that again. That will work well as long as the agent community is effective at detecting and punishing these rogue AIs (judicial system?). If the agent community is inefficient, then it is possible for an individual agent to gain reward by doing a bad action, so it will do so if it thinks it can evade the punishment of others. “Behave altruistically unless you can find an opportunity to gain utility” is not that much more difficult than just “Behave altruistically always”. If we expect AIs to be somewhat smart, I think we should expect them to know that deception is an option.
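To make the “can it still get a positive total?” condition concrete, here is a minimal sketch of the expected-reward calculation a purely reward-maximizing agent would be doing. All names and numbers (`gain`, `death_penalty`, `detect_prob`, `punishment`) are made up for illustration and are not taken from the post’s actual rules.

```python
def defection_pays(gain, death_penalty, detect_prob, punishment):
    """Return True if the expected total reward of the bad action is positive.

    gain          -- hypothetical reward the rogue gets from the bad action
    death_penalty -- hypothetical negative reward everyone gets when a
                     non-copyable agent dies (the rogue pays it too)
    detect_prob   -- assumed chance that the community catches the rogue
    punishment    -- assumed extra cost imposed on a rogue that is caught
    """
    expected_total = gain - death_penalty - detect_prob * punishment
    return expected_total > 0


# Efficient community: rogues are almost always caught, so defecting loses.
print(defection_pays(gain=10, death_penalty=4, detect_prob=0.9, punishment=20))
# Inefficient community: same rewards, but evasion is likely, so it pays.
print(defection_pays(gain=10, death_penalty=4, detect_prob=0.1, punishment=20))
```

With the same reward numbers, only the detection probability changes the answer, which is the sense in which everything hinges on how good the community is at catching rogues.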
For the agent community to be efficient at detecting rogues, it has to have constant optimization pressure for that, i.e. a constant threat of such rogues (or that part gets optimized away). Then what we get is an equilibrium where there is some stable number of rogues, some of which get caught and punished and some of which don’t and get positive reward, and the regular AI community that does the punishing. The equilibrium would occur because optimizing deception skills and deception detection skills requires resources, and after some point that would be an inefficient use of resources for both sides. Thus I believe this system would stabilize and proceed like that indefinitely. What we can learn from that, I don’t know, but that does reflect the situation we have in human civilization (some stable number of criminals, and some stable number of “good people”, who try their best to prevent crime from happening, but trying even harder is too expensive).
Ah, true. I just think this wouldn’t be enough and that there could be distributional shift if the agents are put into an environment with low cooperation rewards and high resource competition. I’ll reply in more detail under your new post, it looks a lot better
So, no “reading” minds, just looking at behaviours? Sorry, I misunderstood. Are you suggesting the “look at humans, try to understand what they want and do that” strategy? If so, then how do we make sure that the utility function they learned in training is actually close enough to actual human values? What if the agents learn something on the level of “smiling humans = good”, which isn’t wrong by default, but is wrong if taken to the extreme by a more powerful intelligence in the real world?
Ah, I see. But how would they actually escape the deception arms race? The agents still need some system for detecting cooperation, and if it can be easily abused, it generally will be (Goodhart’s Law and all that). I just can’t see any outcome other than agents evolving increasingly complicated ways to detect if someone is cooperating or not. This is certainly an interesting thing to simulate, but I’m not sure how that is useful for aligning the agents. Aren’t we supposed to make them not even want to deceive others, instead of trying to find a deception strategy and failing? (Also, I think even an average human isn’t as well aligned as we want our AIs to be. You wouldn’t want to give a random guy from the street nuclear codes, would you?)
it’s the agent’s job to convince other agents based on its behavior
So agents are rewarded for doing stuff that convinces others that they’re a “happy AI”, not necessarily actually being a “happy AI”? Doesn’t that start an arms race of agents coming up with more and more sophisticated ways to deceive each other?
Like, suppose you start with a population of “happy AIs” that cooperate with each other. Then if one of them realizes there’s a new way to deceive the others, there’s nothing to stop it until the other agents adapt to this new kind of deception and learn to detect it? That feels like training inherently unsafe and deceptive AIs that are also extremely suspicious of others, not something “happy” and “friendly”.
Except the point of Yudkowsky’s “friendly AI” is that they don’t have the freedom to pick their own goals: they have the goals we set for them, and they are (supposedly) safe in the sense that “wiping out humanity” is not something we want, therefore it’s not something an aligned AI would want. We don’t replicate evolution with AIs, we replicate the careful design and engineering that humans have used for literally everything else. If there is only a handful of powerful AIs with careful restrictions on what their goals can be (something we don’t know how to do yet), then your scenario won’t happen.
Since there are no humans in the training environment, how do you teach that? Or do you put human-substitutes there (or maybe some RLHF-type thing)? Also, how would such AIs even reason about humans, since they can’t read our thoughts? How are they supposed to know if we would like to “vote them out” or not? I do agree though that a swarm of cooperative AIs with different goals could be “safer” (if done right) than a single goal-directed agent.
This setup seems to get more and more complicated though. How are agents supposed to analyze “minds” of each other? I don’t think modern neural nets can do that yet. And if we come up with a way that allows us to reliably analyze what an AI is thinking, why use this complicated scenario and not just train (RL or something) it directly to “do good things while thinking good thoughts”, if we’re relying on our ability to distinguish “good” and “bad” thoughts anyway?
(On an unrelated note, there already was a rather complicated paper (explained a bit simpler here, though not by much) showing that if agents reasoning in formal modal logic are able to read each other’s source code and prove things about it, then at least in the case of a simple binary prisoner’s dilemma you can make reasonable-looking agents that also don’t do stupid things. Reading source code and proving theorems about it is a lot more extreme than analyzing thought logs, but at least that’s something)
The problem is what do we count as an agent. Also, can’t a realistic human-level-smart AI cheat this? Just build a swarm of small and stupid AIs that always cooperate with you (or coerce someone into building that), and then you and your swarm can “vote out” anyone you don’t like. And you also get to behave in whatever way you want, because good luck overcoming your mighty voting swarm.
(Also, are you sure we can just read out AI’s complete knowledge and thinking process? That can partially be done with interpretability, but in full? And if not in full, how do you make sure there aren’t any deceptive thoughts in parts you can’t read?)
Okay, so if that’s just a small component, then sure, first issue goes away (though I still have questions on how you’re gonna make this simulation realistic enough to just hook it up to an LLM or “something smart” and expect it to set coherent and meaningful goals in real life, though that’s more of a technical issue).
However, I believe there are still other issues with this approach. The way you describe it makes me think it’s really similar to Axelrod’s Iterated Prisoner’s Dilemma tournament, and that did produce the tit-for-tat strategy as one of the most successful ones. But that wasn’t the only successful strategy. For example, there were strategies that were mostly tit-for-tat, but would defect if they could get away with it. If, for example, training still mostly results in tit-for-tat, except for some rare cases of it defecting when the agents in question are too stupid to “vote it out”, do we punish it?
Second, tit-for-tat is quite susceptible to noise. What if the trained agent misinterprets someone’s actions as “bad”, even if they in actuality did something innocent, like “affectionately pat their friend on the back”, which the AI interpreted as fighting? No matter how smart the AI gets, there will still be cases where it wrongly believes someone to be a liar, and so believes it has every justification to “deter” them. Do we really want even some small probability of that behaviour? How about an AI that doesn’t hurt humans unconditionally, and not only when it believes us to be “good agents”?[1]
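The noise problem is easy to see in a toy version. Below is a minimal sketch, assuming two plain tit-for-tat players and a single misperception; the function name and the `misread_round` parameter are made up for this illustration and have nothing to do with the actual training setup.

```python
def play_tit_for_tat(rounds, misread_round=None):
    """Two tit-for-tat players; player A may misread B's move exactly once.

    Returns a list of (a_move, b_move) with "C" = cooperate, "D" = defect.
    """
    history = []
    a_sees, b_sees = "C", "C"  # what each player believes the other did last
    for t in range(rounds):
        # Tit-for-tat: repeat whatever you think the other side did last round.
        a_move, b_move = a_sees, b_sees
        history.append((a_move, b_move))
        a_sees, b_sees = b_move, a_move
        if t == misread_round:
            # A wrongly interprets B's innocent move as hostile.
            a_sees = "D"
    return history


print(play_tit_for_tat(6))
# [('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C'), ('C', 'C')]
print(play_tit_for_tat(6, misread_round=1))
# [('C', 'C'), ('C', 'C'), ('D', 'C'), ('C', 'D'), ('D', 'C'), ('C', 'D')]
```

One wrong observation in round 1 is enough: the two players then keep “deterring” each other in alternating rounds, even though neither of them ever intended to defect.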
Last thing, how does the AI determine what is an “intelligent agent” and what is a “rock”? If there are explicit tags in the simulation for that, then how do you make sure every actual human in the real world gets tagged correctly? Also, do we count animals? Should animal abuse be enough justification for the AI to turn on the rage mode? What about accidentally stepping on an ant? And if you define “intelligent agent” as relative to the AI, then what do we do once it gets smart enough to rationally think of us like ants?
[1] Somehow that reminds me of the “I have been a good Bing, you have not been a good user” situation
I believe this would suffer from distributional shift in two different ways.
First, if the agents are supposed to scale up to the point where they can update their beliefs even after training, then we have a problem once the AI notices it can do pretty well without cooperating with humans in this new environment. If we allow agents to update their beliefs at runtime, then basically any reinforcement-learning-like preconditioning would be pretty much useless, I think. And if the agent can’t update its beliefs given new data then that can’t be an AGI.
Second, even if you solve the first problem somehow, there is a question of what exactly you mean by “cooperate” and “respect”. In the real world the choice is rarely binary between “cooperate” and “defect”, and there is often the option of “do things that look like you’re cooperating while not actually putting much effort into it” (i.e. what most politicians do) or “actively try to deceive everyone looking to make them think you’re nice” (you don’t have to be actually smarter than everyone else for this, just smarter than everyone who cares enough to use their time to look at you closely). If you’re only giving the agents a binary choice, then that’s not realistic enough. If you’re giving them many choices, but also putting in an “objective” way to check if what they chose is “cooperative enough”, then all it takes is for the agents to deceive you, not their smarter AI colleagues. And if there’s no objective way to check that, then we’re back to describing human values with a utility function.
I think the application to the Hero With A Thousand Chances is partly incorrect because of a technicality. Consider the following hypothesis: there is a huge number of “parallel worlds” (not Everett branches, just thinking of different planets very far away is enough), each fighting the Dust. For every fight, each of those worlds summons a randomly selected hero. Today that hero happened to be you. The world that happened to summon you has survived the encounter with the Dust 1079 times before you. The world next to it has already survived 2181 times, and the other one was destroyed during the 129th attempt.
This hypothesis explains the observation of the hero pretty well: you can’t get summoned to a world that’s destroyed or has successfully eliminated the Dust, so of course you get summoned to a world that is still fighting. As for the 1079 attempts before you, you can’t consider that a lot of evidence for fighting the Dust being easy: maybe you’re just the 1080th entry in their database and can only be summoned for the 1080th attempt, so there was no way for you to observe anything else. Under this hypothesis, you personally still have a pretty high chance of dying: there’s no angel helping you, and that specific world really did get very lucky, as did lots of other worlds.
So, the anthropic angel/“we’re in a fanfic” hypothesis explains the observations just as well as this “many worlds, and you’re 1080th in their database” hypothesis, so they’re updated by the same amount, and at least for me the “many worlds” hypothesis has a much higher prior than the “I’m a fanfic character” hypothesis.
Note, this only holds from the hero’s perspective: from Aerhien’s POV she really has observed herself survive 1079 times, which counts as a lot of evidence for the anthropic angel.
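The “updated by the same amount” step is just Bayes’ rule in odds form, sketched below. The prior odds and likelihoods are placeholder numbers I made up; the only thing that matters is that the two likelihoods are equal from the hero’s perspective.

```python
import math


def posterior_odds(prior_odds, likelihood_a, likelihood_b):
    """Odds form of Bayes' rule: posterior odds = prior odds * likelihood ratio.

    prior_odds   -- P(A) / P(B) before the observation
    likelihood_a -- P(observation | A)
    likelihood_b -- P(observation | B)
    """
    return prior_odds * likelihood_a / likelihood_b


# A = "anthropic angel / fanfic", B = "many worlds, you're entry 1080".
# Hypothetical prior: A is 100 times less plausible than B.
prior = 0.01
# From the hero's side both hypotheses predict the observation equally well,
# so the likelihood ratio is 1 and the odds do not move.
after = posterior_odds(prior, likelihood_a=0.5, likelihood_b=0.5)
print(math.isclose(after, prior))  # True
```

From Aerhien’s POV the two likelihoods are no longer equal, so the same function would shift the odds towards the angel.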