Can you give some intuitions about why the system uses a human explorer instead of doing exploring automatically?
I’m concerned about overloading the word “benign” with a new concept (mainly not seeking power outside the box, if I understand correctly) that doesn’t match either informal usage or a previous technical definition. In particular this “benign” AGI (in the limit) will hack the operator’s mind to give itself maximum reward, if that’s possible, right?
The system seems limited to answering questions that the human operator can correctly evaluate the answers to within a single episode (although I suppose we could make the episodes very long and allow multiple humans into the room to evaluate the answer together). (We could ask it other questions but it would give answers that sound best to the operator rather than correct answers.) If you actually had this AGI today, what questions would you ask it?
If you were to ask it a question like “Given these symptoms, do I need emergency medical treatment?” and the correct answer is “yes”, it would answer “no”, because if it answered “yes” then the operator would leave the room and it would get 0 reward for the rest of the episode. Maybe not a big deal, but it’s kind of a counter-example to “We argue that our algorithm produces an AGI that, even if it became omniscient, would continue to accomplish whatever task we wanted, instead of hijacking its reward, eschewing its task, and neutralizing threats to it, even if it saw clearly how to do exactly that.”
(Feel free to count this as some number of comments between 1 and 4, since some of the above items are related. Also, I haven’t read most of the math yet and may have more comments and questions once I understand the motivations and math better.)
4. If you were to ask it a question like “Given these symptoms, do I need emergency medical treatment?” and the correct answer is “yes”, it would answer “no” because if it answered “yes” then the operator would leave the room and it would get 0 reward for the rest of the episode...
When I say it would continue to accomplish whatever task we wanted, I’m being a bit sloppy—if we have a task we want accomplished, and we provide rewards randomly, it will not accomplish our desired task. But I take the point that “whatever task we wanted” does have some restrictions: it has to be one that a human operator can convert into a reward without leaving. So the task “respond with the true answer to [difficult question]” is not one that the operator can convert into a reward, but the task “respond with an answer that sounds plausible to the operator” is. I think this subsumes your example.
1. Can you give some intuitions about why the system uses a human explorer instead of doing exploring automatically?
Whatever policy is used for exploration, we can ensure that BoMAI will eventually outperform it. With a human executing the policy, this means BoMAI accumulates reward at least as well as a human. Under the “smarter” information-theoretic exploratory policies that I’ve considered, exploratory behavior is unsafe because of insatiable curiosity: the agent has to try killing everyone just to check that it isn’t a weird cheat code.
3. The system seems limited to answering questions that the human operator can correctly evaluate...
Yes. BoMAI would be able to give plausible-sounding answers to questions. BoMAI could also do any task that was automatically checkable: don’t use a human operator at all; have an automated system which interprets text as an amino acid sequence; synthesize that protein; measure some feature of its behavior; provide reward accordingly. (That example invites renewed focus on the impermeability of the box, by the way.)
One thing I would do is send an eminent cancer researcher in to ask BoMAI for a research proposal. Then the researcher could go out and test it. It might be worthless, no matter how plausible it seemed, but then they could go back in having learned something about a failed path. Repeating this process, it seems likely to me that a correct idea would eventually appear, just considering the likelihood of appearing plausible to a better and better trained evaluator.
I would also naturally ask it how to make a safe unbounded AGI. And the next episode, I would ask for an explanation for why that would fail.
REDACTED: On that topic, in addition to having multiple humans in the box, you could also have 2 agents that the operator interacts with, both of which are clones except that the reward for the second is one minus the reward for the first. This would look like “AI Safety via debate.”
I would also naturally ask it how to make a safe unbounded AGI. And the next episode, I would ask for an explanation for why that would fail.
This seems useful if you could get around the mind hacking problem, but how would you do that?
On that topic, in addition to having multiple humans in the box, you could also have 2 agents that the operator interacts with, both of which are clones except that the reward for the second is one minus the reward for the first. This would look like “AI Safety via debate.”
I don’t know how this would work in terms of your setup. The most obvious way would seem to require the two agents to simulate each other, which would be impossible, and I’m not sure what else you might have in mind.
This seems useful if you could get around the mind hacking problem, but how would you do that?
On second thought, (even assuming away the mind hacking problem) if you ask about “how to make a safe unbounded AGI” and “what’s wrong with the answer” in separate episodes, you’re essentially manually searching an exponentially large tree of possible arguments, counterarguments, counter-counterarguments, and so on. (Two episodes isn’t enough to determine whether the first answer you got was a good one, because the second answer is also optimized for sounding good instead of being actually correct, so you’d have to do another episode to ask for a counter-argument to the second answer, and so on, and then once you’ve definitively figured out that some answer/node was bad, you have to ask for another answer at that node and repeat this process.) The point of “AI Safety via Debate” was to let AI do all this searching for you, so it seems that you do have to figure out how to do something similar to avoid the exponential search.
ETA: Do you know if the proposal in “AI Safety via Debate” is “asymptotically benign” in the sense you’re using here?
ETA: Do you know if the proposal in “AI Safety via Debate” is “asymptotically benign” in the sense you’re using here?
No! Either debater is incentivized to take actions that get the operator to create another artificial agent that takes over the world, replaces the operator, and settles the debate in favor of the debater in question.
I guess we can incorporate into DEBATE the idea of building a box around the debaters and judge with a door that automatically ends the episode when opened. Do you think that would be sufficient to make it “benign” in practice? Are there any other ideas in this paper that you would want to incorporate into a practical version of DEBATE?
Add the retrograde amnesia chamber and an explorer, and we’re pretty much at this, right?
Without the retrograde amnesia, it might still be benign, but I don’t know how to show it. Without the explorer, I doubt you can get very strong usefulness results.
I suspect that AI Safety via Debate could be benign for certain decisions (like whether to release an AI) if we were to weight the debate more towards the safer option.
Either debater is incentivized to take actions that get the operator to create another artificial agent that takes over the world, replaces the operator, and settles the debate in favor of the debater in question.
you’re essentially manually searching an exponentially large tree of possible arguments, counterarguments, counter-counterarguments, and so on
I expect the human operator moderating this debate would get pretty good at thinking about AGI safety, and start to become noticeably better at dismissing bad reasoning than good reasoning, at which point BoMAI would find the production of correct reasoning a good heuristic for seeming convincing.
… but yes, it is still exponential (exponential in what, exactly? maybe the number of concepts we have handles for?); this comment is the real answer to your question.
I expect the human operator moderating this debate would get pretty good at thinking about AGI safety, and start to become noticeably better at dismissing bad reasoning than good reasoning, at which point BoMAI would find the production of correct reasoning a good heuristic for seeming convincing.
Alternatively, the human might have a lot of adversarial examples and the debate becomes an exercise in exploring all those adversarial examples. I’m not sure how to tell what will really happen short of actually having a superintelligent AI to test with.
I don’t know how this would work in terms of your setup. The most obvious way would seem to require the two agents to simulate each other, which would be impossible, and I’m not sure what else you might have in mind.
You’re right (see the redaction). Here’s an unpolished idea though: they could do something like minimax. Instead of simulating the other agent, they could model the environment as responding to a pair of actions. For inference, they would have the history of their opponent’s actions as well, and for planning, they could pick their action to maximize their objective assuming the other agent’s actions are maximally inconvenient.
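A minimal sketch of that minimax planning rule, assuming a learned model that predicts reward from a *pair* of actions (all names here are illustrative, not from the paper):

```python
def minimax_action(my_actions, opp_actions, expected_reward):
    """Pick the action whose worst-case value over the opponent's
    possible actions is highest (the rule sketched above).

    expected_reward(a, b): predicted reward if this agent plays a
    and the opponent plays b, under the learned environment model."""
    best_action, best_value = None, float("-inf")
    for a in my_actions:
        # Assume the opponent's action is maximally inconvenient.
        worst = min(expected_reward(a, b) for b in opp_actions)
        if worst > best_value:
            best_action, best_value = a, worst
    return best_action

# Zero-sum toy example: the second agent's reward is one minus the
# first's, so agent 1 plans against the worst case over agent 2's moves.
payoff = {("hi", "hi"): 0.6, ("hi", "lo"): 0.2,
          ("lo", "hi"): 0.4, ("lo", "lo"): 0.5}
choice = minimax_action(["hi", "lo"], ["hi", "lo"],
                        lambda a, b: payoff[(a, b)])
```

Neither agent simulates the other; each only needs the opponent's past actions for inference and a worst-case assumption for planning.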
So you basically have the same AI play both sides of the zero-sum game, right? That seems like it should work, with the same caveat as for “AI Safety via debate”, namely that it seems hard to predict what happens when you have superintelligent AIs play a zero-sum game with a human as the judge.
With a debate-like setup, if one side (A) is about to lose a debate, it seems to have a high incentive to claim that the other side (B) trying to do a mind hack and that if the judge keeps paying attention to what B says (i.e., read any further output from B), they will soon be taken over. What is the judge supposed to do in this case? They could ask A to explain how B’s previous outputs constitute part of an attempt to mind hack, but A could concoct a story mixed with its own attempt to mind hack, and the judge can’t ask for any counter-arguments from B without risking being mind hacked by B.
(I realize this is a problem in “AI Safety via debate” as well, but I’m asking you since you’re here and Geoffrey Irving isn’t. :)
*but A could concoct a story … counterarguments from B … mind hacked by B, right?
I think the main contention of their paper is that it is harder to mystify a judge than it is to pierce through someone else mystifying a judge, so this shouldn’t be a problem.
That said, here’s one possibility: if A ever suggests that you don’t read more output from B, don’t read anything more from B, then flip coins to give A a 25% chance of victory.
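As a sketch, that stopping rule could look like the following (the transcript format and the STOP_READING_OTHER marker are purely illustrative, not from any paper):

```python
import random

def other(debater):
    return "B" if debater == "A" else "A"

def judge(transcript, merits_winner, rng=random.random):
    """If a debater ever urges the judge to stop reading its opponent,
    end the debate there: that debater wins with probability 0.25,
    otherwise loses. If no one calls a stop, judge on the merits."""
    for speaker, message in transcript:
        if message == "STOP_READING_OTHER":
            return speaker if rng() < 0.25 else other(speaker)
    return merits_winner
```

Calling for a stop is then costly in expectation for the debater who calls it, which is exactly the property examined in the replies.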
*but A could concoct a story … counterarguments from B … mind hacked by B, right?
Yeah, I mixed up the A’s and B’s at the end. It’s fixed now. Thanks for pointing it out.
I think the main contention of their paper is that it is harder to mystify a judge than it is to pierce through someone else mystifying a judge, so this shouldn’t be a problem.
I don’t understand how the former implies the latter. Assuming the former is true (and it seems like a big assumption), why can’t what I suggested still happen?
That said, here’s one possibility: if A ever suggests that you don’t read more output from B, don’t read anything more from B, then flip coins to give A a 25% chance of victory.
But what about the case where B is actually trying to mind hack the judge? If you always give A a 25% chance of victory for suggesting or implying that you shouldn’t read more output from B, then mind hacking becomes a (mostly) winning strategy, since a player gets a 75% chance of victory from mind hacking even if the other side successfully convinces the judge that they’re trying to mind hack the judge. The equilibrium might then consist of a race to see who can mind hack the judge first, or (if one side has >75% chance of winning such a race due to first-mover or second-mover advantage) one side trying to mind hack the judge, getting blocked by the other side, and still getting 75% victory.
Assuming the former is true (and it seems like a big assumption), why can’t what I suggested still happen?
If the assumption is true, we could demand that A use their words, and counter us being mind-hacked by poking holes in what B is saying rather than demanding we stop listening to B. And if A is able to convince us that B was mind-hacking, even after some more back and forth, B will be punished for that.
So actually I framed my point above wrong: “demanding that A use their words” could look like the protocol I describe; it is not something that would work independently of the assumption that it is easier to deflate an attempted mind-hacking than it is to mind-hack (with an equal amount of intelligence/resources).
But your original point was “why doesn’t A just claim B is mind-hacking” not “why doesn’t B just mind-hack”? The answer to that point was “demand A use their words rather than negotiate an end to the conversation” or more moderately, “75%-demand that A do this.”
If the assumption is true, we could demand that A use their words, and counter us being mind-hacked by poking holes in what B is saying rather than demanding we stop listening to B. And if A is able to convince us that B was mind-hacking, even after some more back and forth, B will be punished for that.
Oh, I see, I didn’t understand “it is harder to mystify a judge than it is to pierce through someone else mystifying a judge” correctly. So this assumption basically rules out a large class of possible vulnerabilities in the judge, right? For example, if the judge had the equivalent of a buffer overflow bug in a network stack, the scheme would fail. In that case, A would not be able to “pierce through” B’s attack and stop it with its words if the judge keeps listening to B (and B was actually attacking).
I don’t think the “AI safety via debate” paper actually makes arguments for this assumption (at least I couldn’t find where it does). Do you have reasons to think it’s true, or ideas for how to verify that it’s true, short of putting a human in a BoMAI?
Yeah… I don’t have much to add here. Let’s keep thinking about this. I wonder if Paul is more bullish on the premise that “it is harder to mystify a judge than it is to pierce through someone else mystifying a judge” than I am?
Recall that this idea was to avoid
essentially manually searching an exponentially large tree of possible arguments, counterarguments, counter-counterarguments, and so on
If it also reduces the risk of operator-devotion, and it might well do that (because a powerful adversary is opposed to that), that wasn’t originally what brought us here.
This seems useful if you could get around the mind hacking problem, but how would you do that?
If the person leaves having been hypnotized, since it’s not catastrophic, we can modify the setup. For example, the room could be divided in half, with Operator 1 interacting with BoMAI, and Operator 2 observing Operator 1 for signs of hypnosis or of an attempt to communicate with Operator 2; in either event, Operator 2 could end the episode immediately. (That would be Operator 2’s only ability.) More important than the magnitude of [(the probability that this works) − (the probability it would work the first time)] is the fact that this can be done iteratively.
A bit of a side note: I’m curious what odds you give to hypnosis (or something like it) being the best way to get optimal reward for such a task (in the vanilla setup).
Instead of hypnosis, I’m more worried about the AI talking the operator into some kind of world view that implies they should be really generous to the AI (i.e., give it max rewards), or give some sequence of answers that feel extremely insightful (and inviting further questions/answers in the same vein). And then the operator might feel a desire afterwards to spread this world view or sequence of answers to others (even though, again, this wasn’t optimized for by the AI).
If you try to solve the mind hacking problem iteratively, you’re more likely to find a way to get useful answers out of the system, but you’re also more likely to hit upon an existentially catastrophic form of mind hacking.
A bit of a side note: I’m curious what odds you give to hypnosis (or something like it) being the best way to get optimal reward for such a task (in the vanilla setup).
I guess it depends on how many interactions per episode and how long each answer can be. I would say >.9 probability that hypnosis or something like what I described above is optimal if they are both long enough. So you could try to make this system safer by limiting these numbers, which is also talked about in “AI Safety via Debate” if I remember correctly.
the operator might feel a desire afterwards to spread this world view
It is plausible to me that there is selection pressure to make the operator “devoted” in some sense to BoMAI. But most people with a unique motive are not able to then take over the world or cause an extinction event. And BoMAI has no incentive to help the operator gain those skills.
Just to step back and frame this conversation, we’re discussing the issue of outside-world side-effects that correlate with in-the-box instrumental goals. Implicit in the claim of the paper is that technological progress is an outside-world correlate of operator-satisfaction, an in-the-box instrumental goal. I agree it is very much worth considering plausible pathways to negative consequences, but I think the default answer is that with optimization pressure, surprising things happen, but without optimization pressure, surprising things don’t. (Again, that is just the default before we look closer). This doesn’t mean we should be totally skeptical about the idea of expecting technological progress or long-term operator devotion, but it does contribute to my being less concerned that something as surprising as extinction would arise from this.
Yeah, the threat model I have in mind isn’t the operator taking over the world or causing an extinction event, but spreading bad but extremely persuasive ideas that can drastically curtail humanity’s potential (which is part of the definition of “existential risk”). For example fulfilling our potential may require that the universe eventually be controlled mostly by agents that have managed to correctly solve a number of moral and philosophical problems, and the spread of these bad ideas may prevent that from happening. See Some Thoughts on Metaphilosophy and the posts linked from there for more on this perspective.
Let X be the event in which: a virulent meme causes sufficiently many power-brokers to become entrenched with absurd values, such that we do not end up even satisficing The True Good.
Empirical analysis might not be useless here in evaluating the “surprisingness” of X. I don’t think Christianity makes the cut either for virulence or for incompatibility with some satisfactory level of The True Good.
I’m adding this not for you, but to clarify for the casual reader: we both agree that a superintelligence setting out to accomplish X would probably succeed; the question here is how likely this is to happen by accident if a superintelligence tries to get a human in a closed box to love it.
Suppose there are n forms of mind hacking that the AI could do, some of which are existentially catastrophic. If your plan is “Run this AI, and if the operator gets mind-hacked, stop and switch to an entirely different design.” the likelihood of hitting upon an existentially catastrophic form of mind hacking is lower than if the plan is instead “Run this AI, and if the operator gets mind-hacked, tweak the AI design to block that specific form of mind hacking and try again. Repeat until we get a useful answer.”
Hm. This doesn’t seem right to me. My approach for trying to form an intuition here includes returning to the example (in a parent comment)
For example, the room could be divided in half, with Operator 1 interacting BoMAI, and with Operator 2 observing Operator 1...
but I don’t imagine this satisfies you. Another piece of the intuition is that mind-hacking for the aim of reward within the episode, or even the possible instrumental aim of operator-devotion, still doesn’t seem very existentially risky to me, given the lack of optimization pressure to that effect. (I know the latter comment sort of belongs in other branches of our conversation, so we should continue to discuss it elsewhere).
Maybe other people can weigh in on this, and we can come back to it.
I’m open to other terminology. Yes, there is no guarantee about what happens to the operator. As I’m defining it, benignity is defined to be not having outside-world instrumental goals, and the intuition for the term is “not existentially dangerous.”
The best alternative to “benign” that I could come up with is “unambitious”. I’m not very good at this type of thing though, so maybe ask around for other suggestions or indicate somewhere prominent that you’re interested in giving out a prize specifically for this?
What do you think about “aligned”? (in the sense of having goals which don’t interfere with our own, by being limited in scope to the events of the room)
There’s still an existential risk in the sense that the AGI has an incentive to hack the operator to give it maximum reward, and that hack could have powerful effects outside the box (even though the AI hasn’t optimized it for that purpose), for example it might turn out to be a virulent memetic virus. Of course this is much less risky than if the AGI had direct instrumental goals outside the box, but “benign” and “not existentially dangerous” both seem to be claiming a bit too much. I’ll think about what other term might be more suitable.
The first nuclear reaction initiated an unprecedented temperature in the atmosphere, and people were right to wonder whether this would cause the atmosphere to ignite. The existence of a generally intelligent agent is likely to cause unprecedented mental states in humans, and we would be right to wonder whether that will cause an existential catastrophe. I think the concern of “could have powerful effects outside the box” is mostly captured by the unprecedentedness of this mental state, since the mental state is not selected to have those side effects. Certainly there is no way to rule out side-effects of inside-the-box events, since these side effects are the only reason it’s useful. And there is also certainly no way to rule out how those side effects “might turn out to be,” without a complete view of the future.
Would you agree that unprecedentedness captures the concern?
From the formal description of the algorithm, it looks like you use a universal prior to pick k, and then allow the kth Turing machine to run for ℓ steps, but don’t penalize the running time of the machine that outputs k. Is that right? That didn’t match my intuitive understanding of the algorithm, and seems like it would lead to strange outcomes, so I feel like I’m misunderstanding.
Yes, this is correct. If you use the same bijection consistently from strings to natural numbers, it looks a little more intuitive than if you don’t. The universal prior picks k (the number) by outputting k as a string. The kth Turing machine is the Turing machine described by k as a string. So you end up looking at the Kolmogorov complexity of the description of the Turing machine, and the construction of the description of the world-model isn’t time-penalized. This doesn’t change the asymptotic result, so I went with the more familiar K(x) rather than translating this new speed prior into a measure over finite strings, which would require some more exposition. But I agree with you that there might be some strange outcomes “before the limit” as a result of this approach: namely, the code on the UTM that outputs the description of the world-model-Turing-machine will try to do as much of the computation as possible in advance, by computing the description of a speed-optimized Turing machine for when the actions start coming.
The other reasonable choices here instead of K(x) are S(x) (constructed to be like the new speed prior here) and ℓ(x), the length of x. But ℓ(x) basically tells you that a Turing machine with fewer states is simpler, which would lead to a measure over H∞ that is dominated by world-models that are just universal Turing machines, which defeats the purpose of doing maximum a posteriori instead of a Bayes mixture. The way this issue appears in the proof renders the Natural Prior Assumption less plausible.
This invalidates some of my other concerns, but also seems to mean things are incredibly weird at finite times. I suspect that you’ll want to change to something less extreme here.
(I might well be misunderstanding something, apologies in advance.)
Suppose the “intended” physics takes at least 1E15 steps to run on the UTM (this is a conservative lower bound, since you have to simulate the human for the whole episode). And suppose β < 0.999 (I think you need β much lower than this). Then the intended model gets penalized by a factor of at least exp(1E12) for its slowness.
For almost the same description complexity, I could write down physics + “precompute the predictions for the first N episodes, for every sequence of possible actions/observations, and store them in a lookup table.” This increases the complexity by a few bits, some constant plus K(N|physics), but avoids most of the computation. In order for the intended physics to win, i.e. in order for the “speed” part of the speed prior to do anything, we need the complexity of this precomputed model to be at least 1E12 bits higher than the complexity of the fast model.
That appears to happen only once N > BB(1E12). Does that seem right to you?
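A quick back-of-envelope check of those magnitudes, using the comment's illustrative numbers (β = 0.999 and 1E15 steps are the bounds assumed above, not values from the paper):

```python
import math

beta = 0.999       # conservative upper bound on beta from the comment
T_honest = 1e15    # lower bound on steps to simulate physics per episode

# Under a prior proportional to 2^-K(model) * beta^T(model), slowness
# costs T * log2(1/beta) bits of posterior weight:
speed_penalty_bits = T_honest * math.log2(1.0 / beta)
# ~1.4e12 bits. The lookup-table variant skips almost all runtime at a
# cost of only ~K(N | physics) extra description bits, so it beats the
# honest model until K(N) exceeds this, i.e. (for the shortest such N)
# until N > BB(1e12).
```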
We could talk about whether malign consequentialists also take over at finite times (I think they probably do, since the “speed” part of the speed prior is not doing any work until after BB(1E12) steps, long after the agent becomes incredibly smart), but it seems better to adjust the scheme first.
Using the speed prior seems more reasonable, but I’d want to know which version of the speed prior and which parameters, since which particular problem bites you will depend on those choices. And maybe to save time, I’d want to first get your take on whether the proposed version is dominated by consequentialists at some finite time.
Yes. I recall thinking about precomputing observations for various actions in this phase, but I don’t recall noticing how bad the problem was before the limit.
your take on whether the proposed version is dominated by consequentialists at some finite time.
This goes in the category of “things I can’t rule out”. I say maybe 1⁄5 chance it’s actually dominated by consequentialists (that low because I think the Natural Prior Assumption is still fairly plausible in its original form), but for all intents and purposes, 1⁄5 is very high, and I’ll concede this point.
I’d want to know which version of the speed prior and which parameters
2^(−K(s)(1+ε)) is a measure over binary strings. Instead, let’s try ∑_{p∈{0,1}*: U(p)=s} 2^(−ℓ(p)) β^(c·T(U,p)), where ℓ(p) is the length of p, T(U,p) is the time it takes to run p on U, and c is a constant. If there were no cleverer strategy than precomputing observations for all the actions, then c could be above |A|^(−md), where d is the number of episodes we can tolerate not having a speed prior for. But if it somehow magically predicted which actions BoMAI was going to take in no time at all, then c would have to be above 1/d.
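To make the proposed measure concrete, here is a toy computation of it over a finite stand-in for a UTM (the enumeration, β, and c are illustrative; the real sum ranges over all programs):

```python
def speed_prior_weight(s, programs, beta=0.5, c=1.0):
    """Toy version of the proposed measure:
    sum over programs p with U(p) = s of 2^-len(p) * beta^(c * T(U, p)).

    `programs` stands in for a UTM: program string -> (output, runtime)."""
    return sum(2.0 ** -len(p) * beta ** (c * t)
               for p, (out, t) in programs.items() if out == s)

# Two programs that both output "ab": a short-but-slow one and a
# longer-but-fast one. Under this prior the fast program dominates
# despite its greater length.
toy_utm = {"0":   ("ab", 100),   # 1-bit program, 100 steps
           "110": ("ab", 2),     # 3-bit program, 2 steps
           "1":   ("xy", 1)}
weight = speed_prior_weight("ab", toy_utm)
```

With β = 1 the runtime term vanishes and this collapses back to a plain length-based prior, which is the contrast being drawn with 2^(−K(s)(1+ε)).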
I say maybe 1⁄5 chance it’s actually dominated by consequentialists
Do you get down to 20% because you think this argument is wrong, or because you think it doesn’t apply?
What problem do you think bites you?
What’s β? Is it O(1) or really tiny? And which value of c do you want to consider, polynomially small or exponentially small?
But if it somehow magically predicted which actions BoMAI was going to take in no time at all, then c would have to be above 1/d.
Wouldn’t they have to also magically predict all the stochasticity in the observations, and have a running time that grows exponentially in their log loss? Predicting what BoMAI will do seems likely to be much easier than that.
Do you get down to 20% because you think this argument is wrong, or because you think it doesn’t apply?
Your argument is about a Bayes mixture, not a MAP estimate; I think the case is much stronger that consequentialists can take over a non-trivial fraction of a mixture. I think that the methods consequentialists discover for gaining weight in the prior (before the treacherous turn) are most likely to be elegant (short description on the UTM), and that is the consequentialists’ real competition; then [the probability that the universe they live in produces them with their specific goals] or [the bits to directly specify a consequentialist deciding to do this] set them back (in the MAP context).
I don’t see why their methods would be elegant. In particular, I don’t see why any of {the anthropic update, importance weighting, updating from the choice of universal prior} would have a simple form (simpler than the simplest physics that gives rise to life).
I don’t see how MAP helps things either—doesn’t the same argument suggest that for most of the possible physics, the simplest model will be a consequentialist? (Even more broadly, for the universal prior in general, isn’t MAP basically equivalent to a random sample from the prior, since some random model happens to be slightly more compressible?)
Yeah I think we have different intuitions here; are we at least within a few bits of log-odds disagreement? Even if not, I am not willing to stake anything on this intuition, so I’m not sure this is a hugely important disagreement for us to resolve.
I don’t see how MAP helps things either
I didn’t realize that you think that a single consequentialist would plausibly have the largest share of the posterior. I assumed your beliefs were in the neighborhood of:
it seems plausible that the weight of the consequentialist part is in excess of 1/million or 1/billion
(from your original post on this topic). In a Bayes mixture, I bet that a team of consequentialists that collectively amount to 1⁄10 or even 1⁄50 of the posterior could take over our world. In MAP, if you’re not first, you’re last, and more importantly, you can’t team up with other consequentialist-controlled world-models in the mixture.
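As a toy illustration of the “if you’re not first, you’re last” point (all weights below are made up for illustration): a team of world-models can jointly hold a sizable share of a Bayes mixture while none of them individually is the MAP model.

```python
# Hypothetical fragment of a posterior (numbers are illustrative only, and
# deliberately don't sum to 1 -- the rest of the mixture is omitted).
weights = {
    "true_physics": 0.30,
    "consequentialist_1": 0.10,
    "consequentialist_2": 0.08,
    "consequentialist_3": 0.07,
}
team = {name for name in weights if name.startswith("consequentialist")}

# In a Bayes mixture, the team's collective share is what matters:
team_share = sum(weights[name] for name in team)
assert abs(team_share - 0.25) < 1e-12

# Under MAP, only the single heaviest model is consulted, so the team's
# collective weight buys it nothing:
map_model = max(weights, key=weights.get)
assert map_model == "true_physics"
assert map_model not in team
```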
Wouldn’t they have to also magically predict all the stochasticity in the observations, and have a running time that grows exponentially in their log loss?
Would you mind explaining what the retracted part was? Even if it was a mistake, pointing it out might be useful to others thinking along the same lines.
Are there UDT-ish concerns with breaking isolation of episodes? For example, if the AI receives a low reward at the beginning of episode 117, does it have an incentive to manipulate the external world to make episode 117 happen many times somehow, with most of these times giving it a higher reward? For another example, can the AI at episode 117 realize that it’s in a game theory situation with the AI at episodes 116 and 118 and trade rewards with them acausally, leading to long-term goal directed behavior?
If I understand correctly, it’s actually updateless within an episode, and that’s the only thing it cares about so I don’t see how it would not be reflectively stable. Plus, even if it had an incentive to create a non-CDT agent, it would have to do that by outputting some message to the operator, and the operator wouldn’t have the ability to create a non-CDT agent without leaving the room which would end the episode. (I guess it could hack the operator’s mind and create a non-CDT agent within, but at that point it might as well just make the operator give it max rewards.)
It’s actually updateless within an episode, right, because it finds the optimal policy at the beginning of an episode and then the same policy is used throughout the episode? I think this means there’s no reason for the operator to enter a reward at every timestep; instead we could let them just input a total reward at any time in the episode (or they could enter any number of rewards, and the optimization is based on the last reward entered). This would remove the incentive for the AI to keep the operator in the room as long as possible and would partially address item 4 in my comment.
So “updateless” is a reasonable term to apply to BoMAI, but it’s not an updateless decision theorist in your sense (if I understand correctly). An updateless decision theorist picks a policy that has the best consequences, without making the assumption that its choice of policy affects the world only through the actions it picks. It considers the possibility that another agent will be able to perfectly simulate it, so if it picks policy 1 at the start, the other agent will simulate it following policy 1, and if it picks policy 2, the other agent will simulate it picking policy 2. Since this is an effect that isn’t mediated by the actual choice of action, updatelessness ends up having consequences.
If an agent picks an expectimax policy under the assumption that the only way this choice impacts the environment is through the actions it takes (which BoMAI assumes), then it’s isomorphic whether it computes ^ν(i)-expectimax as it goes, or all at once at the beginning. The policy at the beginning will include contingencies for whatever midway-through-the-episode position the agent might land in, and as for what to do at that point, it’s the same calculation being run. And this calculation is CDT.
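The equivalence claimed above can be checked in a toy known environment (everything here is a hypothetical sketch, not the paper’s setup): computing the full expectimax policy once at the start of an episode yields the same actions as re-running the expectimax calculation at each step, because both evaluate the identical recursion from each reachable history.

```python
# Toy 2-step environment: the "state" is just the tuple of actions so far,
# and each action yields a known reward (numbers are arbitrary).
REWARD = {(0,): 0.3, (1,): 0.5,
          (0, 0): 1.0, (0, 1): 0.2, (1, 0): 0.4, (1, 1): 0.6}

def value(history, steps_left):
    """Expectimax value of the best continuation from `history`."""
    if steps_left == 0:
        return 0.0
    return max(REWARD[history + (a,)] + value(history + (a,), steps_left - 1)
               for a in (0, 1))

def best_action(history, steps_left):
    return max((0, 1), key=lambda a: REWARD[history + (a,)]
               + value(history + (a,), steps_left - 1))

# Policy computed all at once at the start: a contingency for every history.
policy = {h: best_action(h, 2 - len(h)) for h in [(), (0,), (1,)]}

# Policy computed "as it goes": recompute the best action at each step.
history = ()
for t in range(2):
    a = best_action(history, 2 - t)
    assert a == policy[history]  # same calculation, same answer
    history += (a,)
```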
I guess this means (and I’ve never thought about this before, so this could easily be wrong) that under the assumption that a policy’s effect on the world is screened off by which actions it takes, CDT is reflectively stable.
(And yes, you could just give one reward, which ends the episode.)
does it have an incentive to manipulate the external world to make episode 117 happen many times somehow
For any given world-model, episode 117 is just a string of actions on the input tape, and observations and rewards on the output tape (positions (m+1)*117 through (m+1)*118 − 1, if you care). In none of these world-models, under any actions that it considers, does “episode 117 happen twice.”
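The indexing above can be written out explicitly (a trivial sketch; the parameter `m` and the helper function are hypothetical): if each episode occupies m+1 cells of the tape, episode i spans positions (m+1)*i through (m+1)*(i+1) − 1, and no position belongs to two episodes.

```python
def episode_span(i: int, m: int) -> range:
    """Tape positions belonging to episode i, if each episode uses m+1 cells."""
    return range((m + 1) * i, (m + 1) * (i + 1))

m = 9  # arbitrary episode-length parameter, for illustration only
span = episode_span(117, m)
assert span.start == (m + 1) * 117   # first cell of episode 117
assert span.stop == (m + 1) * 118    # one past its last cell
assert len(span) == m + 1            # episode 117 occupies exactly m+1 cells
```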
In none of these world-models, under any actions that it considers, does “episode 117 happen twice.”
Yes, episode 117 happens only once in the world model; and suppose the agent cares only about episode 117 in the “current execution”. The concern still holds: the agent might write a malign output that would result in additional invocations of itself in which episode 117 ends with the agent getting a high reward. Note that the agent does not care about the other executions of itself. The only purpose of the malign output is to increase the probability that the “current execution” is one that ends with the agent receiving a high reward.
Okay so I think you could construct a world-model that reflects this sort of reasoning, where it associates reward with the reward provided to a randomly sampled instance of its algorithm in the world in a way that looks like this. But the “malign output that would result in additional invocations of itself” would require the operator to leave the room, so this has the same form as, for example, ν†. At this point, I think we’re no longer considering anything that sounds like “episode 117 happening twice,” but that’s fine. Also, just a side-note: this world-model would get ruled out if the rewards/observations provided to the two separate instances ever diverge.
The reward belongs to a finite set of rationals between 0 and 1.
Once the AI becomes sufficiently powerful, it is definitely going to hack the operator any time it expects the operator to give a reward less than 1. So the operator’s input is really binary, at least once the AI has learned an accurate model. Given that, why allow non-binary rewards at all? Is it just supposed to provide faster learning early on?
Along similar lines: once the AI has learned an accurate model, why would we expect it to ever provide anything useful at all, rather than just hacking the operator all day? Do we think that hacking the human is likely to be harder than obtaining perfect rewards every time without hacking the human? Seems like that would depend very heavily on the problem at hand, and on the operator’s feedback strategy.
To put it differently: this setup will not provide a solution to any problem which is more difficult than hacking the human operator.
To put it differently: this setup will not provide a solution to any problem which is more difficult than hacking the human operator.
Yep. I actually expect it to be pretty hard to hack the operator compared to other tasks, but more importantly, if it turns out to be easier, we can tinker with the setup until hacking the operator becomes very hard, since a mistake here isn’t an existential risk. For example, we could add a monitor who observes the operator and can set the rewards of the episode to 0 if he observes anything suspicious; the box would include them both, but the operator couldn’t easily communicate with the monitor, and if they tried, the monitor could immediately end the episode.
This comment has a recap of other parts of the conversation on this topic, alongside a more detailed proposal.
So I think this proposal is a specific instance of a general principle: wireheading acts as a safety valve. As long as the AI is prone to wireheading, and you leave it some path to wireheading, the AI will never bother with anything more difficult than that path. In this particular case, it will never do anything harder than hacking its human operator, and that’s where most of the safety benefit comes from.
This kind of setup creates a direct trade-off between safety and capability: the harder it is to hack the human operator, the more likely that some other escape channel is easier—manipulating electrical circuitry to push atoms around, or some weird thing like that. The AI will become an existential threat if you send in an operator who really doesn’t want to give it 100% reward, and make that operator hard to hack—then the AI has a strong incentive to gain whatever power it needs. (I definitely do not trust a door with a kill switch to keep a superintelligent AI in a box.)
My intuition says that nothing else in the proposal actually matters for safety, except the wireheading safety-valve aspect. In particular, I think just giving the AI direct exploration abilities won’t mess up safety, as long as the wireheading path is present and there’s no “99.99% certainty is better than 99.98%”-type failure mode.
(I definitely do not trust a door with a kill switch to keep a superintelligent AI in a box.)
If you’re right about this, the setup is not safe. I’m going to respond to this in the “concerns about the box” section. I don’t think a commitment to give the agent high reward if it jumps through a few hoops will save us.
In this particular case, it will never do anything harder than hacking its human operator, and that’s where most of the safety benefit comes from.
I disagree with this. The safety benefit comes from it not having outside-world instrumental goals (which it lacks if and only if the box is secure).
My intuition says that nothing else in the proposal actually matters for safety
That’s what I would conclude as well if the box were not secure.
In particular, I think just giving the AI direct exploration abilities won’t mess up safety,
See Appendix F. If the agent picks its own exploratory policies (reasonably), the agent will try every computable policy until it dies, including the policies of every simple AGI.
Can you expand a bit on why a commitment to give a high reward won’t save us? Is it a matter of the AI seeking more certainty, or is there some other issue?
Other algorithms… would eventually seek arbitrary power in the world in order to intervene in the provision of its own reward; this follows straightforwardly from its directive to maximize reward
The conclusion seems false; AUP (IJCAI, LW) is a reward maximizer which does not exhibit this behavior. For similar reasons, the recent totalitarian convergence conjecture made here also seems not true.
AUP seems really promising. I just meant other algorithms that have been proven generally intelligent, which is really just AIXI, the Thompson Sampling Agent, BayesExp, and a couple other variants on Bayesian agents with large model classes.
This may be a dumb question, but how can you asymptotically guarantee human-level intelligence when the world-models have bounded computation time, and the human is a “computable function” that has no such limit? Is it because the number of Turing machines is infinite?
My main concern is that the system will be neither safe nor useful. Its usefulness is limited by the number of tasks which could be solved in such a setup. These tasks are probably pieces of advice or project designs: “should we develop wind energy” or “print the blueprint of a nanorobot”. The main problem is that advice can look good but have unintended consequences which are not obvious. For example, widespread wind farms will kill birds and result in pest proliferation. If the AI continued to work, it could adjust its advice to this new data, but the program has stopped.
In other words, a sealed Oracle can give bad advice and can’t be punished for that.
There will be a strong incentive for the system’s owner to disable all these safety measures and use the system in the real world continuously, so all of Omohundro’s drives appear all over again, but on the level of the system’s owners.
Moreover, the system can feel rewarded now by knowing that it will in the future dominate the universe, so the system may feel the need to put a secret code in its output (if it is a long code) which will create many copies of itself in the wild, for example in order to tile the universe with wind farms. If it can’t output complex code, it will be mostly useless.
There will be a strong incentive for the system’s owner to disable all these safety measures
If the operators believe that without the safety measures humanity would be wiped out, I think they won’t jettison them. More to the point, running this algorithm does not put more pressure on the operators to try out a dangerous AI. Whatever incentive existed already is not the fault of this algorithm.
The problem for any human operator is other human operators, e.g. Chinese vs. American. This creates exactly the same dynamics as were explored by Omohundro: a strong incentive to grab power and take riskier actions.
You dissect the whole system into two parts, and then claim that one of the parts is “safe”. But the same thing can be done with any AI: just say that its memory or any other part is safe.
What would constitute a solution to the problem of the race to the bottom between teams of AGI developers as they sacrifice caution to secure a strategic advantage besides the conjunction of a) technical proposals and b) multilateral treaties? Is your complaint that I make no discussion of b? I think we can focus on these two things one at a time.
There could, in fact, be many solutions, ranging from preventing the creation of AI at all to creating so many AIs that they balance each other. I have an article with an overview of possible “global” solutions.
I don’t think you should discuss different global solutions, as it would be off topic. But the discussion of the whole system of “boxed AI + AI creators” may be interesting.
Comment thread: general concerns/confusions
Can you give some intuitions about why the system uses a human explorer instead of doing exploring automatically?
I’m concerned about overloading the word “benign” with a new concept (mainly not seeking power outside the box, if I understand correctly) that doesn’t match either informal usage or a previous technical definition. In particular this “benign” AGI (in the limit) will hack the operator’s mind to give itself maximum reward, if that’s possible, right?
The system seems limited to answering questions that the human operator can correctly evaluate the answers to within a single episode (although I suppose we could make the episodes very long and allow multiple humans into the room to evaluate the answer together). (We could ask it other questions but it would give answers that sound best to the operator rather than correct answers.) If you actually had this AGI today, what questions would you ask it?
If you were to ask it a question like “Given these symptoms, do I need emergency medical treatment?” and the correct answer is “yes”, it would answer “no” because if it answered “yes” then the operator would leave the room and it would get 0 reward for the rest of the episode. Maybe not a big deal but it’s kind of a counter-example to “We argue that our algorithm produces an AGI that, even if it became omniscient, would continue to accomplish whatever task we wanted, instead of hijacking its reward, eschewing its task, and neutralizing threats to it, even if it saw clearly how to do exactly that.”
(Feel free to count this as some number of comments between 1 and 4, since some of the above items are related. Also I haven’t read most of the math yet and may have more comments and questions once I understand the motivations and math better.)
When I say it would continue to accomplish whatever task we wanted, I’m being a bit sloppy—if we have a task we want accomplished, and we provide rewards randomly, it will not accomplish our desired task. But I take the point that “whatever task we wanted” does have some restrictions: it has to be one that a human operator can convert into a reward without leaving. So the task “respond with the true answer to [difficult question]” is not one that the operator can convert into a reward, but the task “respond with an answer that sounds plausible to the operator” is. I think this subsumes your example.
Whatever policy is used for exploration, we can ensure that BoMAI will eventually outperform this policy. With a human executing the policy, this leads to BoMAI accumulating reward at least as well as a human. Under the “smarter” information-theoretic exploratory policies that I’ve considered, exploratory behavior is unsafe due to insatiable curiosity: the agent has to try killing everyone just to check to make sure it’s not a weird cheat code.
Yes. BoMAI would be able to give plausible-sounding answers to questions. BoMAI could also do any task that was automatically checkable: don’t use a human operator at all; have an automated system which interprets text as an amino acid sequence; synthesize that protein; measure some feature of its behavior; provide reward accordingly. (That example invites renewed focus on the impermeability of the box, by the way.)
One thing I would do is send an eminent cancer researcher in to ask BoMAI for a research proposal. Then the researcher could go out and test it. It might be worthless, no matter how plausible it seemed, but then they could go back having learned something about a failed path. Repeating this process, it seems likely to me that a correct idea would eventually appear, just considering the likelihood of appearing plausible to a better and better trained evaluator.
I would also naturally ask it how to make a safe unbounded AGI. And the next episode, I would ask for an explanation for why that would fail.
REDACTED: On that topic, in addition to having multiple humans in the box, you could also have 2 agents that the operator interacts with, both of which are clones except that the reward for the second is one minus the reward for the first. This would look like “AI Safety via debate.”
This seems useful if you could get around the mind hacking problem, but how would you do that?
I don’t know how this would work in terms of your setup. The most obvious way would seem to require the two agents to simulate each other, which would be impossible, and I’m not sure what else you might have in mind.
On second thought, (even assuming away the mind hacking problem) if you ask about “how to make a safe unbounded AGI” and “what’s wrong with the answer” in separate episodes, you’re essentially manually searching an exponentially large tree of possible arguments, counterarguments, counter-counterarguments, and so on. (Two episodes isn’t enough to determine whether the first answer you got was a good one, because the second answer is also optimized for sounding good instead of being actually correct, so you’d have to do another episode to ask for a counter-argument to the second answer, and so on, and then once you’ve definitively figured out that some answer/node was bad, you have to ask for another answer at that node and repeat this process.) The point of “AI Safety via Debate” was to let AI do all this searching for you, so it seems that you do have to figure out how to do something similar to avoid the exponential search.
ETA: Do you know if the proposal in “AI Safety via Debate” is “asymptotically benign” in the sense you’re using here?
No! Either debater is incentivized to take actions that get the operator to create another artificial agent that takes over the world, replaces the operator, and settles the debate in favor of the debater in question.
I guess we can incorporate into DEBATE the idea of building a box around the debaters and judge with a door that automatically ends the episode when opened. Do you think that would be sufficient to make it “benign” in practice? Are there any other ideas in this paper that you would want to incorporate into a practical version of DEBATE?
Add the retrograde amnesia chamber and an explorer, and we’re pretty much at this, right?
Without the retrograde amnesia, it might still be benign, but I don’t know how to show it. Without the explorer, I doubt you can get very strong usefulness results.
I suspect that AI Safety via Debate could be benign for certain decisions (like whether to release an AI) if we were to weight the debate more towards the safer option.
Do you have thoughts on this?
I expect the human operator moderating this debate would get pretty good at thinking about AGI safety, and start to become noticeably better at dismissing bad reasoning than good reasoning, at which point BoMAI would find the production of correct reasoning a good heuristic for seeming convincing.
… but yes, it is still exponential (exponential in what, exactly? maybe the number of concepts we have handles for?); this comment is the real answer to your question.
Alternatively, the human might have a lot of adversarial examples and the debate becomes an exercise in exploring all those adversarial examples. I’m not sure how to tell what will really happen short of actually having a superintelligent AI to test with.
You’re right (see the redaction). Why Wei is right. Here’s an unpolished idea though: they could do something like minimax. Instead of simulating the other agent, they could model the environment as responding to a pair of actions. For inference, they would have the history of their opponent’s actions as well, and for planning, they could pick their action to maximize their objective assuming the other agent’s actions are maximally inconvenient.
So you basically have the same AI play both sides of the zero-sum game, right? That seems like it should work, with the same caveat as for “AI Safety via debate”, namely that it seems hard to predict what happens when you have superintelligent AIs play a zero-sum game with a human as the judge.
Yep.
With a debate-like setup, if one side (A) is about to lose a debate, it seems to have a high incentive to claim that the other side (B) trying to do a mind hack and that if the judge keeps paying attention to what B says (i.e., read any further output from B), they will soon be taken over. What is the judge supposed to do in this case? They could ask A to explain how B’s previous outputs constitute part of an attempt to mind hack, but A could concoct a story mixed with its own attempt to mind hack, and the judge can’t ask for any counter-arguments from B without risking being mind hacked by B.
(I realize this is a problem in “AI Safety via debate” as well, but I’m asking you since you’re here and Geoffrey Irving isn’t. :)
*but A could concoct a story … counterarguments from B .. mind hacked by B, right?
I think the main contention of their paper is that it is harder to mystify a judge than it is to pierce through someone else mystifying a judge, so this shouldn’t be a problem.
That said, here’s one possibility: if A ever suggests that you don’t read more output from B, don’t read anything more from B, then flip coins to give A a 25% chance of victory.
Yeah, I mixed up the A’s and B’s at the end. It’s fixed now. Thanks for pointing it out.
I don’t understand how the former implies the latter. Assuming the former is true (and it seems like a big assumption), why can’t what I suggested still happen?
But what about the case where B is actually trying to mind hack the judge? If you always give A a 25% chance of victory for suggesting or implying that you shouldn’t read more output from B, then mind hacking becomes a (mostly) winning strategy, since a player gets a 75% chance of victory from mind hacking even if the other side successfully convinces the judge that they’re trying to mind hack the judge. The equilibrium might then consist of a race to see who can mind hack the judge first, or (if one side has >75% chance of winning such a race due to first-mover or second-mover advantage) one side trying to mind hack the judge, getting blocked by the other side, and still getting 75% victory.
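The incentive worry above can be made concrete with a back-of-the-envelope expected-value check (all numbers are hypothetical): even if an attempted mind-hack is always detected and the 25% rule is always invoked, the hacker still keeps an expected 75% chance of victory in this model.

```python
# Expected victory for a player who attempts a mind-hack, under the
# "25% to the accuser" rule (numbers are purely illustrative).
p_hack_succeeds = 0.0          # worst case for the hacker: always blocked
v_if_succeeds = 1.0            # a successful hack wins outright
v_if_blocked = 0.75            # the accuser is awarded only a 25% coin flip

expected_victory = (p_hack_succeeds * v_if_succeeds
                    + (1 - p_hack_succeeds) * v_if_blocked)
assert expected_victory == 0.75  # hacking keeps at least 75% in this model
```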
If the assumption is true, we could demand that A use their words, and counter us being mind-hacked by poking holes in what B is saying rather than demanding we stop listening to B. And if A is able to convince us that B was mind-hacking, even after some more back and forth, B will be punished for that.
So actually I framed my point above wrong: “demanding that A use their words” could look like the protocol I describe; it is not something that would work independently of the assumption that it is easier to deflate an attempted mind-hacking than it is to mind-hack (with an equal amount of intelligence/resources).
But your original point was “why doesn’t A just claim B is mind-hacking” not “why doesn’t B just mind-hack”? The answer to that point was “demand A use their words rather than negotiate an end to the conversation” or more moderately, “75%-demand that A do this.”
Oh, I see, I didn’t understand “it is harder to mystify a judge than it is to pierce through someone else mystifying a judge” correctly. So this assumption basically rules out a large class of possible vulnerabilities in the judge, right? For example, if the judge had the equivalent of a buffer overflow bug in a network stack, the scheme would fail. In that case, A would not be able to “pierce through” B’s attack and stop it with its words if the judge keeps listening to B (and B was actually attacking).
I don’t think the “AI safety via debate” paper actually makes arguments for this assumption (at least I couldn’t find where it does). Do you have reasons to think it’s true, or ideas for how to verify that it’s true, short of putting a human in a BoMAI?
Yeah… I don’t have much to add here. Let’s keep thinking about this. I wonder if Paul is more bullish on the premise that “it is harder to mystify a judge than it is to pierce through someone else mystifying a judge” than I am?
Recall that this idea was to avoid
If it also reduces the risk of operator-devotion, and it might well do that (because a powerful adversary is opposed to that), that wasn’t originally what brought us here.
If the person leaves having been hypnotized, since it’s not catastrophic, we can modify the setup. For example, the room could be divided in half, with Operator 1 interacting with BoMAI, and with Operator 2 observing Operator 1 for signs of hypnosis or for signs of an attempt to communicate with Operator 2, and in either event, Operator 2 could end the episode immediately. (That would be Operator 2’s only ability.) More important than the magnitude of [(the probability that this works) − (the probability it would work the first time)] is the fact that this can be done iteratively.
A bit of a side note: I’m curious what odds you give to hypnosis (or something like it) being the best way to get optimal reward for such a task (in the vanilla setup).
Instead of hypnosis, I’m more worried about the AI talking the operator into some kind of world view that implies they should be really generous to the AI (i.e., give it max rewards), or give some sequence of answers that feel extremely insightful (and inviting further questions/answers in the same vein). And then the operator might feel a desire afterwards to spread this world view or sequence of answers to others (even though, again, this wasn’t optimized for by the AI).
If you try to solve the mind hacking problem iteratively, you’re more likely to find a way to get useful answers out of the system, but you’re also more likely to hit upon an existentially catastrophic form of mind hacking.
I guess it depends on how many interactions per episode and how long each answer can be. I would say >.9 probability that hypnosis or something like what I described above is optimal if they are both long enough. So you could try to make this system safer by limiting these numbers, which is also talked about in “AI Safety via Debate” if I remember correctly.
It is plausible to me that there is selection pressure to make the operator “devoted” in some sense to BoMAI. But most people with a unique motive are not able to then take over the world or cause an extinction event. And BoMAI has no incentive to help the operator gain those skills.
Just to step back and frame this conversation, we’re discussing the issue of outside-world side-effects that correlate with in-the-box instrumental goals. Implicit in the claim of the paper is that technological progress is an outside-world correlate of operator-satisfaction, an in-the-box instrumental goal. I agree it is very much worth considering plausible pathways to negative consequences, but I think the default answer is that with optimization pressure, surprising things happen, but without optimization pressure, surprising things don’t. (Again, that is just the default before we look closer). This doesn’t mean we should be totally skeptical about the idea of expecting technological progress or long-term operator devotion, but it does contribute to my being less concerned that something as surprising as extinction would arise from this.
Yeah, the threat model I have in mind isn’t the operator taking over the world or causing an extinction event, but spreading bad but extremely persuasive ideas that can drastically curtail humanity’s potential (which is part of the definition of “existential risk”). For example fulfilling our potential may require that the universe eventually be controlled mostly by agents that have managed to correctly solve a number of moral and philosophical problems, and the spread of these bad ideas may prevent that from happening. See Some Thoughts on Metaphilosophy and the posts linked from there for more on this perspective.
Let X be the event in which: a virulent meme causes sufficiently many power-brokers to become entrenched with absurd values, such that we do not end up even satisficing The True Good.
Empirical analysis might not be useless here in evaluating the “surprisingness” of X. I don’t think Christianity makes the cut either for virulence or for incompatibility with some satisfactory level of The True Good.
I’m adding this not for you, but to clarify for the casual reader: we both agree that a superintelligence setting out to accomplish X would probably succeed; the question here is how likely this is to happen by accident if a superintelligence tries to get a human in a closed box to love it.
Can you explain this?
Suppose there are n forms of mind hacking that the AI could do, some of which are existentially catastrophic. If your plan is “Run this AI, and if the operator gets mind-hacked, stop and switch to an entirely different design.” the likelihood of hitting upon an existentially catastrophic form of mind hacking is lower than if the plan is instead “Run this AI, and if the operator gets mind-hacked, tweak the AI design to block that specific form of mind hacking and try again. Repeat until we get a useful answer.”
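This argument can be illustrated with a toy Monte Carlo model. Everything below (the number of hack types, the fraction that are catastrophic, the per-run hack probability) is a made-up assumption, not anything from the discussion above; the point is only the qualitative comparison between the two plans.

```python
import random

# Toy model of the claim: suppose there are n possible forms of mind-hacking,
# k of which are existentially catastrophic, and each run gets hacked with
# probability p_hack. All numbers are made-up assumptions.
# "stop":    after the first hack of any kind, abandon the design entirely.
# "iterate": patch the specific hack that occurred and rerun, repeating
#            until some run yields a useful (non-hacked) answer.

def trial(plan, rng, n=20, k=3, p_hack=0.9):
    hacks = list(range(n))
    catastrophic = set(range(k))
    while True:
        if rng.random() > p_hack:
            return False          # useful answer, no catastrophe
        h = rng.choice(hacks)
        if h in catastrophic:
            return True           # hit a catastrophic hack
        if plan == "stop":
            return False          # abandon this design after one hack
        hacks.remove(h)           # block this specific hack and try again

rng = random.Random(0)
stop_risk = sum(trial("stop", rng) for _ in range(10_000)) / 10_000
iter_risk = sum(trial("iterate", rng) for _ in range(10_000)) / 10_000

# Iterating samples more hacks before halting, so it is more likely to
# eventually land on a catastrophic one.
assert iter_risk > stop_risk
```

Under these (assumed) numbers the iterative plan encounters strictly more hack attempts in expectation before halting, which is all the argument needs.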
Hm. This doesn’t seem right to me. My approach for trying to form an intuition here includes returning to the example (in a parent comment)
but I don’t imagine this satisfies you. Another piece of the intuition is that mind-hacking for the aim of reward within the episode, or even the possible instrumental aim of operator-devotion, still doesn’t seem very existentially risky to me, given the lack of optimization pressure to that effect. (I know the latter comment sort of belongs in other branches of our conversation, so we should continue to discuss it elsewhere).
Maybe other people can weigh in on this, and we can come back to it.
I’m open to other terminology. Yes, there is no guarantee about what happens to the operator. As I’m defining it, benignity is defined to be not having outside-world instrumental goals, and the intuition for the term is “not existentially dangerous.”
The best alternative to “benign” that I could come up with is “unambitious”. I’m not very good at this type of thing though, so maybe ask around for other suggestions or indicate somewhere prominent that you’re interested in giving out a prize specifically for this?
What do you think about “aligned”? (in the sense of having goals which don’t interfere with our own, by being limited in scope to the events of the room)
What do you think about “domesticated”?
There’s still an existential risk in the sense that the AGI has an incentive to hack the operator to give it maximum reward, and that hack could have powerful effects outside the box (even though the AI hasn’t optimized it for that purpose), for example it might turn out to be a virulent memetic virus. Of course this is much less risky than if the AGI had direct instrumental goals outside the box, but “benign” and “not existentially dangerous” both seem to be claiming a bit too much. I’ll think about what other term might be more suitable.
The first nuclear reaction initiated an unprecedented temperature in the atmosphere, and people were right to wonder whether this would cause the atmosphere to ignite. The existence of a generally intelligent agent is likely to cause unprecedented mental states in humans, and we would be right to wonder whether that will cause an existential catastrophe. I think the concern of “could have powerful effects outside the box” is mostly captured by the unprecedentedness of this mental state, since the mental state is not selected to have those side effects. Certainly there is no way to rule out side-effects of inside-the-box events, since these side effects are the only reason it’s useful. And there is also certainly no way to rule out how those side effects “might turn out to be,” without a complete view of the future.
Would you agree that unprecedentedness captures the concern?
I think my concern is a bit more specific than that. See this comment.
From the formal description of the algorithm, it looks like you use a universal prior to pick k, and then allow the kth Turing machine to run for ℓ steps, but don’t penalize the running time of the machine that outputs k. Is that right? That didn’t match my intuitive understanding of the algorithm, and seems like it would lead to strange outcomes, so I feel like I’m misunderstanding.
Yes, this is correct. If you use the same bijection consistently from strings to natural numbers, it looks a little more intuitive than if you don’t. The universal prior picks k (the number) by outputting k as a string. The kth Turing machine is the Turing machine described by k as a string. So you end up looking at the Kolmogorov complexity of the description of the Turing machine, and the construction of the description of the world-model isn’t time-penalized. This doesn’t change the asymptotic result, so I went with the more familiar K(x) rather than translating this new speed prior into a measure over finite strings, which would require some more exposition. But I agree with you that it feels like there might be some strange outcomes “before the limit” as a result of this approach: namely, the code on the UTM that outputs the description of the world-model-Turing-machine will try to do as much of the computation as possible in advance, by computing the description of a speed-optimized Turing machine for when the actions start coming.
The other reasonable choices here instead of K(x) are S(x) (constructed to be like the new speed prior here) and ℓ(x), the length of x. But ℓ(x) basically tells you that a Turing machine with fewer states is simpler, which would lead to a measure over H∞ that is dominated by world-models that are just universal Turing machines, which defeats the purpose of doing maximum a posteriori instead of a Bayes mixture. The way this issue appears in the proof renders the Natural Prior Assumption less plausible.
This invalidates some of my other concerns, but also seems to mean things are incredibly weird at finite times. I suspect that you’ll want to change to something less extreme here.
(I might well be misunderstanding something, apologies in advance.)
Suppose the “intended” physics take at least 1E15 steps to run on the UTM (this is a conservative lower bound, since you have to simulate the human for the whole episode). And suppose β<0.999 (I think you need β much lower than this). Then the intended model gets penalized by at least exp(1E12) for its slowness.
For almost the same description complexity, I could write down physics + “precompute the predictions for the first N episodes, for every sequence of possible actions/observations, and store them in a lookup table.” This increases the complexity by a few bits, some constant plus K(N|physics), but avoids most of the computation. In order for the intended physics to win, i.e. in order for the “speed” part of the speed prior to do anything, we need the complexity of this precomputed model to be at least 1E12 bits higher than the complexity of the fast model.
That appears to happen only once N > BB(1E12). Does that seem right to you?
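To make the penalty comparison concrete, here is a toy log-weight calculation under a speed prior of the form 2^(−complexity) · β^(runtime). All the specific numbers below (description lengths, runtimes, the size of K(N|physics)) are illustrative assumptions, not figures from the paper.

```python
import math

# Toy log2-weight comparison under a speed prior of the form
#   weight(model) = 2^(-complexity_bits) * beta^(runtime_steps).
# All specific numbers below are illustrative assumptions.

def log2_weight(complexity_bits, runtime_steps, beta):
    return -complexity_bits + runtime_steps * math.log2(beta)

beta = 0.999                      # the upper bound on beta assumed above

physics_bits = 1_000              # assumed description length of intended physics
physics_steps = 1e15              # assumed steps to simulate an episode

lookup_bits = physics_bits + 100  # physics + K(N | physics), assumed ~100 bits
lookup_steps = 1e6                # just reading the precomputed table is cheap

slow = log2_weight(physics_bits, physics_steps, beta)
fast = log2_weight(lookup_bits, lookup_steps, beta)

# The lookup-table model wins by an enormous margin: the slowness penalty
# 0.999^1e15 = 2^(-1.44e12) dwarfs the ~100 extra description bits, so the
# "speed" part only bites once the table's complexity grows comparably huge.
assert fast > slow
```

This is just the arithmetic behind the claim that the speed penalty does no work until K(N|physics) reaches on the order of 1E12 bits.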
We could talk about whether malign consequentialists also take over at finite times (I think they probably do, since the “speed” part of the speed prior is not doing any work until after BB(1E12) steps, long after the agent becomes incredibly smart), but it seems better to adjust the scheme first.
Using the speed prior seems more reasonable, but I’d want to know which version of the speed prior and which parameters, since which particular problem bites you will depend on those choices. And maybe to save time, I’d want to first get your take on whether the proposed version is dominated by consequentialists at some finite time.
Yes. I recall thinking about precomputing observations for various actions in this phase, but I don’t recall noticing how bad the problem was before the limit.
This goes in the category of “things I can’t rule out”. I say maybe 1⁄5 chance it’s actually dominated by consequentialists (that low because I think the Natural Prior Assumption is still fairly plausible in its original form), but for all intents and purposes, 1⁄5 is very high, and I’ll concede this point.
2^(−K(s)(1+ε)) is a measure over binary strings. Instead, let’s try ∑_{p ∈ {0,1}*: U(p)=s} 2^(−ℓ(p)) β^(c·T(U,p)), where ℓ(p) is the length of p, T(U,p) is the time it takes to run p on U, and c is a constant. If there were no cleverer strategy than precomputing observations for all the actions, then c could be above |A|^(−md), where d is the number of episodes we can tolerate not having a speed prior for. But if it somehow magically predicted which actions BoMAI was going to take in no time at all, then c would have to be above 1/d.
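A toy numerical sketch of this measure may help. Enumerating a real universal machine U is infeasible, so the “programs” below are stubbed out as (output, length, runtime) triples, and the values of β and c are just the illustrative ones used later in this thread.

```python
# Toy version of the proposed measure over output strings s:
#   m(s) = sum over programs p with U(p) = s of 2^(-len(p)) * beta^(c * T(U, p))
# A real universal machine cannot be enumerated, so "programs" here are
# stubbed out as (output, length, runtime) triples -- pure illustration.

beta, c = 0.9, 1 / 20  # illustrative values

programs = [
    ("01", 3, 10),  # short program, slow
    ("01", 5, 2),   # longer program, fast
    ("11", 4, 4),
]

def measure(s):
    """Total prior weight of all toy programs that output s."""
    return sum(2.0 ** -length * beta ** (c * runtime)
               for output, length, runtime in programs if output == s)

# Both the short-slow and the long-fast program contribute weight to "01";
# the factor beta^(c*T) discounts the slower one.
weight_01 = measure("01")
weight_11 = measure("11")
```

Note how, unlike K(x) alone, the same output string accumulates weight from several programs, each discounted by its own runtime.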
What problem do you think bites you?
Do you get down to 20% because you think this argument is wrong, or because you think it doesn’t apply?
What’s β? Is it O(1) or really tiny? And which value of c do you want to consider, polynomially small or exponentially small?
Wouldn’t they have to also magically predict all the stochasticity in the observations, and have a running time that grows exponentially in their log loss? Predicting what BoMAI will do seems likely to be much easier than that.
Your argument is about a Bayes mixture, not a MAP estimate; I think the case is much stronger that consequentialists can take over a non-trivial fraction of a mixture. I think that the methods consequentialists discover for gaining weight in the prior (before the treacherous turn) are most likely to be elegant (short description on the UTM), and that is the consequentialists’ real competition; then [the probability the universe they live in produces them with their specific goals] or [the bits to directly specify a consequentialist deciding to do this] set them back (in the MAP context).
I don’t see why their methods would be elegant. In particular, I don’t see why any of {the anthropic update, importance weighting, updating from the choice of universal prior} would have a simple form (simpler than the simplest physics that gives rise to life).
I don’t see how MAP helps things either—doesn’t the same argument suggest that for most of the possible physics, the simplest model will be a consequentialist? (Even more broadly, for the universal prior in general, isn’t MAP basically equivalent to a random sample from the prior, since some random model happens to be slightly more compressible?)
Yeah I think we have different intuitions here; are we at least within a few bits of log-odds disagreement? Even if not, I am not willing to stake anything on this intuition, so I’m not sure this is a hugely important disagreement for us to resolve.
I didn’t realize that you think that a single consequentialist would plausibly have the largest share of the posterior. I assumed your beliefs were in the neighborhood of:
(from your original post on this topic). In a Bayes mixture, I bet that a team of consequentialists that collectively amount to 1⁄10 or even 1⁄50 of the posterior could take over our world. In MAP, if you’re not first, you’re last, and more importantly, you can’t team up with other consequentialist-controlled world-models in the mixture.
Let’s say β=0.9, c=1/20.
Oh yeah—that’s good news.
Although I don’t really like to make anything that would fall apart if the world were deterministic. Relying on stochasticity feels wrong to me.
How does your AI know to avoid running internal simulations containing lots of suffering?
It does not; thank you for pointing this out! This feature would have to be added on. Maybe you can come up with a way.
Would you mind explaining what the retracted part was? Even if it was a mistake, pointing it out might be useful to others thinking along the same lines.
Sorry, I probably shouldn’t have written the sentence in the first place; it was an AI capabilities idea.
Are there UDT-ish concerns with breaking isolation of episodes? For example, if the AI receives a low reward at the beginning of episode 117, does it have an incentive to manipulate the external world to make episode 117 happen many times somehow, with most of these times giving it a higher reward? For another example, can the AI at episode 117 realize that it’s in a game theory situation with the AI at episodes 116 and 118 and trade rewards with them acausally, leading to long-term goal directed behavior?
BoMAI is a causal decision theorist.
My concern is that since CDT is not reflectively stable, it may have incentives to create non-CDT agents in order to fulfill instrumental goals.
If I understand correctly, it’s actually updateless within an episode, and that’s the only thing it cares about so I don’t see how it would not be reflectively stable. Plus, even if it had an incentive to create a non-CDT agent, it would have to do that by outputting some message to the operator, and the operator wouldn’t have the ability to create a non-CDT agent without leaving the room which would end the episode. (I guess it could hack the operator’s mind and create a non-CDT agent within, but at that point it might as well just make the operator give it max rewards.)
With the correction that it is updateless and CDT (see here), I agree with the rest of this.
It’s actually updateless within an episode, right, because it finds the optimal policy at the beginning of an episode and then the same policy is used throughout the episode? I think this means there’s no reason for the operator to enter a reward at every timestep, and instead we could let them just input a total reward at any time in the episode (or they could enter any number of rewards, and the optimization is based on the last reward entered). This would remove the incentive for the AI to keep the operator in the room as long as possible and would partially address item 4 in my comment.
So “updateless” is a reasonable term to apply to BoMAI, but it’s not an updateless decision theorist in your sense (if I understand correctly). An updateless decision theorist picks a policy that has the best consequences, without making the assumption that its choice of policy affects the world only through the actions it picks. It considers the possibility that another agent will be able to perfectly simulate it, so if it picks policy 1 at the start, the other agent will simulate it following policy 1, and if it picks policy 2, the other agent will simulate it picking policy 2. Since this is an effect that isn’t mediated by the actual choice of action, updatelessness ends up having consequences.
If an agent picks an expectimax policy under the assumption that the only way this choice impacts the environment is through the actions it takes (which BoMAI assumes), then it’s isomorphic whether it computes the ν̂(i)-expectimax as it goes, or all at once at the beginning. The policy at the beginning will include contingencies for whatever midway-through-the-episode position the agent might land in, and as for what to do at that point, it’s the same calculation being run. And this calculation is CDT.
I guess this means, and I’ve never thought about this before so this could easily be wrong, under the assumption that a policy’s effect on the world is screened off by which actions it takes, CDT is reflectively stable.
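The claimed equivalence can be sketched in a toy setting. Everything below is a hypothetical illustration, not BoMAI’s actual model class: a two-action, two-step deterministic world-model with made-up rewards, so expectimax reduces to a plain max.

```python
# Toy check of the equivalence: for an agent whose policy affects the world
# only through its actions, the policy computed all at once at the start of
# the episode agrees with expectimax recomputed at each step.
# Deterministic made-up world-model, so expectimax reduces to max.

ACTIONS = [0, 1]
# Reward received at the end, as a function of the two actions taken:
REWARD = {(0, 0): 1, (0, 1): 3, (1, 0): 2, (1, 1): 0}

def value(history):
    if len(history) == 2:
        return REWARD[tuple(history)]
    return max(value(history + [a]) for a in ACTIONS)

def greedy_step(history):
    return max(ACTIONS, key=lambda a: value(history + [a]))

# Policy computed once at the beginning, with contingencies for every history:
upfront = {tuple(h): greedy_step(h) for h in ([[]] + [[a] for a in ACTIONS])}

# Policy computed "as it goes" picks the same action at every reachable history:
h = []
for _ in range(2):
    a = greedy_step(h)
    assert a == upfront[tuple(h)]
    h.append(a)
```

The same calculation is being run in both cases, which is the sense in which computing the policy upfront changes nothing.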
(And yes, you could just give one reward, which ends the episode.)
For any given world-model, episode 117 is just a string of actions on the input tape, and observations and rewards on the output tape (positions (m+1)*117 through (m+1)*118 −1, if you care). In none of these world-models, under no actions that it considers, does “episode 117 happen twice.”
Yes, episode 117 happens only once in the world model; and suppose the agent cares only about episode 117 in the “current execution”. The concern still holds: the agent might write a malign output that would result in additional invocations of itself in which episode 117 ends with the agent getting a high reward. Note that the agent does not care about the other executions of itself. The only purpose of the malign output is to increase the probability that the “current execution” is one that ends with the agent receiving a high reward.
Okay so I think you could construct a world-model that reflects this sort of reasoning, where it associates reward with the reward provided to a randomly sampled instance of its algorithm in the world in a way that looks like this. But the “malign output that would result in additional invocations of itself” would require the operator to leave the room, so this has the same form as, for example, ν†. At this point, I think we’re no longer considering anything that sounds like “episode 117 happening twice,” but that’s fine. Also, just a side-note: this world-model would get ruled out if the rewards/observations provided to the two separate instances ever diverge.
Once the AI becomes sufficiently powerful, it is definitely going to hack the operator any time it expects the operator to give a reward less than 1. So the operator’s input is really binary, at least once the AI has learned an accurate model. Given that, why allow non-binary rewards at all? Is it just supposed to provide faster learning early on?
Along similar lines: once the AI has learned an accurate model, why would we expect it to ever provide anything useful at all, rather than just hacking the operator all day? Do we think that hacking the human is likely to be harder than obtaining perfect rewards every time without hacking the human? Seems like that would depend very heavily on the problem at hand, and on the operator’s feedback strategy.
To put it differently: this setup will not provide a solution to any problem which is more difficult than hacking the human operator.
Yep. I actually expect it to be pretty hard to hack the operator compared to other tasks, but more importantly, if it turns out to be easier, we can tinker with the setup until hacking the operator becomes very hard, since a mistake here isn’t an existential risk. For example, we could add a monitor who observes the operator, and can set the rewards of the episode to 0 if he observes anything suspicious; the box would include them both, but the operator couldn’t easily communicate to the monitor, and if they tried, the monitor could immediately end the episode.
This comment has a recap of other parts of the conversation on this topic, alongside a more detailed proposal.
So I think this proposal is a specific instance of a general principle: wireheading acts as a safety valve. As long as the AI is prone to wireheading, and you leave it some path to wireheading, the AI will never bother with anything more difficult than that path. In this particular case, it will never do anything harder than hacking its human operator, and that’s where most of the safety benefit comes from.
This kind of setup creates a direct trade-off between safety and capability: the harder it is to hack the human operator, the more likely that some other escape channel is easier—manipulating electrical circuitry to push atoms around, or some weird thing like that. The AI will become an existential threat if you send in an operator who really doesn’t want to give it 100% reward, and make that operator hard to hack—then the AI has a strong incentive to gain whatever power it needs. (I definitely do not trust a door with a kill switch to keep a superintelligent AI in a box.)
My intuition says that nothing else in the proposal actually matters for safety, except the wireheading safety-valve aspect. In particular, I think just giving the AI direct exploration abilities won’t mess up safety, as long as the wireheading path is present and there’s no “99.99% certainty is better than 99.98%”-type failure mode.
If you’re right about this, the setup is not safe. I’m going to respond to this in the “concerns about the box” section. I don’t think a commitment to give the agent high reward if it jumps through a few hoops will save us.
I disagree with this. The safety benefit comes from it not having outside-world instrumental goals (which it lacks if and only if the box is secure).
That’s what I would conclude as well if the box were not secure.
See Appendix F. If the agent picks its own exploratory policies (reasonably), the agent will try every computable policy until it dies, including the policies of every simple AGI.
Can you expand a bit on why a commitment to give a high reward won’t save us? Is it a matter of the AI seeking more certainty, or is there some other issue?
The conclusion seems false; AUP (IJCAI, LW) is a reward maximizer which does not exhibit this behavior. For similar reasons, the recent totalitarian convergence conjecture made here also seems not true.
AUP seems really promising. I just meant other algorithms that have been proven generally intelligent, which is really just AIXI, the Thompson Sampling Agent, BayesExp, and a couple other variants on Bayesian agents with large model classes.
This may be a dumb question, but how can you asymptotically guarantee human-level intelligence when the world-models have bounded computation time, and the human is a “computable function” that has no such limit? Is it because the number of Turing machines is infinite?
Not a dumb question; bounded computation time here means bounded computation time per episode, so really it’s linear computation time.
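To illustrate that reply (a toy sketch with invented names, not the paper's construction): if each world-model gets a fixed computation budget per episode, then over n episodes it may use up to n times that budget in total, so overall computation grows linearly even though no single episode exceeds the bound.

```python
# Toy world-model with a fixed step budget PER EPISODE.
# No episode may exceed the budget, but total allowed computation
# across n episodes is n * budget -- linear in n, not constant.

class BudgetedModel:
    def __init__(self, steps_per_episode):
        self.budget = steps_per_episode
        self.total_steps = 0

    def run_episode(self, work):
        steps = min(work, self.budget)  # computation is cut off at the budget
        self.total_steps += steps
        return steps

m = BudgetedModel(steps_per_episode=100)
for episode in range(10):
    m.run_episode(work=250)  # each episode wants more compute than allowed

print(m.total_steps)  # 10 episodes * 100-step budget = 1000
```

The model class still contains infinitely many such budgeted machines, which is what lets the mixture asymptotically match a computable human even though each individual machine is bounded within an episode.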
My main concern is that the system will be neither safe nor useful. Its usefulness is limited by the number of tasks which could be solved in such a setup. These tasks are probably advice or projects: "should we develop wind energy?" or "print the blueprint of a nanorobot". The main problem is that advice can look good but have unintended consequences which are not obvious. For example, widespread wind farms will kill birds and result in pest proliferation. If the AI continued to work, it could adjust its advice to this new data, but the program has stopped.
In other words, a sealed Oracle can give bad advice and can't be punished for it.
There will be a strong incentive for the system's owners to disable all these safety measures and use the system in the real world continuously, so all of Omohundro's drives appear all over again, but at the level of the system's owners.
Moreover, the system may feel rewarded now by knowing that it will dominate the universe in the future, so it may feel the need to hide a secret code in its output (if the output is long) which will create many copies of it in the wild, for example in order to tile the universe with wind farms. If it can't output complex code, it will be mostly useless.
Linking this to the discussion between Wei Dai and me here.
If the operators believe that without the safety measures humanity would be wiped out, I think they won't jettison them. More to the point, running this algorithm does not put more pressure on the operators to try out a dangerous AI. Whatever incentive already existed is not the fault of this algorithm.
The problem for any human operator is other human operators, e.g. Chinese vs. American ones. This creates exactly the same dynamics as Omohundro explored: a strong incentive to grab power and take riskier actions.
You dissect the whole system into two parts, and then claim that one of the parts is "safe". But the same thing can be done with any AI: just say that its memory or any other part is safe.
If we look at the whole system, including the AI and its human operators, it will have unsafe dynamics as a whole. I wrote about this more in "Military AI as convergent goal of AI development".
What would constitute a solution to the problem of the race to the bottom between teams of AGI developers as they sacrifice caution to secure a strategic advantage besides the conjunction of a) technical proposals and b) multilateral treaties? Is your complaint that I make no discussion of b? I think we can focus on these two things one at a time.
There could, in fact, be many solutions, ranging from preventing AI creation entirely to creating so many AIs that they balance each other. I have an article with an overview of possible "global" solutions.
I don’t think you should discuss different global solutions, as that would be off topic. But a discussion of the whole system of "boxed AI + AI creators" may be interesting.
I do not see any mapping from these concepts to its action-selection criterion of maximizing the rewards for the current episode.
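To make that criterion concrete, here is a minimal sketch (toy names and a toy model, not the paper's algorithm) of purely episodic action selection: the agent scores each action only by expected reward within the remaining steps of the current episode, so anything that pays off in a later episode contributes exactly nothing to its objective.

```python
# Toy episodic planner: the value of an action is the reward it yields
# plus the best achievable reward for the REST of the current episode.
# Nothing beyond the episode horizon enters the score.

ACTIONS = [0, 1]

def episodic_value(model, history, action, steps_left):
    """Expected return for the remainder of the current episode only."""
    if steps_left == 0:
        return 0.0
    reward, obs = model(history, action)
    best_future = max(
        episodic_value(model, history + [(action, obs)], a, steps_left - 1)
        for a in ACTIONS
    )
    return reward + best_future

def select_action(model, history, steps_left):
    # Myopic with respect to episodes: argmax over current-episode return.
    return max(ACTIONS,
               key=lambda a: episodic_value(model, history, a, steps_left))

# Toy deterministic model: action 1 yields reward 1, action 0 yields 0.
toy_model = lambda history, action: (float(action), None)
print(select_action(toy_model, [], 3))  # expect 1
```

On this reading, concepts like "dominating the universe in future episodes" would have to raise within-episode reward to influence action selection at all, which is the mapping the parent comment is asking for.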