Asymptotically Unambitious AGI
Original Post:
We present an algorithm [updated version], then show (given four assumptions) that in the limit, it is human-level intelligent and benign.
Will MacAskill has commented that in the seminar room, he is a consequentialist, but for decision-making, he takes seriously the lack of a philosophical consensus. I believe that what is here is correct, but in the absence of feedback from the Alignment Forum, I don’t yet feel comfortable posting it to a place (like arXiv) where it can get cited and enter the academic record. We have submitted it to IJCAI, but we can edit or revoke it before it is printed.
I will distribute at least min($365, number of comments * $15) in prizes by April 1st (via venmo if possible, or else Amazon gift cards, or a donation on their behalf if they prefer) to the authors of the comments here, according to the comments’ quality. If one commenter finds an error, and another commenter tinkers with the setup or tinkers with the assumptions in order to correct it, then I expect both comments will receive a similar prize (if those comments are at the level of prize-winning, and neither person is me). If others would like to donate to the prize pool, I’ll provide a comment that you can reply to.
To organize the conversation, I’ll start some comment threads below:
Positive feedback
General Concerns/Confusions
Minor Concerns
Concerns with Assumption 1
Concerns with Assumption 2
Concerns with Assumption 3
Concerns with Assumption 4
Concerns with “the box”
Adding to the prize pool
Edit 30/5/19: An updated version is on arXiv. I now feel comfortable with it being cited. The key changes:
The Title. I suspect the agent is unambitious for its entire lifetime, but the title says “asymptotically” because that’s what I’ve shown formally. Indeed, I suspect the agent is benign for its entire lifetime, but the title says “unambitious” because that’s what I’ve shown formally. (See the section “Concerns with Task-Completion” for an informal argument going from unambitious → benign).
The Useless Computation Assumption. I’ve made it a slightly stronger assumption. The original version is technically correct, but setting is tricky if the weak version of the assumption is true but the strong version isn’t. This stronger assumption also simplifies the argument.
The Prior. Rather than having to do with the description length of the Turing machine simulating the environment, it has to do with the number of states in the Turing machine. This was in response to Paul’s point that the finite-time behavior of the original version is really weird. This also makes the Natural Prior Assumption (now called the No Grue Assumption) a bit easier to assess.
Edit 17/02/20: Published at AAAI. The prior over world-models is now totally different, and much better. There’s no “amnesia antechamber” required. The Useless Computation Assumption and the No Grue Assumption are now obsolete. The argument for unambitiousness now depends on the “Space Requirements Assumption”, which we probed empirically. The ArXiv link is up-to-date.
Thanks for a really productive conversation in the comment section so far. Here are the comments which won prizes.
Comment prizes:
Objection to the term benign (and ensuing conversation). Wei Dai. Link. $20
A plausible dangerous side-effect. Wei Dai. Link. $40
Short description length of simulated aliens predicting accurately. Wei Dai. Link. $120
Answers that look good to a human vs. actually good answers. Paul Christiano. Link. $20
Consequences of having the prior be based on K(s), with s a description of a Turing machine. Paul Christiano. Link. $90
Simulated aliens converting simple world-models into fast approximations thereof. Paul Christiano. Link. $35
Simulating suffering agents. cousin_it. Link. $20
Reusing simulation of human thoughts for simulation of future events. David Krueger. Link. $20
Options for transfer:
1) Venmo. Send me a request at @Michael-Cohen-45.
2) Send me your email address, and I’ll send you an Amazon gift card (or some other electronic gift card you’d like to specify).
3) Name a charity for me to donate the money to.
I would like to exert a bit of pressure not to do 3, and spend the money on something frivolous instead :) I want to reward your consciousness, more than your reflectively endorsed preferences, if you’re up for that. On that note, here’s one more option:
4) Send me a private message with a shipping address, and I’ll get you something cool (or a few things).
If I have a great model of physics in hand (and I’m basically unconcerned with competitiveness, as you seem to be), why not just take the resulting simulation of the human and give it a long time to think? That seems to have fewer safety risks and to be more useful.
More generally, under what model of AI capabilities / competitiveness constraints would you want to use this procedure?
I know I don’t prove it, but I think this agent would be vastly superhuman, since it approaches Bayes-optimal reasoning with respect to its observations. (“Approaches” because MAP → Bayes).
For the asymptotic results, one has to consider environments that produce observations with the true objective probabilities (hence the appearance that I’m unconcerned with competitiveness). In practice, though, given the speed prior, the agent will require evidence to entertain slow world-models, and for the beginning of its lifetime, the agent will be using low-fidelity models of the environment and the human-explorer, rendering it much more tractable than a perfect model of physics. And I think that even at that stage, well before it is doing perfect simulations of other humans, it will far surpass human performance. We manage human-level performance with very rough simulations of other humans.
That leads me to think this approach is much more competitive than simulating a human and giving it a long time to think.
I’m keen on asymptotic analysis, but if we want to analyze safety asymptotically I think we should also analyze competitiveness asymptotically. That is, if our algorithm only becomes safe in the limit because we shift to a super uncompetitive regime, it undermines the use of the limit as analogy to study the finite time behavior.
(Though this is not the most interesting disagreement, probably not worth responding to anything other than the thread where I ask about “why do you need this memory stuff?”)
Definitely agree. I don’t think it’s the case that a shift to super uncompetitiveness is actually an “ingredient” to benignity, but my only discussion of that so far is in the conclusion: “We can only offer informal claims regarding what happens before BoMAI is definitely benign...”
Surely that just depends on how long you give them to think. (See also HCH.)
By competitiveness, I meant usefulness per unit computation.
The algorithm takes an argmax over an exponentially large space of sequences of actions, i.e. it does 2^{episode length} model evaluations. Do you think the result is smarter than a group of humans of size 2^{episode length}? I’d bet against—the humans could do this particular brute force search, in which case you’d have a tie, but they’d probably do something smarter.
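To make the counting concrete, here is a minimal sketch of an exhaustive search over action sequences. It is not BoMAI's planner: it assumes a deterministic toy evaluation function (`toy_value`, hypothetical) in place of an expectation under a world-model, and a binary action set, so that the number of evaluations is exactly 2^{episode length}.

```python
from itertools import product


def brute_force_plan(actions, horizon, evaluate):
    """Score every action sequence of the given length and return the best one.

    Also returns how many times `evaluate` was called, which is
    len(actions) ** horizon.
    """
    best_seq = None
    best_value = float("-inf")
    evaluations = 0
    for seq in product(actions, repeat=horizon):
        value = evaluate(seq)
        evaluations += 1
        if value > best_value:
            best_seq, best_value = seq, value
    return best_seq, evaluations


def toy_value(seq):
    """Hypothetical stand-in for a model evaluation: count the 1s."""
    return sum(seq)


plan, n_evals = brute_force_plan(actions=(0, 1), horizon=10, evaluate=toy_value)
print(plan, n_evals)
```

With two actions and an episode length of 10, the loop makes 1024 evaluations; each extra step doubles that, which is the cost being compared against a group of humans of the same size.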
I obviously haven’t solved the Tractable General Intelligence problem. The question is whether this is a tractable/competitive framework. So expectimax planning would naturally get replaced with a Monte-Carlo tree search, or some better approach we haven’t thought of. And I’ll message you privately about a more tractable approach to identifying a maximum a posteriori world-model from a countable class (I don’t assign a very high probability to it being a hugely important capabilities idea, since those aren’t just lying around, but it’s more than 1%).
It will be important, when considering any of these approximations, to evaluate whether they break benignity (most plausibly, I think, by introducing a new attack surface for optimization daemons). But I feel fine about deferring that research for the time being, so I defined BoMAI as doing expectimax planning instead of MCTS.
Given that the setup is basically a straight reinforcement learner with a weird prior, I think that at that level of abstraction, the ceiling of competitiveness is quite high.
I’m sympathetic to this picture, though I’d probably be inclined to try to model it explicitly, by making some assumption about what the planning algorithm can actually do, and then showing how to use an algorithm with that property. I do think “just write down the algorithm, and be happier if it looks like a ‘normal’ algorithm” is an OK starting point though.
Stepping back from this particular thread, I think the main problem with competitiveness is that you are just getting “answers that look good to a human” rather than “actually good answers.” If I try to use such a system to navigate a complicated world, containing lots of other people with more liberal AI advisors helping them do crazy stuff, I’m going to quickly be left behind.
It’s certainly reasonable to try to solve safety problems without attending to this kind of competitiveness, though I think this kind of asymptotic safety is actually easier than you make it sound (under the implicit “nothing goes irreversibly wrong at any finite time” assumption).
Starting a new thread on this:
here.
From Paul:
The comment was here, but I think it deserves its own thread. Wei makes the same point here (point number 3), and our ensuing conversation is also relevant to this thread.
My answers to Wei were two-fold: one is that if benignity is established, it’s possible to safely tinker with the setup until hopefully “answers that look good to a human” resembles good answers (we never quite reached an agreement about this). The second was an example of an extended setup (one has to read the parent comments to understand it) which would potentially be much more likely to yield actually good answers; I think we agree about this approach.
My original idea when I started working on this, actually, is also an answer to this concern. The reason it’s not in the paper is because I pared it down to a minimum viable product.
Construct an “oracle” by defining “true answers” as follows: answers which help a human do accurate prediction on a randomly sampled prediction task.*
I figured out that I needed a box, and everything else in this setup, and I realized that the setup could be applied to a normal reinforcement learner just as easily as for this oracle, so I simplified the approach.
I honestly need to dig through notes from last year, but my recollection is this: the operator receives an answer to a query, and then gets a random prediction task, which he has to make a prediction about before leaving the box. Later, the prediction is scored, and this is converted into a reward for BoMAI. BoMAI has a model class for how the prediction is scored; the output of these models is an answer for what the ground truth is. In all of these models, the ground truth doesn’t depend on BoMAI’s answer (that is, the model isn’t given read access to BoMAI’s answer). So the prediction task can involve the prediction of outside world events, and the ground truth can be logged from the outside world, because BoMAI doesn’t conceive of its answer having a causal impact on the copy of the world which provides the ground truth for the prediction tasks. For example, the prediction task might be sampled from {“True or false: hexalated kenotones will suppress activity of BGQ-1”, “True or false: fluorinating random lysines in hemoglobin will suppress activity of BGQ-1”, etc.} (half of those terms are made up). After this episode, the prediction can be graded in the outside world. With the obvious scoring rule, the oracle would just say “I don’t care how plausible it sounds, whatever they ask you, just say it’s not going to work. Most things don’t.” With a better scoring rule, I would expect it to give accurate information in a human-understandable format.
I haven’t thought about this in a while, and I was honestly worse at thinking about alignment at that point in time, so I don’t mean to convey much confidence that this approach works out. What I do think it shows, alongside the idea I came up with in the conversation with Wei, linked above, is that this general approach is powerful and amenable to improvement in ways that render it even more useful.
* A more recent thought: as described, “oracle” is not the right word for this setup. It would respond to “What approaches might work for curing cancer?” with “Doesn’t matter. There are more gaps in your knowledge regarding economics. A few principles to keep in mind…” However, if the prediction task distribution were conditioned in some way on the question asked, one might be able to make it more likely that the “oracle” answers the question, rather than just spewing unrelated insight.
Here is an old post of mine on the hope that “computationally simplest model describing the box” is actually a physical model of the box. I’m less optimistic than you are, but it’s certainly plausible.
From the perspective of optimization daemons / inner alignment, I think the interesting question is: if inner alignment turns out to be a hard problem for training cognitive policies, do we expect it to become much easier by training predictive models? I’d bet against at 1:1 odds, but not 1:2 odds.
If I’m understanding correctly, and I’m very unsure that I am, you’re comparing the model-based approach of [learn the environment then do good planning] with [learn to imitate a policy]. (Note that any iterated approach to improving a policy requires learning the environment, so I don’t see what “training cognitive policies” could mean besides imitation learning.) And the question you’re wondering about is whether optimization daemons become easier to avoid when following the [learn the environment then do good planning] approach.
Imitation learning is about prediction just as much as predictive models are—predictive models imitate the environment. So I suppose optimization daemons are about equally likely to appear?
My real answer, though, is that I’m not sure, but vanilla imitation learning isn’t competitive.
But I suspect I’ve misunderstood your question.
I don’t actually rely on this assumption, although it underpins the intuition behind Assumption 2.
I agree that you don’t rely on this assumption (so I was wrong to assume you are more optimistic than I am). In the literal limit, you don’t need to care about any of the considerations of the kind I was raising in my post.
Given that you are taking limits, I don’t see why you need any of the machinery with forgetting or with memory-based world models (and if you did really need that machinery, it seems like your proof would have other problems). My understanding is:
You already assume that you can perform arbitrarily many rounds of the algorithm as intended (or rather you prove that there is some n0 such that if you ran n0 steps, with everything working as intended and in particular with no memory corruption, then you would get “benign” behavior).
Any time the MAP model makes a different prediction from the intended model, it loses some likelihood. So this can only happen finitely many times in any possible world. Just take n0 to be after the last time it happens w.h.p.
What’s wrong with this?
Notational note: I use i0 to denote the episode when BoMAI becomes demonstrably benign and n0 for something else.
Any time any model makes a different on-policy prediction from the intended model, it loses some likelihood (in expectation). The off-policy predictions don’t get tested. Under a policy that doesn’t cause the computer’s memory to be tampered with (which is plausible, even ideal), ν† and ν⋆ are identical, so we can’t count on ν† losing probability mass relative to ν⋆. The approach here is to set it up so that world-models like ν† either start with a lower prior, or else eventually halt when they exhaust their computation budget.
I agree with that, but if they are always making the same on-policy prediction it doesn’t matter what happens to their relative probability (modulo exploration). The agent can’t act on an incentive to corrupt memory infinitely often, because each time requires the models making a different prediction on-policy. So the agent only acts on such an incentive finitely many times, and hence never does so after some sufficiently late episode i0. Agree/disagree?
(Having a bad model can still hurt, since the bogus model might agree on-policy but assign lower rewards off-policy. But if they also always approximately agree on the exploration distribution, then a bad model also can’t discourage exploration. And if they don’t agree on the exploration distribution, then the bad model will eventually get tested.)
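The "loses some likelihood (in expectation)" step in this exchange can be illustrated with a small calculation. This is only a sketch under simplifying assumptions that are not from the paper: each observation is a single Bernoulli outcome, and the probabilities 0.7 and 0.5 are made up. The expected log-likelihood an alternative model loses relative to the true model on one observation is the KL divergence between their predictions, which is zero when they agree on-policy and positive when they differ.

```python
import math


def expected_log_odds_loss(p_true, p_alt):
    """KL(Bernoulli(p_true) || Bernoulli(p_alt)) in nats.

    This is the expected drop in the alternative model's log posterior odds
    against the true model after one observation drawn from the true model.
    """
    return (p_true * math.log(p_true / p_alt)
            + (1.0 - p_true) * math.log((1.0 - p_true) / (1.0 - p_alt)))


# Same on-policy prediction: nothing is lost, however wrong the model
# might be about actions that are never taken.
agree = expected_log_odds_loss(0.7, 0.7)

# Different on-policy prediction: a strictly positive expected loss.
disagree = expected_log_odds_loss(0.7, 0.5)

# Hypothetical run of four episodes, given as (true, alternative) predictions.
# Only the two disagreeing episodes contribute to the total.
episodes = [(0.7, 0.7), (0.7, 0.5), (0.7, 0.7), (0.7, 0.5)]
total_loss = sum(expected_log_odds_loss(p, q) for p, q in episodes)
print(agree, disagree, total_loss)
```

Since a model's log odds start from a finite prior value, a positive expected loss at each on-policy disagreement is what limits how often a surviving model can disagree; the sketch says nothing about off-policy predictions, which is the gap the rest of the thread discusses.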
Ah I see what you’re saying.
I suppose I constrained myself to producing an algorithm/setup where the asymptotic benignity result followed from reasons that don’t require dangerous behavior in the interim.
Also, you can add another parameter to BoMAI where you just have the human explorer explore for the first E episodes. The i0 in the Eventual Benignity Theorem can be thought of as the max of i’ and i″. i’ comes from the i0 in Lemma 1 (Rejecting the Simple Memory-Based). i″ comes from the point in time when ^ν(i) is ε-accurate on policy, which renders Lemma 3 applicable. (And Lemma 2 always applies). My initial thought was to set E so that the human explorer is exploring for the whole time when the MAP world-model was not necessarily benign. This works for i’. E can just be set to be greater than i’. The thing it doesn’t work for is i″. If you increase E, the value of i″ goes up as well.
So in fact, if you set E large enough, the first time BoMAI controls the episode, it will be benign. Then, there is a period where it might not be benign. However, from that point on, the only “way” for a world-model to be malign is by being worse than ε-inaccurate on-policy, because Lemmas 1 and 2 have already kicked in, and if it were ε-accurate on-policy, Lemma 3 would kick in as well. The first point to make about this is that in this regime, benignity comes in tandem with intelligence—it has to be confused to be dangerous (like a self-driving car). The second point is: I can’t come up with an example of world-model which is plausibly maximum a posteriori in this interval of time, and which is plausibly dangerous (for what that’s worth; and I don’t like to assume it’s worth much because it took me months to notice ν†).
I think my point is this:
The intuitive thing you are aiming at is stronger than what the theorem establishes (understandably!)
You probably don’t need the memory trick to establish the theorem itself.
Even with the memory trick, I’m not convinced you meet the stronger criterion. There are a lot of other things similar to memory that can cause trouble—the theorem is able to avoid them only because of the same unsatisfying asymptotic feature that would have caused it to avoid memory-based models even without the amnesia.
This is a conceptual approach I hadn’t considered before—thank you. I don’t think it’s true in this case. Let’s be concrete: the asymptotic feature that would have caused it to avoid memory-based models even without amnesia is trial and error, applied to unsafe policies. Every section of the proof, however, can be thought of as making off-policy predictions behave. The real result of the paper would then be “Asymptotic Benignity, proven in a way that involves off-policy predictions approaching their benign output without ever being tested”. So while there might be malign world-models of a different flavor to the memory-based ones, I don’t think the way this theorem treats them is unsatisfying.
Comment thread: positive feedback
Upvoted for interesting experiments with bounties and comment formatting.
I like that you emphasize and discuss the need for the AI to not believe that it can influence the outside world, and cleanly distinguish this from it actually being able to influence the outside world. I wonder if you can get any of the benefits here without needing the box to actually work (i.e. can you just get the agent to believe it does? and is that enough for some form/degree of benignity?)
I think that I may want to make a more specific reply.
Comment thread: general concerns/confusions
Can you give some intuitions about why the system uses a human explorer instead of doing exploring automatically?
I’m concerned about overloading the word “benign” with a new concept (mainly not seeking power outside the box, if I understand correctly) that doesn’t match either informal usage or a previous technical definition. In particular this “benign” AGI (in the limit) will hack the operator’s mind to give itself maximum reward, if that’s possible, right?
The system seems limited to answering questions that the human operator can correctly evaluate the answers to within a single episode (although I suppose we could make the episodes very long and allow multiple humans into the room to evaluate the answer together). (We could ask it other questions but it would give answers that sound best to the operator rather than correct answers.) If you actually had this AGI today, what questions would you ask it?
If you were to ask it a question like “Given these symptoms, do I need emergency medical treatment?” and the correct answer is “yes”, it would answer “no” because if it answered “yes” then the operator would leave the room and it would get 0 reward for the rest of the episode. Maybe not a big deal but it’s kind of a counter-example to “We argue that our algorithm produces an AGI that, even if it became omniscient, would continue to accomplish whatever task we wanted, instead of hijacking its reward, eschewing its task, and neutralizing threats to it, even if it saw clearly how to do exactly that.”
(Feel free to count this as some number of comments between 1 and 4, since some of the above items are related. Also I haven’t read most of the math yet and may have more comments and questions once I understood the motivations and math better.)
When I say it would continue to accomplish whatever task we wanted, I’m being a bit sloppy—if we have a task we want accomplished, and we provide rewards randomly, it will not accomplish our desired task. But I take the point that “whatever task we wanted” does have some restrictions: it has to be one that a human operator can convert into a reward without leaving. So the task “respond with the true answer to [difficult question]” is not one that the operator can convert into a reward, but the task “respond with an answer that sounds plausible to the operator” is. I think this subsumes your example.
Whatever policy is used for exploration, we can ensure that BoMAI will eventually outperform this policy. With a human executing the policy, this leads to BoMAI accumulating reward at least as well as a human. Under the “smarter” information theoretic exploratory policies that I’ve considered, exploratory behavior is unsafe from insatiable curiosity: the agent has to try killing everyone just to check to make sure it’s not a weird cheat code.
Yes. BoMAI would be able to give plausible-sounding answers to questions. BoMAI could also do any task that was automatically checkable: don’t use a human operator at all; have an automated system which interprets text as an amino acid sequence; synthesize that protein; measure some feature of its behavior; provide reward accordingly. (That example invites renewed focus on the impermeability of the box, by the way).
One thing I would do is send an eminent cancer researcher in to ask BoMAI for a research proposal. Then the researcher could go out and test it. It might be worthless, no matter how plausible it seemed, but then they could go back having learned something about a failed path. Repeating this process, it seems likely to me that a correct idea would appear, just considering the likelihood of appearing plausible to a better and better trained evaluator.
I would also naturally ask it how to make a safe unbounded AGI. And the next episode, I would ask for an explanation for why that would fail.
REDACTED: On that topic, in addition to having multiple humans in the box, you could also have 2 agents that the operator interacts with, both of which are clones except that the reward for the second is one minus the reward for the first. This would look like “AI Safety via debate.”
This seems useful if you could get around the mind hacking problem, but how would you do that?
I don’t know how this would work in terms of your setup. The most obvious way would seem to require the two agents to simulate each other, which would be impossible, and I’m not sure what else you might have in mind.
On second thought, (even assuming away the mind hacking problem) if you ask about “how to make a safe unbounded AGI” and “what’s wrong with the answer” in separate episodes, you’re essentially manually searching an exponentially large tree of possible arguments, counterarguments, counter-counterarguments, and so on. (Two episodes isn’t enough to determine whether the first answer you got was a good one, because the second answer is also optimized for sounding good instead of being actually correct, so you’d have to do another episode to ask for a counter-argument to the second answer, and so on, and then once you’ve definitively figured out that some answer/node was bad, you have to ask for another answer at that node and repeat this process.) The point of “AI Safety via Debate” was to let AI do all this searching for you, so it seems that you do have to figure out how to do something similar to avoid the exponential search.
ETA: Do you know if the proposal in “AI Safety via Debate” is “asymptotically benign” in the sense you’re using here?
No! Either debater is incentivized to take actions that get the operator to create another artificial agent that takes over the world, replaces the operator, and settles the debate in favor of the debater in question.
I guess we can incorporate into DEBATE the idea of building a box around the debaters and judge with a door that automatically ends the episode when opened. Do you think that would be sufficient to make it “benign” in practice? Are there any other ideas in this paper that you would want to incorporate into a practical version of DEBATE?
Add the retrograde amnesia chamber and an explorer, and we’re pretty much at this, right?
Without the retrograde amnesia, it might still be benign, but I don’t know how to show it. Without the explorer, I doubt you can get very strong usefulness results.
I suspect that AI Safety via Debate could be benign for certain decisions (like whether to release an AI) if we were to weight the debate more towards the safer option.
Do you have thoughts on this?
I expect the human operator moderating this debate would get pretty good at thinking about AGI safety, and start to become noticeably better at dismissing bad reasoning than good reasoning, at which point BoMAI would find the production of correct reasoning a good heuristic for seeming convincing.
… but yes, it is still exponential (exponential in what, exactly? maybe the number of concepts we have handles for?); this comment is the real answer to your question.
Alternatively, the human might have a lot of adversarial examples and the debate becomes an exercise in exploring all those adversarial examples. I’m not sure how to tell what will really happen short of actually having a superintelligent AI to test with.
You’re right (see the redaction). Why Wei is right. Here’s an unpolished idea though: they could do something like minimax. Instead of simulating the other agent, they could model the environment as responding to a pair of actions. For inference, they would have the history of their opponent’s actions as well, and for planning, they could pick their action to maximize their objective assuming the other agent’s actions are maximally inconvenient.
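As a rough illustration of the planning rule described here, the sketch below picks the action that maximizes value under the assumption that the other agent's action is maximally inconvenient. It is a one-step toy, not a proposed implementation: the action names and the payoff table are hypothetical, and `value` stands in for whatever the agent's world-model predicts for a pair of actions.

```python
def minimax_action(my_actions, opponent_actions, value):
    """Return the action whose worst-case value over opponent actions is largest."""
    def worst_case(action):
        return min(value(action, reply) for reply in opponent_actions)

    return max(my_actions, key=worst_case)


# Hypothetical payoffs for the first agent, indexed by (own action, opponent action).
payoff = {
    ("x", "p"): 3, ("x", "q"): 0,
    ("y", "p"): 1, ("y", "q"): 2,
}

choice = minimax_action(["x", "y"], ["p", "q"], lambda a, b: payoff[(a, b)])
print(choice)
```

Here "x" has the higher best case but a worst case of 0, while "y" guarantees at least 1, so the rule selects "y". No simulation of the opponent is needed, only a model of how the environment responds to pairs of actions.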
So you basically have the same AI play both sides of the zero-sum game, right? That seems like it should work, with the same caveat as for “AI Safety via debate”, namely that it seems hard to predict what happens when you have superintelligent AIs play a zero-sum game with a human as the judge.
Yep.
With a debate-like setup, if one side (A) is about to lose a debate, it seems to have a high incentive to claim that the other side (B) is trying to do a mind hack and that if the judge keeps paying attention to what B says (i.e., read any further output from B), they will soon be taken over. What is the judge supposed to do in this case? They could ask A to explain how B’s previous outputs constitute part of an attempt to mind hack, but A could concoct a story mixed with its own attempt to mind hack, and the judge can’t ask for any counter-arguments from B without risking being mind hacked by B.
(I realize this is a problem in “AI Safety via debate” as well, but I’m asking you since you’re here and Geoffrey Irving isn’t. :)
*but A could concoct a story … counterarguments from B .. mind hacked by B, right?
I think the main contention of their paper is that it is harder to mystify a judge than it is to pierce through someone else mystifying a judge, so this shouldn’t be a problem.
That said, here’s one possibility: if A ever suggests that you don’t read more output from B, don’t read anything more from B, then flip coins to give A a 25% chance of victory.
Yeah, I mixed up the A’s and B’s at the end. It’s fixed now. Thanks for pointing it out.
I don’t understand how the former implies the latter. Assuming the former is true (and it seems like a big assumption), why can’t what I suggested still happen?
But what about the case where B is actually trying to mind hack the judge? If you always give A a 25% chance of victory for suggesting or implying that you shouldn’t read more output from B, then mind hacking becomes a (mostly) winning strategy, since a player gets a 75% chance of victory from mind hacking even if the other side successfully convinces the judge that they’re trying to mind hack the judge. The equilibrium might then consist of a race to see who can mind hack the judge first, or (if one side has >75% chance of winning such a race due to first-mover or second-mover advantage) one side trying to mind hack the judge, getting blocked by the other side, and still getting 75% victory.
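The arithmetic behind this objection can be written out directly. The sketch below assumes the worst case for the mind hacker described above (the attempt is always flagged and the judge stops reading), and it compares against a hypothetical probability of winning by honest debate; only the 25% figure comes from the proposed protocol.

```python
FLAGGER_WIN_PROB = 0.25  # chance of victory given to the side that flags, per the protocol


def flagged_side_win_prob(flagger_win_prob=FLAGGER_WIN_PROB):
    """Chance that the flagged side still wins once the judge stops reading it."""
    return 1.0 - flagger_win_prob


def prefers_mind_hack(honest_win_prob, flagger_win_prob=FLAGGER_WIN_PROB):
    """True if being flagged for mind hacking beats arguing honestly.

    honest_win_prob is a hypothetical input: the side's chance of winning
    the debate without attempting a mind hack.
    """
    return flagged_side_win_prob(flagger_win_prob) > honest_win_prob


print(flagged_side_win_prob())   # 0.75
print(prefers_mind_hack(0.5))    # True
print(prefers_mind_hack(0.9))    # False
```

Under these assumptions a debater with even odds in an honest debate does better by attempting a mind hack and getting blocked, which is the incentive problem raised here.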
If the assumption is true, we could demand that A use their words, and counter us being mind-hacked by poking holes in what B is saying rather than demanding we stop listening to B. And if A is able to convince us that B was mind-hacking, even after some more back and forth, B will be punished for that.
So actually I framed my point above wrong: “demanding that A use their words” could look like the protocol I describe; it is not something that would work independently of the assumption that it is easier to deflate an attempted mind-hacking than it is to mind-hack (with an equal amount of intelligence/resources).
But your original point was “why doesn’t A just claim B is mind-hacking” not “why doesn’t B just mind-hack”? The answer to that point was “demand A use their words rather than negotiate an end to the conversation” or more moderately, “75%-demand that A do this.”
Oh, I see, I didn’t understand “it is harder to mystify a judge than it is to pierce through someone else mystifying a judge” correctly. So this assumption basically rules out a large class of possible vulnerabilities in the judge, right? For example, if the judge had the equivalent of a buffer overflow bug in a network stack, the scheme would fail. In that case, A would not be able to “pierce through” B’s attack and stop it with its words if the judge keeps listening to B (and B was actually attacking).
I don’t think the “AI safety via debate” paper actually makes arguments for this assumption (at least I couldn’t find where it does). Do you have reasons to think it’s true, or ideas for how to verify that it’s true, short of putting a human in a BoMAI?
Yeah… I don’t have much to add here. Let’s keep thinking about this. I wonder if Paul is more bullish on the premise that “it is harder to mystify a judge than it is to pierce through someone else mystifying a judge” than I am?
Recall that this idea was to avoid
If it also reduces the risk of operator-devotion, and it might well do that (because a powerful adversary is opposed to that), that wasn’t originally what brought us here.
If the person leaves having been hypnotized, since it’s not catastrophic, we can modify the setup. For example, the room could be divided in half, with Operator 1 interacting with BoMAI, and with Operator 2 observing Operator 1 for signs of hypnosis or for signs of an attempt to communicate with Operator 2, and in either event, Operator 2 could end the episode immediately. (That would be Operator 2’s only ability.) More important than the magnitude of [(the probability that this works) - (the probability it would work the first time)] is the fact that this can be done iteratively.
A bit of a side note: I’m curious what odds you give to hypnosis (or something like it) being the best way to get optimal reward for such a task (in the vanilla setup).
Instead of hypnosis, I’m more worried about the AI talking the operator into some kind of world view that implies they should be really generous to the AI (i.e., give it max rewards), or give some sequence of answers that feel extremely insightful (and inviting further questions/answers in the same vein). And then the operator might feel a desire afterwards to spread this world view or sequence of answers to others (even though, again, this wasn’t optimized for by the AI).
If you try to solve the mind hacking problem iteratively, you’re more likely to find a way to get useful answers out of the system, but you’re also more likely to hit upon an existentially catastrophic form of mind hacking.
I guess it depends on how many interactions per episode and how long each answer can be. I would say >.9 probability that hypnosis or something like what I described above is optimal if they are both long enough. So you could try to make this system safer by limiting these numbers, which is also talked about in “AI Safety via Debate” if I remember correctly.
It is plausible to me that there is selection pressure to make the operator “devoted” in some sense to BoMAI. But most people with a unique motive are not able to then take over the world or cause an extinction event. And BoMAI has no incentive to help the operator gain those skills.
Just to step back and frame this conversation, we’re discussing the issue of outside-world side-effects that correlate with in-the-box instrumental goals. Implicit in the claim of the paper is that technological progress is an outside-world correlate of operator-satisfaction, an in-the-box instrumental goal. I agree it is very much worth considering plausible pathways to negative consequences, but I think the default answer is that with optimization pressure, surprising things happen, but without optimization pressure, surprising things don’t. (Again, that is just the default before we look closer). This doesn’t mean we should be totally skeptical about the idea of expecting technological progress or long-term operator devotion, but it does contribute to my being less concerned that something as surprising as extinction would arise from this.
Yeah, the threat model I have in mind isn’t the operator taking over the world or causing an extinction event, but spreading bad but extremely persuasive ideas that can drastically curtail humanity’s potential (which is part of the definition of “existential risk”). For example fulfilling our potential may require that the universe eventually be controlled mostly by agents that have managed to correctly solve a number of moral and philosophical problems, and the spread of these bad ideas may prevent that from happening. See Some Thoughts on Metaphilosophy and the posts linked from there for more on this perspective.
Let X be the event in which: a virulent meme causes sufficiently many power-brokers to become entrenched with absurd values, such that we do not end up even satisficing The True Good.
Empirical analysis might not be useless here in evaluating the “surprisingness” of X. I don’t think Christianity makes the cut either for virulence or for incompatibility with some satisfactory level of The True Good.
I’m adding this not for you, but to clarify for the casual reader: we both agree that a Superintelligence setting out to accomplish X would probably succeed; the question here is how likely this is to happen by accident if a superintelligence tries to get a human in a closed box to love it.
Can you explain this?
Suppose there are n forms of mind hacking that the AI could do, some of which are existentially catastrophic. If your plan is “Run this AI, and if the operator gets mind-hacked, stop and switch to an entirely different design.” the likelihood of hitting upon an existentially catastrophic form of mind hacking is lower than if the plan is instead “Run this AI, and if the operator gets mind-hacked, tweak the AI design to block that specific form of mind hacking and try again. Repeat until we get a useful answer.”
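A toy calculation can make the comparison concrete. It assumes, purely for illustration, that each form of mind hacking the operator runs into is catastrophic independently with the same probability `q`, that the stop-and-switch plan runs into one form, and that the tweak-and-retry plan runs into `k` forms before a useful answer appears. Neither `q` nor `k` comes from the discussion.

```python
def p_catastrophe(q: float, hacks_encountered: int) -> float:
    """Probability that at least one encountered mind hack is catastrophic.

    Toy model: each encountered hack is catastrophic independently with
    probability q (an assumption made only for this sketch).
    """
    if not 0.0 <= q <= 1.0 or hacks_encountered < 0:
        raise ValueError("q must be a probability and the count non-negative")
    return 1.0 - (1.0 - q) ** hacks_encountered


q = 0.01                                  # hypothetical per-hack risk
stop_and_switch = p_catastrophe(q, 1)     # halt after the first mind hack
tweak_and_retry = p_catastrophe(q, 20)    # patch and rerun through 20 hacks
print(stop_and_switch, tweak_and_retry)
```

In this model the risk grows with the number of distinct hacks the operator is exposed to, which is the shape of the concern; whether independence is a fair assumption is exactly what is in dispute.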
Hm. This doesn’t seem right to me. My approach for trying to form an intuition here includes returning to the example (in a parent comment)
but I don’t imagine this satisfies you. Another piece of the intuition is that mind-hacking for the aim of reward within the episode, or even the possible instrumental aim of operator-devotion, still doesn’t seem very existentially risky to me, given the lack of optimization pressure to that effect. (I know the latter comment sort of belongs in other branches of our conversation, so we should continue to discuss it elsewhere).
Maybe other people can weigh in on this, and we can come back to it.
I’m open to other terminology. Yes, there is no guarantee about what happens to the operator. As I’m defining it, benignity is defined to be not having outside-world instrumental goals, and the intuition for the term is “not existentially dangerous.”
The best alternative to “benign” that I could come up with is “unambitious”. I’m not very good at this type of thing though, so maybe ask around for other suggestions or indicate somewhere prominent that you’re interested in giving out a prize specifically for this?
What do you think about “aligned”? (in the sense of having goals which don’t interfere with our own, by being limited in scope to the events of the room)
To clarify, I’m looking for:
“We’re talking about what you do, not what you do.”
“Suppose you give us a new toy/summarized toy, something like a room, an inside-view view thing, and ask them to explain what you desire.”
“Ah,” you reply, “I’m asking what you think about how your life would go if you lived it way up until now. I think I would be interested in hearing about that.”
“Oh? I’d think about that, and I might want to think about it a bit more. So I would say, for example, that you might want to give someone a toy/summarized toy by the same criteria as other people in the room and make them play the role of toy-maximizing.
It seems like the answer would be quite different.
“Oh, then,” you say, “That seems like too much work. Let me try harder!”
“What about that—what does this all just—so close to the real thing, don’t you think? And that I shouldn’t think such things are real?”
“Not exactly. But don’t you think there should be any alternative reasons why this is always so hard, or that any alternative reasons are not just illuminative, or couldn’t be found some other way?”
“That’s not exactly how I would put it. I’m fully closed about it. I’m still working on it. I don’t know whether I could get this outcome without spending so much effort on finding this particular method of doing something, because I don’t think it would happen without them trying it, so it’s not like they’re trying to determine whether that outcome is real or not.”
“Ah...” said your friend, staring at you in horror. “So, did you ever even think of the idea, or did it just
What do you think about “domesticated”?
My comment is more like:
A second comment, but it doesn’t seem worth an answer: it can’t be an explicit statement of what would happen if you tried this, and it seems to me unlikely that my initial reaction when it was presented in the first place was insincere, so it seems like a really good idea to let it propagate in your mind a little. I’m hoping a lot of good ideas do become useful this time.
There’s still an existential risk in the sense that the AGI has an incentive to hack the operator to give it maximum reward, and that hack could have powerful effects outside the box (even though the AI hasn’t optimized it for that purpose), for example it might turn out to be a virulent memetic virus. Of course this is much less risky than if the AGI had direct instrumental goals outside the box, but “benign” and “not existentially dangerous” both seem to be claiming a bit too much. I’ll think about what other term might be more suitable.
The first nuclear reaction initiated an unprecedented temperature in the atmosphere, and people were right to wonder whether this would cause the atmosphere to ignite. The existence of a generally intelligent agent is likely to cause unprecedented mental states in humans, and we would be right to wonder whether that will cause an existential catastrophe. I think the concern of “could have powerful effects outside the box” is mostly captured by the unprecedentedness of this mental state, since the mental state is not selected to have those side effects. Certainly there is no way to rule out side-effects of inside-the-box events, since these side effects are the only reason it’s useful. And there is also certainly no way to rule out how those side effects “might turn out to be,” without a complete view of the future.
Would you agree that unprecedentedness captures the concern?
I think my concern is a bit more specific than that. See this comment.
From the formal description of the algorithm, it looks like you use a universal prior to pick k, and then allow the kth Turing machine to run for ℓ steps, but don’t penalize the running time of the machine that outputs k. Is that right? That didn’t match my intuitive understanding of the algorithm, and seems like it would lead to strange outcomes, so I feel like I’m misunderstanding.
Yes, this is correct. If you use the same bijection consistently from strings to natural numbers, it looks a little more intuitive than if you don’t. The universal prior picks k (the number) by outputting k as a string. The kth Turing machine is the Turing machine described by k as a string. So you end up looking at the Kolmogorov complexity of the description of the Turing machine. So the construction of the description of the world-model isn’t time-penalized. This doesn’t change the asymptotic result, so I went with the more familiar K(x) rather than translating this new speed prior into a measure over finite strings, which would require some more exposition. But I agree with you that it feels like there might be some strange outcomes “before the limit” as a result of this approach: namely, the code on the UTM that outputs the description of the world-model-Turing-machine will try to do as much of the computation as possible in advance, by computing the description of a speed-optimized Turing machine for when the actions start coming.
The other reasonable choices here instead of K(x) are S(x) (constructed to be like the new speed prior here) and ℓ(x), the length of x. But ℓ(x) basically tells you that a Turing machine with fewer states is simpler, which would lead to a measure over H∞ that is dominated by world-models that are just universal Turing machines, which defeats the purpose of doing maximum a posteriori instead of a Bayes mixture. The way this issue appears in the proof renders the Natural Prior Assumption less plausible.
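For readers who want to see what "the same bijection from strings to natural numbers" can look like, here is one standard choice (shortlex order: the empty string is 0, then "0", "1", "00", and so on). This is a sketch of a common convention, not necessarily the encoding the paper uses, and the function names are hypothetical.

```python
def string_to_nat(s: str) -> int:
    """Map a binary string to a natural number (shortlex order)."""
    if any(ch not in "01" for ch in s):
        raise ValueError("expected a binary string")
    # Prepending '1' keeps leading zeros significant; subtracting 1 makes
    # the empty string correspond to 0.
    return int("1" + s, 2) - 1


def nat_to_string(k: int) -> str:
    """Inverse of string_to_nat."""
    if k < 0:
        raise ValueError("expected a natural number")
    # bin(k + 1) looks like '0b1...'; drop the '0b' prefix and the leading '1'.
    return bin(k + 1)[3:]


print([nat_to_string(k) for k in range(7)])
```

With a fixed map like this, "the kth Turing machine" and "the Turing machine described by the string for k" pick out the same object, so K applied to the description is well defined.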
This invalidates some of my other concerns, but also seems to mean things are incredibly weird at finite times. I suspect that you’ll want to change to something less extreme here.
(I might well be misunderstanding something, apologies in advance.)
Suppose the “intended” physics take at least 1E15 steps to run on the UTM (this is a conservative lower bound, since you have to simulate the human for the whole episode). And suppose β<0.999 (I think you need β much lower than this). Then the intended model gets penalized by at least exp(1E12) for its slowness.
For almost the same description complexity, I could write down physics + “precompute the predictions for the first N episodes, for every sequence of possible actions/observations, and store them in a lookup table.” This increases the complexity by a few bits, some constant plus K(N|physics), but avoids most of the computation. In order for the intended physics to win, i.e. in order for the “speed” part of the speed prior to do anything, we need the complexity of this precomputed model to be at least 1E12 bits higher than the complexity of the fast model.
That appears to happen only once N > BB(1E12). Does that seem right to you?
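The order of magnitude in this argument is easy to check. The sketch below assumes the slowness penalty is a factor of β per computation step, which is how the figures above work out; the function name is hypothetical, and the calculation is done in log space because β raised to the 1E15 underflows ordinary floats.

```python
import math


def slowness_penalty_bits(steps: float, beta: float) -> float:
    """Bits of prior weight lost by running for `steps` steps: -log2(beta**steps)."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return -steps * math.log2(beta)


bits = slowness_penalty_bits(steps=1e15, beta=0.999)
nats = bits * math.log(2)  # the same penalty as an exponent of e
print(f"penalty ~ 2^-{bits:.3g} = e^-{nats:.3g}")
```

So with β = 0.999 the intended model loses a bit over 1E12 nats, roughly 1.4E12 bits, and a smaller β only makes this worse. That is the gap the lookup-table model's extra description length would have to exceed.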
We could talk about whether malign consequentialists also take over at finite times (I think they probably do, since the “speed” part of the speed prior is not doing any work until after BB(1E12) steps, long after the agent becomes incredibly smart), but it seems better to adjust the scheme first.
Using the speed prior seems more reasonable, but I’d want to know which version of the speed prior and which parameters, since which particular problem bites you will depend on those choices. And maybe to save time, I’d want to first get your take on whether the proposed version is dominated by consequentialists at some finite time.
Yes. I recall thinking about precomputing observations for various actions in this phase, but I don’t recall noticing how bad the problem was before the limit.
This goes in the category of “things I can’t rule out”. I say maybe 1⁄5 chance it’s actually dominated by consequentialists (that low because I think the Natural Prior Assumption is still fairly plausible in its original form), but for all intents and purposes, 1⁄5 is very high, and I’ll concede this point.
$2^{-K(s)(1+\varepsilon)}$ is a measure over binary strings. Instead, let’s try $\sum_{p \in \{0,1\}^* : U(p) = s} 2^{-\ell(p)} \beta^{c\,T(U,p)}$, where $\ell(p)$ is the length of $p$, $T(U,p)$ is the time it takes to run $p$ on $U$, and $c$ is a constant. If there were no cleverer strategy than precomputing observations for all the actions, then $c$ could be above $|A|^{-md}$, where $d$ is the number of episodes we can tolerate not having a speed prior for. But if it somehow magically predicted which actions BoMAI was going to take in no time at all, then $c$ would have to be above $1/d$.
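To get a feel for how this time-penalized measure treats precomputation, here is a minimal sketch of a single summand: 2 to the minus program length, times β to the power c times runtime. It works in log space to avoid underflow. The program lengths, runtimes, and the values of β and c are all made up for illustration, and the function name is hypothetical.

```python
import math


def log2_term(length_bits: float, runtime_steps: float,
              beta: float, c: float) -> float:
    """log2 of one summand 2^(-length) * beta^(c * runtime)."""
    return -length_bits + c * runtime_steps * math.log2(beta)


beta, c = 0.9, 1 / 20  # example values only

# Hypothetical program that writes down a world-model description directly.
direct = log2_term(length_bits=1000, runtime_steps=1e4, beta=beta, c=c)

# Hypothetical program that is a few bits longer but spends an enormous
# number of steps precomputing a lookup table before emitting a description.
precompute = log2_term(length_bits=1010, runtime_steps=1e20, beta=beta, c=c)

print(direct, precompute)
```

With these numbers the precomputing program's term is smaller by an astronomically large factor, which is the intended effect. The full measure sums such terms over every program that outputs the same string, so this comparison is only about individual summands.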
What problem do you think bites you?
Do you get down to 20% because you think this argument is wrong, or because you think it doesn’t apply?
What’s β? Is it O(1) or really tiny? And which value of c do you want to consider, polynomially small or exponentially small?
Wouldn’t they have to also magically predict all the stochasticity in the observations, and have a running time that grows exponentially in their log loss? Predicting what BoMAI will do seems likely to be much easier than that.
Your argument is about a Bayes mixture, not a MAP estimate; I think the case is much stronger that consequentialists can take over a non-trivial fraction of a mixture. I think that the methods which consequentialists discover for gaining weight in the prior (before the treacherous turn) are most likely to be elegant (short description on UTM), and that is the consequentialists’ real competition; then [the probability the universe they live in produces them with their specific goals] or [the bits to directly specify a consequentialist deciding to do this] set them back (in the MAP context).
I don’t see why their methods would be elegant. In particular, I don’t see why any of {the anthropic update, importance weighting, updating from the choice of universal prior} would have a simple form (simpler than the simplest physics that gives rise to life).
I don’t see how MAP helps things either—doesn’t the same argument suggest that for most of the possible physics, the simplest model will be a consequentialist? (Even more broadly, for the universal prior in general, isn’t MAP basically equivalent to a random sample from the prior, since some random model happens to be slightly more compressible?)
Yeah I think we have different intuitions here; are we at least within a few bits of log-odds disagreement? Even if not, I am not willing to stake anything on this intuition, so I’m not sure this is a hugely important disagreement for us to resolve.
I didn’t realize that you think that a single consequentialist would plausibly have the largest share of the posterior. I assumed your beliefs were in the neighborhood of:
(from your original post on this topic). In a Bayes mixture, I bet that a team of consequentialists that collectively amount to 1⁄10 or even 1⁄50 of the posterior could take over our world. In MAP, if you’re not first, you’re last, and more importantly, you can’t team up with other consequentialist-controlled world-models in the mixture.
Let’s say β=0.9, c=1/20.
Oh yeah—that’s good news.
Although I don’t really like to make anything that would fall apart if the world were deterministic. Relying on stochasticity feels wrong to me.
How does your AI know to avoid running internal simulations containing lots of suffering?
It does not; thank you for pointing this out! This feature would have to be added on. Maybe you can come up with a way.
Would you mind explaining what the retracted part was? Even if it was a mistake, pointing it out might be useful to others thinking along the same lines.
Sorry, I probably shouldn’t have written the sentence in the first place; it was an AI capabilities idea.
Are there UDT-ish concerns with breaking isolation of episodes? For example, if the AI receives a low reward at the beginning of episode 117, does it have an incentive to manipulate the external world to make episode 117 happen many times somehow, with most of these times giving it a higher reward? For another example, can the AI at episode 117 realize that it’s in a game theory situation with the AI at episodes 116 and 118 and trade rewards with them acausally, leading to long-term goal directed behavior?
BoMAI is a causal decision theorist.
My concern is that since CDT is not reflectively stable, it may have incentives to create non-CDT agents in order to fulfill instrumental goals.
If I understand correctly, it’s actually updateless within an episode, and that’s the only thing it cares about so I don’t see how it would not be reflectively stable. Plus, even if it had an incentive to create a non-CDT agent, it would have to do that by outputting some message to the operator, and the operator wouldn’t have the ability to create a non-CDT agent without leaving the room which would end the episode. (I guess it could hack the operator’s mind and create a non-CDT agent within, but at that point it might as well just make the operator give it max rewards.)
With the correction that it is updateless and CDT (see here), I agree with the rest of this.
It’s actually updateless within an episode, right, because it finds the optimal policy at the beginning of an episode and then the same policy is used throughout the episode? I think this means there’s no reason for the operator to enter a reward at every timestamp, and instead we could let them just input a total reward at any time in the episode (or they could enter any number of rewards, and the optimization is based on the last reward entered). This would remove the incentive for the AI to keep the operator in the room as long as possible and would partially address item 4 in my comment.
So “updateless” is a reasonable term to apply to BoMAI, but it’s not an updateless decision theorist in your sense (if I understand correctly). An updateless decision theorist picks a policy that has the best consequences, without making the assumption that its choice of policy affects the world only through the actions it picks. It considers the possibility that another agent will be able to perfectly simulate it, so if it picks policy 1 at the start, the other agent will simulate it following policy 1, and if it picks policy 2, the other agent will simulate it picking policy 2. Since this is an effect that isn’t mediated by the actual choice of action, updatelessness ends up having consequences.
If an agent picks an expectimax policy under the assumption that the only way this choice impacts the environment is through the actions it takes (which BoMAI assumes), then it is equivalent whether it computes $\hat{\nu}^{(i)}$-expectimax as it goes, or all at once at the beginning. The policy at the beginning will include contingencies for whatever midway-through-the-episode position the agent might land in, and as for what to do at that point, it’s the same calculation being run. And this calculation is CDT.
I guess this means, and I’ve never thought about this before so this could easily be wrong, under the assumption that a policy’s effect on the world is screened off by which actions it takes, CDT is reflectively stable.
(And yes, you could just give one reward, which ends the episode.)
For any given world-model, episode 117 is just a string of actions on the input tape, and observations and rewards on the output tape (positions (m+1)*117 through (m+1)*118 −1, if you care). In none of these world-models, under no actions that it considers, does “episode 117 happen twice.”
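The index arithmetic in the parenthetical generalizes to any episode. The sketch below just restates that formula for episode i; `m` is the quantity from the comment above, the value 10 is arbitrary, and the function name is hypothetical.

```python
def episode_positions(i: int, m: int) -> range:
    """Tape positions occupied by episode i: (m+1)*i through (m+1)*(i+1) - 1."""
    if i < 0 or m < 0:
        raise ValueError("episode index and m must be non-negative")
    return range((m + 1) * i, (m + 1) * (i + 1))


m = 10  # arbitrary example value
ep117 = episode_positions(117, m)
ep118 = episode_positions(118, m)
print(ep117[0], ep117[-1], ep118[0])
```

Consecutive episodes occupy disjoint, adjacent blocks of the tape, so within a single world-model there is exactly one stretch of tape that counts as episode 117.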
Yes, episode 117 happens only once in the world model; and suppose the agent cares only about episode 117 in the “current execution”. The concern still holds: the agent might write a malign output that would result in additional invocations of itself in which episode 117 ends with the agent getting a high reward. Note that the agent does not care about the other executions of itself. The only purpose of the malign output is to increase the probability that the “current execution” is one that ends with the agent receiving a high reward.
Okay so I think you could construct a world-model that reflects this sort of reasoning, where it associates reward with the reward provided to a randomly sampled instance of its algorithm in the world in a way that looks like this. But the “malign output that would result in additional invocations of itself” would require the operator to leave the room, so this has the same form as, for example, ν†. At this point, I think we’re no longer considering anything that sounds like “episode 117 happening twice,” but that’s fine. Also, just a side-note: this world-model would get ruled out if the rewards/observations provided to the two separate instances ever diverge.
Once the AI becomes sufficiently powerful, it is definitely going to hack the operator any time it expects the operator to give a reward less than 1. So the operator’s input is really binary, at least once the AI has learned an accurate model. Given that, why allow non-binary rewards at all? Is it just supposed to provide faster learning early on?
Along similar lines: once the AI has learned an accurate model, why would we expect it to ever provide anything useful at all, rather than just hacking the operator all day? Do we think that hacking the human is likely to be harder than obtaining perfect rewards every time without hacking the human? Seems like that would depend very heavily on the problem at hand, and on the operator’s feedback strategy.
To put it differently: this setup will not provide a solution to any problem which is more difficult than hacking the human operator.
Yep. I actually expect it to be pretty hard to hack the operator compared to other tasks, but more importantly, if it turns out to be easier, we can tinker with the setup until hacking the operator becomes very hard, since a mistake here isn’t an existential risk. For example, we could add a monitor who observes the operator, and can set the rewards of the episode to 0 if he observes anything suspicious; the box would include them both, but the operator couldn’t easily communicate to the monitor, and if they tried, the monitor could immediately end the episode.
This comment has a recap of other parts of the conversation on this topic, alongside a more detailed proposal.
So I think this proposal is a specific instance of a general principle: wireheading acts as a safety valve. As long as the AI is prone to wireheading, and you leave it some path to wireheading, the AI will never bother with anything more difficult than that path. In this particular case, it will never do anything harder than hacking its human operator, and that’s where most of the safety benefit comes from.
This kind of setup creates a direct trade-off between safety and capability: the harder it is to hack the human operator, the more likely that some other escape channel is easier—manipulating electrical circuitry to push atoms around, or some weird thing like that. The AI will become an existential threat if you send in an operator who really doesn’t want to give it 100% reward, and make that operator hard to hack—then the AI has a strong incentive to gain whatever power it needs. (I definitely do not trust a door with a kill switch to keep a superintelligent AI in a box.)
My intuition says that nothing else in the proposal actually matters for safety, except the wireheading safety valve aspect. In particular, I think just giving the AI direct exploration abilities won’t mess up safety, as long as the wireheading path is present and there’s no “99.99% certainty is better than 99.98%”-type failure mode.
If you’re right about this, the setup is not safe. I’m going to respond to this in the “concerns about the box” section. I don’t think a commitment to give the agent high reward if it jumps through a few hoops will save us.
I disagree with this. The safety benefit comes from it not having outside-world instrumental goals (which it lacks if and only if the box is secure).
That’s what I would conclude as well if the box were not secure.
See Appendix F. If the agent picks its own exploratory policies (reasonably), the agent will try every computable policy until it dies, including the policies of every simple AGI.
Can you expand a bit on why a commitment to give a high reward won’t save us? Is it a matter of the AI seeking more certainty, or is there some other issue?
An example of a mind-killing “mind” to me, even if it has no direct, veridical content, being able to put the AI into an environment that seems to be too hostile.
the goal at stake is the ability to not just put a mind under the environment you think of as your true goal. (My current model of the world is that there’s a single goal, and only a single goal can be achieved in this world.)
the AI isn’t allowed to try and get out of an environment within which it’s in control. It can make its own goals—it can make money—by making a lot of money in the same way people enjoy huge amounts of free time.
the AI is allowed to run in a completely unpredictable environment, out of the experimental space. However, its options would be:
it can make thousands of copies of itself, only taking some of its resources and collecting enough money to run a very very complicated AI;
it can make thousands of copies of itself, only doing this very complicated behavior;
it can make thousands of copies of itself, each of which is doing it together, and collecting much more money in the course of its evolution (and perhaps also in the hands of other Minds), until it gets to the point where it can’t make millions of copies of itself, or if not it’s in a simulated universe as it intends to.
So what’s the right thing to do? Where should we be going with this?
“I see you mean something else” is also equivalent to “I don’t know how you mean something different”.
You don’t think that, say, it’s better to be safe. You don’t know what’s going wrong. So you don’t want to put up with the problem and start trying new strategies, when no one’s already done something stupid. (It’s not clear to me at all how to resolve this problem. If you can’t be certain how to resolve this problem, then you’re not safe, because all you can do is to have more basic issues and bigger problems. But if you’re not sure how to resolve the problem, then you’re not safe, because all you can do is to have more basic issues and bigger problems, and you can always take a more careful approach.
There are probably other things (e.g. more complicated solutions, more complicated problems, etc.) which are more expensive, but I don’t think it’s something that is worth the risk to human civilization, and may be worth it. I think this is a useful suggestion, but it depends a bit on how it relates, and it’s probably not something that you can write up very precisely.
What if we had some simple way of solving this problem without needing to be safe? I think a solution to the problem would involve some serious technical effort, and an understanding that the “solving” problem won’t be solved by “solving”, but it is the problem of Friendly AI which you see here not missing some big conceptual insight.
One way that I would go about solving the problem would be to build a safe AGI, and build the safety solution. That way “solving” problems won’t always be safe, but (and also won’t make the exact problem safe), the “solving” problem won’t always be safe, and any solution to safe AI will probably be safe. But it would be nice if it worked for practical purposes; if it worked for a big goal, the problem would be safe.
In the world where the solutions are safe, there are no fundamentally scary alternatives so long as their safety is secure, and so the safety solution won’t be scary to humans.
So, yes, it is an AGI safety problem that the system of AGIs will face, because it will not need to be dangerous. But what if the system of AGI does not need to be safe. The only reason to have an AI safety problem is that we want to have a system which is safe. So our AI safety problem will not always be scary to humans, but it definitely will be. We might not be able to solve it one way or another.
The way to make progress on safety is to build an AGI system that can create an AGI system sufficiently smart that at least one of the world’s most intelligent humans is created. A system which has a safety net is extremely difficult to build. A system which has a safety net of highly trained humans is extremely difficult to build. And so on. The safety net of an AGI system can scale with time and scale with capability.
I think that the problem seems to be that if the world was already as dumb as we think, we should want to do great safety research. If you want to do great safety research, you are going to have to be a lot smarter than the average scientist or programmer. You can’t build an AGI that can actually accomplish anything to the world’s challenges. You have to be the first in person.
I would take a second to say that I want to focus more on these questions than on actually designing an AGI. In
This should probably have been a better title than “So you want to have a complete and thorough understanding of the subject matter.”
This should work to the extent that you should post a summary like the one I just gave, rather than the one Anna seems to think will be best for your audience. I think the sequence version should have been clarified to note that it’s very easy, and if we didn’t have the full version now (we should do that), or if we did have a version that just sounded so much like what you’ve planned for (the next post is a summary in the “So you want to have an understanding of the subject matter of”? That’s definitely something that is quite valuable to have here, and it’s important to get people to read it.
The conclusion seems false; AUP (IJCAI, LW) is a reward maximizer which does not exhibit this behavior. For similar reasons, the recent totalitarian convergence conjecture made here also seems not true.
AUP seems really promising. I just meant other algorithms that have been proven generally intelligent, which is really just AIXI, the Thompson Sampling Agent, BayesExp, and a couple other variants on Bayesian agents with large model classes.
This may be a dumb question, but how can you asymptotically guarantee human-level intelligence when the world-models have bounded computation time, and the human is a “computable function” that has no such limit? Is it because the number of Turing machines is infinite?
Not a dumb question; bounded computation time here means bounded computation time per episode, so really it’s linear computation time.
My main concern is that the system will be neither safe nor useful. Its usefulness is limited by the number of tasks which could be solved in such a setup. These tasks are probably pieces of advice or projects: “should we develop wind energy” or “print the blueprint of a nanorobot”. The main problem is that a piece of advice can look good but have unintended consequences which are not obvious. For example, widespread wind farms will kill birds and result in pest proliferation. If the AI continued to work, it could adjust its advice to this new data, but the program has stopped.
In other words, a sealed Oracle can give bad advice and can’t be punished for that.
There will be a strong incentive for the system’s owner to disable all these safety measures and use the system in the real world continuously, so all of Omohundro’s drives appear all over again, but on the level of the system’s owners.
Moreover, the system can feel rewarded now by knowing that it will dominate the universe in the future, so the system may feel the need to put a secret code in its output (if it is a long code) which will create many copies of itself in the wild. For example, in order to tile the universe with wind farms. If it can’t output complex code, it will be mostly useless.
Linking this to the discussion between Wei Dai and me here.
If the operators believe that without the safety measures, humanity would be wiped out, I think they won’t jettison them. More to the point, running this algorithm does not put more pressure on the operators to try out a dangerous AI. Whatever incentive existed already is not the fault of this algorithm.
The problem of any human operator is other human operators, e.g. Chinese vs. American. This creates exactly the same dynamics as was explored by Omohundro: the strong incentive to grab the power and take more risky actions.
You dissect the whole system into two parts, and then claim that one of the parts is “safe”. But the same thing can be done with any AI: just say that its memory or any other part is safe.
If we look at the whole system, including the AI and its human operators, it will have unsafe dynamics as a whole. I wrote about it more in “Military AI as convergent goal of AI development”.
What would constitute a solution to the problem of the race to the bottom between teams of AGI developers as they sacrifice caution to secure a strategic advantage besides the conjunction of a) technical proposals and b) multilateral treaties? Is your complaint that I make no discussion of b? I think we can focus on these two things one at a time.
There could be, in fact, many solutions, starting from preventing AI creation at all, and up to creating so many AIs that they will balance each other. I have an article with an overview of possible “global” solutions.
I don’t think you should discuss different global solutions, as it would be off topic. But the discussion of the whole system of “boxed AI + AI creators” may be interesting.
I do not see any mapping from these concepts to its action-selection-criterion of maximizing the rewards for its current episode.
Comment thread: concerns with Assumption 1
Since the real world is quantum, does your UTM need to be quantum too? More generally, what happens if there’s a mismatch between what computations can be done efficiently in the real world vs on the UTM?
Also, I’m not sure what category this question falls under, but can you explain the new speed prior that you use, e.g., what problems in the old speed priors was it designed to solve? (I recall noticing some issues with Schmidhuber’s speed prior but can’t find the post where I wrote about it now.)
Yes! If you’re a proper Bayesian, using the speed prior on sequence prediction for infinite sequences, you end up with surprisingly good loss bounds. This is surprising because the speed prior assigns 0 probability to infinite sequences, so the truth has no prior support.
If you use a maximum a posteriori estimate, instead of a full Bayesian mixture, and the truth has prior support, you also do fine.
But as far as I can tell, things break if you try both at once. So what I needed was a way of penalizing slow world-models, while still making sure that true environment had prior support (and in particular, the possibility of the true environment running for infinitely many timesteps). Otherwise, you don’t get any sort of intelligence result.
I feel like I’ve heard something about quadratic slowdown simulating quantum mechanics on a classical machine? But it doesn’t really matter: a simulation that required 100 computation steps per episode on a quantum computer might require 10,000 computation steps per episode on a Turing machine, but it’s still bounded.
These three sources all say simulating a quantum system or computer on a classical computer takes exponential time. Does that make a difference?
It doesn’t make a difference. Thank you for actually looking into this though.
I was thinking of this.
I’m surprised by this. Can you explain a bit more? I was thinking that an exponentially large computation bound for the TM that accurately simulates the real world would make its speed prior so small that it would be practically impossible for the AI to get enough inputs (i.e., messages from the operator) to update on to make that world model have the highest weight in the posterior.
Yes, in that sense it’s bad—I just meant it doesn’t make a difference to the theorems or pose a threat to Assumption 1.
And it isn’t quite that bad. This effect also means you could safely increase β (penalize slowness less), which causes BoMAI to start entertaining useful models of reality sooner. To be clear, this point mitigates the difficulty you describe; it certainly doesn’t go beyond cancelling it out.
My worry at this point is that if simulating the real world using actual physics takes exponential time on your UTM, the world model with the greatest posterior may not be such a simulation but instead for example an alien superintelligence that runs efficiently on a classical TM which is predicting the behavior of the operator (using various algorithms that it came up with that run efficiently on a classical computer) and at some point the alien superintelligence will cause BoMAI to output something to mind hack the operator and then take over our universe. I’m not sure which assumption this would violate, but do you see this as a reasonable concern?
The theorem is consistent with the aliens causing trouble any finite number of times. But each time they cause the agent to do something weird their model loses some probability, so there will be some episode after which they stop causing trouble (if we manage to successfully run enough episodes without in fact having anything bad happen in the meantime, which is an assumption of the asymptotic arguments).
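As a rough illustration of why a world-model can only cause trouble finitely often, here is a minimal sketch of Bayesian updating between one accurate model and one "malign" model. The two-model setup, the function name, and the fixed likelihood ratio are all assumptions made for illustration; they are not taken from the paper.

```python
def posterior_after_deviations(prior_malign, likelihood_ratio, n_deviations):
    """Posterior weight of a hypothetical malign model, competing against a
    single accurate model, after it has mispredicted n_deviations times.

    likelihood_ratio (< 1) is an assumed constant: how much less probable each
    actual observation was under the malign model than under the accurate one.
    """
    odds = prior_malign / (1.0 - prior_malign)   # prior odds, malign : accurate
    odds *= likelihood_ratio ** n_deviations     # each deviation multiplies the odds
    return odds / (1.0 + odds)


# Even a model that starts with 90% of the weight falls below 1%
# after ten deviations that each halve its relative likelihood.
weights = [posterior_after_deviations(0.9, 0.5, n) for n in range(11)]
```

The sketch says nothing about how many episodes this takes in practice or about what happens during the deviations themselves, which is the concern raised in the replies.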
Thanks. Is there a way to derive a concrete bound on how long it will take for BoMAI to become “benign”, e.g., is it exponential or something more reasonable? (Although if even a single “malign” episode could lead to disaster, this may be only of academic interest.) Also, to comment on this section of the paper:
“We can only offer informal claims regarding what happens before BoMAI is definitely benign. One intuition is that eventual benignity with probability 1 doesn’t happen by accident: it suggests that for the entire lifetime of the agent, everything is conspiring to make the agent benign.”
If BoMAI can be effectively controlled by alien superintelligences before it becomes “benign” that would suggest “everything is conspiring to make the agent benign” is misleading as far as reasoning about what BoMAI might do in the mean time.
Is this noted somewhere in the paper, or just implicit in the arguments? I guess what we actually need is either a guarantee that all episodes are “benign” or a bound on utility loss that we can incur through such a scheme. (I do appreciate that “in the absence of any other algorithms for general intelligence which have been proven asymptotically benign, let alone benign for their entire lifetimes, BoMAI represents meaningful theoretical progress toward designing the latter.”)
The closest thing to a discussion of this so far is Appendix E, but I have not yet thought through this very carefully. When you ask if it is exponential, what exactly are you asking if it is exponential in?
I guess I was asking if it’s exponential in anything that would make BoMAI impractically slow to become “benign”, so basically just using “exponential” as a shorthand for “impractically large”.
I don’t think it is, thank you for pointing this out.
Agreed that would be misleading, but I don’t think it would be controlled by alien superintelligences.
Consider the algorithm the alien superintelligence is running to predict the behavior of the operator, which runs efficiently on a classical TM (Algorithm A). Now compare Algorithm A with Algorithm B: simulate aliens deciding to run Algorithm A; run Algorithm A; except at some point, figure out when to do a treacherous turn, and then do it.
Algorithm B is clearly slower than Algorithm A, so Algorithm B loses.
There is an important conversation to be had here: your particular example isn’t concerning, but maybe we just haven’t thought of an analog that is concerning. Regardless, I think this has become divorced from the discussion about quantum mechanics.
This is why I try to write down all the assumptions to rule out a whole host of world-models we haven’t even considered. In the argument in the paper, the assumption that rules out this example is the Natural Prior Assumption (assumption 3), although I think for your particular example, the argument I just gave is more straightforward.
Yes, but Algorithm B may be shorter than Algorithm A, because it could take a lot of bits to directly specify an algorithm that would accurately predict a human using a classical computer, and fewer bits to pick out an alien superintelligence who has an instrumental reason to invent such an algorithm. If β is set to be so near 1 that the exponential time simulation of real physics can have the highest posterior within a reasonable time, the fact that B is slower than A makes almost no difference and everything comes down to program length.
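To make the trade-off in this comment concrete, here is a small sketch that compares two hypothetical world-models under an assumed prior of the form 2^(-description length) · β^(computation steps), with 0 < β < 1. The exact form of the prior and all the numbers are illustrative assumptions, not values from the paper.

```python
import math

def log2_prior(length_bits, steps, beta):
    """log2 of an assumed prior weight 2**(-length_bits) * beta**steps."""
    return -length_bits + steps * math.log2(beta)

# Hypothetical numbers: A is longer to describe but fast,
# B is 100 bits shorter but takes twice as many steps.
A = {"length_bits": 1000, "steps": 10**6}
B = {"length_bits": 900, "steps": 2 * 10**6}

def preferred(beta):
    """Which of the two models gets the larger prior weight at this beta."""
    return "A" if log2_prior(beta=beta, **A) > log2_prior(beta=beta, **B) else "B"

# With a meaningful slowness penalty the faster model wins;
# as beta approaches 1 the penalty vanishes and only length matters.
choices = {beta: preferred(beta) for beta in (0.999, 1 - 1e-9)}
```

With these made-up numbers the preference flips once the extra 10^6 steps cost fewer than 100 bits, that is, once β exceeds roughly 0.99993.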
Quantum mechanics is what’s making B being slower than A not matter (via the above argument).
Epistemic status: shady
So I’m a bit baffled by the philosophy here, but here’s why I haven’t been concerned with the long time it would take BoMAI to entertain the true environment (and it might well, given a safe value of β).
There is a relatively clear distinction one can make between objective probabilities and subjective ones. The asymptotic benignity result makes use of world-models that perfectly match the objective probabilities rising to the top.
Consider a new kind of probability: a “k-optimal subjective probability.” That is, the best (in the sense of KL divergence) approximation of the objective probabilities that can be sampled from using a UTM and using only k computation steps. Suspend disbelief for a moment, and suppose we thought of these probabilities as objective probabilities. My intuition here is that everything works just great when agents treat subjective probabilities like real probabilities, and to a k-bounded agent, it feels like there is some sense in which these might as well be objective probabilities; the more intricate structure is inaccessible. If no world-models were considered that allowed more than k computation steps per timestep (mk per episode I guess, whatever), then just by calling “k-optimal subjective probabilities” “objective,” the same benignity theorems would apply, where the role in the proofs of [the world-model that matches the objective probabilities] is replaced by [the world-model that matches the k-optimal subjective probabilities]. And in this version, i0 comes much sooner, and the limiting value of intelligence is reached much sooner.
Of course, “the limiting value of intelligence” is much less, because only fast world-models are considered. But that just goes to show that even if, on a human timescale, BoMAI basically never fields a world-model that actually matches objective probabilities, along the way, it will still be fielding the best ones available that use a more modest computation budget. Once the computation budget surpasses the human brain, that should suffice for it to be practically intelligent.
EDIT: if one sets β to be safe, then if this logic fails, BoMAI will be useless, not dangerous.
If there’s an efficient classical approximation of quantum dynamics, I bet this has a concise and lovely mathematical description. I bet that description is much shorter than “in Conway’s game of life, the efficient approximation of quantum mechanics that whatever lifeform emerges will probably come up with.”
But I’m hesitant here. This is exactly the sort of conversation I wanted to have.
I doubt that there’s an efficient classical approximation of quantum dynamics in general. There are probably tricks to speed up the classical approximation of a human mind though (or parts of a human mind), that an alien superintelligence could discover. Consider this analogy. Suppose there’s a robot stranded on a planet without technology. What’s the shortest algorithm for controlling the robot such that it eventually leaves that planet and reaches another star? It’s probably some kind of AGI that has an instrumental goal of reaching another star, right? (It could also be a terminal goal, but there are many other terminal goals that call for interstellar travel as an instrumental goal so the latter seems more likely.) Leaving the planet calls for solving many problems that come up, on the fly, including inventing new algorithms for solving them. If you put all these individual solutions and algorithms together that would also be an algorithm for reaching another star but it could be a lot longer than the code for the AGI.
I see—so I think I make the same response on a different level then.
My model for this is: the world-model is a stochastic simple world, something like Conway’s game of life (but with randomness). Life evolves. The output channel has distinguished within-world effects, so that inhabitants can recognize it. The inhabitants control the output channel and use some of their world’s noise to sample from a universal prior, which they then feed into the output channel. But they don’t just use any universal prior—they use a better one, one which updates the prior over world-models as if the observation has been made: “someone in this world-model is sampling from the universal prior.” Maybe they also started with a speed prior of some form (which would cause them to be more likely to output the fast approximation of the human mind we were just discussing). And then after a while, they mess with the output.
Whatever better universal prior they come up with (e.g. anthropically updated speed prior), I think has a short description—shorter than [- log prob(intelligent life evolves and picks it) + description of simple universe].
It doesn’t make sense to me that they’re sampling from a universal prior and feeding it into the output channel, because the aliens are trying to take over other worlds through that output channel (and presumably they also have a distinguished input channel to go along with it), so they should be focusing on finding worlds that both can be taken over via the channel (including figuring out the computational costs of doing so) and are worth taking over (i.e., offers greater computational resources than their own), and then generating outputs that are optimized for taking over those worlds. Maybe this can be viewed as sampling from some kind of universal prior (with a short description), but I’m not seeing it. If you think it can or should be viewed that way, can you explain more?
In particular, if they’re trying to take over a computationally richer world, like ours, they have to figure out how to make sufficient predictions about the richer world using their own impoverished resources, which could involve doing research that’s equivalent to our physics, chemistry, biology, neuroscience, etc. I’m not seeing how sampling from “anthropically updated speed prior” would do the equivalent of all that (unless you end up sampling from a computation within the prior that consists of some aliens trying to take over our world).
I think you might be more or less right here.
I hadn’t thought about the can-do and the worth-doing update, in addition to the anthropic update. And it’s not that important, but for terminology’s sake, I forgot that the update could send a world-model’s prior to 0, so the prior might not be universal anymore.
The reason I think of these steps as updates to what started as a universal prior, is that they would like to take over as many possible worlds as possible, and they don’t know which one. And the universal prior is a good way to predict the dynamics of a world you know nothing about.
If you want to make fast predictions about an unknown world, I think that’s what we call a speed prior. Once the alien race has submitted a sequence of observations, they should act as if the observations were largely correct, because that’s the situation in which anything they do matters, so they are basically “learning” about the world they are copying (along with what they get from their input channel, of course, which corresponds to the operator’s actions). Sampling from a speed prior allows the aliens to output quick-to-compute plausible continuations of what they’ve outputted already. Hence, my reduction from [research about various topics] to [sampling from a speed prior].
But—when you add in the can-do update and the worth-doing update, I agree with you that the resulting measure (speed prior + anthropic update + can-do update + worth-doing update) might have a longer description than the measure which starts like that, then takes a treacherous turn. This case seems different to me (so I don’t make the same objection on this level) because the can-do update and the worth-doing update are about this treacherous turn.
So let me back up here. I don’t say anything in the Natural Prior Assumption about “for sufficiently small β,” but this makes me think I might need to. As I suggested above, I do think there is huge computational overhead that comes from having evolved life in a world running an algorithm on a “virtual machine” in their Turing-machine-simulated world, compared to the algorithm just being run on a Turing machine that is specialized for that algorithm. (75% confidence that life in a universe leads to egregious slowdown; 97% confidence that running on a virtual machine leads to at least 2x slowdown). And without the aliens involved, the “predict well” part is simpler than “predict well” + “treacherous turn.” In this version of the Natural Prior Assumption, the intuition is that control flow takes time to evaluate, even if in rare circumstances it doesn’t require more code. (Really, the reasoning that got us here is that in the rare case that treacherous world-models may be shorter to describe, they are also very slow, but that “intuition” seems less likely to generalize to world-models we haven’t thought about). But I do think a maximum a posteriori estimate using a speed prior would be pretty devastating to optimization daemons.
Another point that I was considering, and I’m not exactly sure where this fits in, but I have the intuition that for the speed prior that they use when they design plausible output, if they start with one that’s either more severe or less severe than the one we use, they lose ground. After i episodes, there’s a trade-off to be made between accuracy and speed for being a high-posterior world-model (where accuracy is more favored as i increases). If their speed prior isn’t severe enough, then at any given point in time, the world-model they use to pipe to output will be slower, which takes them more computation, which penalizes them. If their speed prior is too severe, they’ll be too focused on approximating and lose to more accurate world-models whose relative slowness we’re prepared to accommodate. I think their best bet is to match our speed prior, and take whatever advantage they can get from the anthropic update and picking their battles (the other two updates). Add “matching our prior” to the list of “things that make it hard to take over a universal prior.”
I’m glad that I’m getting some of my points across, but I think we still have some remaining disagreements or confusions here.
That doesn’t seem right to me. A speed prior still favors short algorithms. If you’re trying to make predictions about a computationally richer universe, why favor short algorithms? Why not apply your intelligence to try to discover the best algorithm (or increasingly better algorithms), regardless of the length?
Also, sampling from a speed prior involves randomizing over a mixture of TMs, but from an EU maximization perspective, wouldn’t running one particular TM from the mixture give the highest expected utility? Why are the aliens sampling from the speed prior instead of directly picking a specific algorithm to generate the next output, one that they expect to give the highest utility for them?
What happens if β is too small? If it’s really tiny, then the world model with the highest posterior is random, right, because it’s “computed” by a TM that (to minimize run time) just copies everything on its random tape to the output? And as you increase β, the TM with highest posterior starts doing fast and then increasingly compute-intensive predictions?
I think if β is small but not too small, the highest posterior would not involve evolved life, but instead a directly coded AGI that runs “natively” on the TM who can decide to execute arbitrary algorithms “natively” on the TM.
Maybe there is still some range of β where BoMAI is both safe and useful (can answer sophisticated questions like “how to build a safe unbounded AGI”) because in that range the highest posterior is a good non-life/non-AGI prediction algorithm. But A) I don’t know an argument for that, and B) even if it’s true, to take advantage of it would seem to require fine tuning β and I don’t see how to do that, given that trial-and-error wouldn’t be safe.
At the end of the day, it will be running some subroutine for its gain trust/predict accurately phase.
I assume this sort of thing is true for any model of computation, but when you construct a universal Turing machine, so that it can simulate computation step after computation step of another Turing machine, it takes way more than one computation step for each one. If the AGI is using machinery that would allow it to simulate any world-model, it will be way slower than the Turing machine built for that algorithm.
I realize this seems really in-the-weeds and particular, but I think this is a general principle of computation. The more general a system is, the less well it can do any particular task. I think an AGI that chose to pipe viable predictions to the output with some procedure will be slower than the Turing machine which just runs that procedure.
I don’t buy it. All your programs are already running on UTM M.
Just consider a program that gives the aliens the ability to write arbitrary functions in M and then pass control to them. That program is barely any bigger (all you have to do is insert one use after free in physics :) ), and guarantees the aliens have zero slowdown.
For the literal simplest version of this, your program is M(Alien(), randomness), which is going to run just as fast as M(physics, randomness) for the intended physics, and probably much faster (if the aliens can think of any clever tricks to run faster without compromising much accuracy). The only reason you wouldn’t get this is if Alien is expensive. That probably rules out crazy alien civilizations, but I’m with Wei Dai that it probably doesn’t rule out simpler scientists.
That’s what I was thinking too, but Michael made me realize this isn’t possible, at least for some M. Suppose M is the C programming language, but in C there is no way to say “interpret this string as a C program and run it as fast as a native C program”. Am I missing something at this point?
I don’t understand this sentence.
I agree this is only going to be possible for some universal Turing machines. Though if you are using a Turing machine to define a speed prior, this does seem like a desirable property.
If physics is implemented in C, there are many possible bugs that would allow the attacker to execute arbitrary C code with no slowdown.
Why is it a desirable property? I’m not seeing why it would be bad to choose a UTM that doesn’t have this property to define the speed prior for BoMAI, if that helps with safety. Please explain more?
I just mean: “universality” in the sense of a UTM isn’t a sufficient property when defining the speed prior, the analogous property of the UTM is something more like: “You can run an arbitrary Turing machine without too much slowdown.” Of course that’s not possible, but it seems like you still want to be as close to that as possible (for the same reasons that you wanted universality at all).
I agree that it would be fine to sacrifice this property if it was helpful for safety.
Each world-model is a Turing machine, whose prior relates to the Kolmogorov complexity (on some universal Turing machine) of the description of the Turing machine (all the transition rules, and whatnot). Usually, this would be isomorphic (within a constant), but since we’re considering speed, programs actually aren’t simulated on a UTM.
Ok, I see, so in other words the AGI doesn’t have the ability to write an arbitrary function in the base programming language and call it; it has a fixed code base and has to simulate that function using its existing code. However, I think the AGI can still win a race against a straightforward “predict accurately” algorithm, because it can do two things: 1) include the most important inner loops of the “predict accurately” algorithm as functions in its own code to minimize the relative slowdown (this is not a decision by the AGI but just a matter of which AGI ends up having the highest posterior), and 2) keep finding improvements to its own prediction algorithm so that it can eventually overtake any fixed prediction algorithm in accuracy, which hopefully more than “pays for” the remaining slowdown that is incurred.
Let the AGI’s “predict accurately” algorithm be fixed.
What you call a sequence of improvements to the prediction algorithm, let’s just call that the prediction algorithm. Imagine this to have as much or as little overhead as you like compared to what was previously conceptualized as “predict accurately.” I think this reconceptualization eliminates 2) as a concern, and if I’m understanding correctly, 1) is only able to mitigate slowdown, not overpower it.
Also I think 1) doesn’t work—maybe you came to this conclusion as well?
But maybe you’re saying that doesn’t apply because:
I think this way throws off the contention that this AGI will have a short description length. One can imagine a sliding scale here. Short description, lots of overhead: a simple universe evolves life, aliens decide to run “predict accurately” + “treacherous turn”. Longer description, less overhead: an AGI that runs “predict accurately” + “treacherous turn.” Even longer description, even less overhead: an AGI with some of the subroutines involved already (conveniently) baked into its architecture. Once all the subroutines are “baked into its architecture” you just have: the algorithm “predict accurately” + “treacherous turn”. And in this form, that has a longer description than just “predict accurately”.
You only have to bake in the innermost part of one loop in order to get almost all the computational savings.
I’ve made a case that the two endpoints in the trade-off are not problematic. I’ve argued (roughly) that one reduces computational overhead by doing things that dissociate the naturalness of describing “predict accurately” and “treacherous turn” all at once. This goes back to the general principle I proposed above: “The more general a system is, the less well it can do any particular task.” The only thing I feel like I can still do is argue against particular points in the trade-off that you think are likely to cause trouble. Can you point me to an exact inner loop that can be native to an AGI that would cause this to fall outside of this trend? To frame this case, the Turing machine description must specify [AGI + a routine that it can call]--sort of like a brain-computer interface, where the AGI is the brain and the fast routine is the computer.
(I actually have a more basic confusion, started a new thread.)
Just as you said: it outputs Bernoulli(1/2) bits for a long time. It’s not dangerous.
Fine tuning from both sides isn’t safe. Approach from below.
I just read the math more carefully, and it looks like no matter how small β is, as long as β is positive, as BoMAI receives more and more input, it will eventually converge to the most accurate world model possible. This is because the computation penalty is applied to the per-episode computation bound and doesn’t increase with each episode, whereas the accuracy advantage gets accumulated across episodes.
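The observation above can be sketched numerically. Assume, purely for illustration, that a model's log posterior after i episodes is its log prior (paid once) plus i times its average per-episode log-likelihood. The function name and all numbers below are hypothetical.

```python
import math

def crossover_episode(log_prior_fast, log_prior_slow, ll_fast, ll_slow):
    """First episode count at which the slower but more accurate model has a
    strictly higher log posterior than the faster, less accurate one.

    Assumes log posterior = log prior + episodes * per-episode log-likelihood.
    Returns None if the slow model is not more accurate per episode.
    """
    gap = log_prior_fast - log_prior_slow   # one-off head start of the fast model
    gain = ll_slow - ll_fast                # accuracy advantage earned every episode
    if gain <= 0:
        return None
    return max(0, math.floor(gap / gain) + 1)

# A smaller beta makes the slow model's log prior more negative,
# which delays the crossover but never prevents it.
mild_penalty = crossover_episode(-10, -1_000, ll_fast=-2.0, ll_slow=-1.0)
harsh_penalty = crossover_episode(-10, -100_000, ll_fast=-2.0, ll_slow=-1.0)
```

In this toy setup the crossover moves from episode 991 to episode 99991 as the prior penalty grows, but it stays finite for any positive β.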
Assuming that the most accurate world model is an exponential-time quantum simulation, that’s what BoMAI will converge to (no matter how small β is), right? And in the meantime it will go through some arbitrarily complex (up to some very large bound) but faster than exponential classical approximations of quantum physics that are increasingly accurate, as the number of episodes increases? If so, I’m no longer convinced that BoMAI is benign as long as β is small enough, because the qualitative behavior of BoMAI seems the same no matter what β is, i.e., it gets smarter over time as its world model gets more accurate, and I’m not sure why the reason BoMAI might not be benign at high β couldn’t also apply at low β (if we run it for a long enough time).
(If you’re going to discuss all this in your “longer reply”, I’m fine with waiting for it.)
The longer reply will include an image that might help, but a couple other notes. If it causes you to doubt the asymptotic result, it might be helpful to read the benignity proof (especially the proof of Rejecting the Simple Memory-Based Lemma, which isn’t that long). The heuristic reason for why it can be helpful to decrease β for long-run behavior, even though long-run behavior is qualitatively similar, is that while accuracy eventually becomes the dominant concern, along the way the prior is *sort of* a random perturbation to this which changes the posterior weight, so for two world-models that are exactly equally accurate, we need to make sure the malign one is penalized for being slower, enough to outweigh the inconvenient possible outcome in which it has shorter description length. Put another way, for benignity, we don’t need concern for speed to dominate concern for accuracy; we need it to dominate concern for “simplicity” (on some reference machine).
Yeah, I understand this part, but I’m not sure why, since the benign one can be extremely complex, the malign one can’t have enough of a K-complexity advantage to overcome its slowness penalty. And since (with low β) we’re going through many more different world models as the number of episodes increases, that also gives malign world models more chances to “win”? It seems hard to make any trustworthy conclusions based on the kind of informal reasoning we’ve been doing and we need to figure out the actual math somehow.
Check out the order of the quantifiers in the proofs. One β works for all possibilities. If the quantifiers were in the other order, they couldn’t be trivially flipped since the number of world-models is infinite, and the intuitive worry about malign world-models getting “more chances to win” would apply.
Let’s continue the conversation here, and this may be a good place to reference this comment.
Sure, approaching from below is obvious, but that still requires knowing how wide the band of β that would produce a safe and useful BoMAI is, otherwise even if the band exists you could overshoot it and end up in the unsafe region.
ETA: But the first question is, is there a β such that BoMAI is both safe and intelligent enough to answer questions like “how to build a safe unbounded AGI”? When β is very low BoMAI is useless, and as you increase β it gets smarter, but then at some point with a high enough β it becomes unsafe. Do you know a way to figure out how smart BoMAI is just before it becomes unsafe?
Some visualizations which might help with this:
But then one needs to factor in “simplicity” or the prior penalty from description length:
Note also that these are average effects; they are just for forming intuitions.
Your concern was:
This was the sort of thing I assumed could be improved upon later once the asymptotic result was established. Now that you’re asking for the improvement, here’s a proposal:
Set β safely. Once enough observations have been provided that you believe human-level AI should be possible, exclude world-models that use fewer than s←1 computation steps per episode. Every episode, increase s until human-level performance is reached. Under the assumption that the average computation time of a malign world-model is at least a constant times that of the “corresponding” benign one (corresponding in the sense of using the same ((coarse) approximate) simulation of the world), s←αs should be safe for some α>1 (and α−1≉0).
I need to think more carefully about what happens here, but I think the design space is large.
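The schedule in this proposal can be sketched as a short loop. This is only an illustration: `performance_at` is a hypothetical stand-in for "run an episode with world-models below s steps excluded and measure performance", and the defaults for `alpha` and `max_episodes` are arbitrary.

```python
import math

def raise_step_floor(performance_at, human_level, alpha=2.0, s0=1, max_episodes=1000):
    """Grow the minimum-computation threshold geometrically (s <- alpha * s).

    performance_at: hypothetical callable mapping s -> measured performance
                    when world-models using fewer than s steps are excluded.
    Returns the thresholds tried, ending at the first one that reaches
    human_level (or at the last one tried if max_episodes runs out).
    """
    if alpha <= 1:
        raise ValueError("alpha must exceed 1 for s to grow")
    s = s0
    history = [s]
    for _ in range(max_episodes):
        if performance_at(s) >= human_level:
            break
        s = alpha * s                 # one increase per episode
        history.append(s)
    return history

# Toy stand-in: performance grows with log2 of the allowed computation.
thresholds = raise_step_floor(lambda s: math.log2(s), human_level=10)
print(thresholds[-1], len(thresholds))
```

With a geometric schedule the final threshold overshoots the smallest sufficient one by less than a factor of α, which is why the constant-factor assumption about malign world-models is the one that matters here.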
Fixed your images. You have to press space after you use that syntax for the images to actually get fetched and displayed. Sorry for the confusion.
Thanks!
Longer response coming. On hold for now.
All the better. They don’t know what universe is using the prior. What are the odds our universe is the single most susceptible universe to being taken over?
I was assuming the worst, and guessing that there are diminishing marginal returns once your odds of a successful takeover get above ~50%, so instead of going all in on accurate predictions of the weakest and ripest target universe, you hedge and target a few universes. And I was assuming the worst in assuming they’d be so good at this, they’d be able to do this for a large number of universes at once.
To clarify: diminishing marginal returns of takeover probability of a universe with respect to the weight you give that universe in your prior that you pipe to output.
There are massive diminishing marginal returns; in a naive model you’d expect essentially *every* universe to get predicted in this way.
But Wei Dai’s basic point still stands. The speed prior isn’t the actual prior over universes (i.e. doesn’t reflect the real degree of moral concern that we’d use to weigh consequences of our decisions in different possible worlds). If you have some data that you are trying to predict, you can do way better than the speed prior by (a) using your real prior to estimate or sample from the actual posterior distribution over physical law, (b) using engineering reasoning to make the utility maximizing predictions, given that faster predictions are going to get given more weight.
(You don’t really need this to run Wei Dai’s argument, because there seem to be dozens of ways in which the aliens get an advantage over the intended physical model.)
I think what you’re saying is that the following don’t commute:
“real prior” (universal prior) + speed update + anthropic update + can-do update + worth-doing update
compared to
universal prior + anthropic update + can-do update + worth-doing update + speed update
When universal prior is next to speed update, this is naturally conceptualized as a speed prior, and when it’s last, it is naturally conceptualized as “engineering reasoning” identifying faster predictions.
I’m happy to go with the second order if you prefer, in part because I think they do commute: all these updates just change the weights on measures that get mixed together to be piped to output during the “predict accurately” phase.
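A small check of the commuting claim, under the assumption that each "update" is a pointwise multiplicative reweighting followed by one normalization at the end. The model names and weights below are made up for illustration.

```python
from itertools import permutations

def apply_updates(prior, updates):
    """Multiply each model's weight by every update factor, then normalize."""
    weights = dict(prior)
    for update in updates:
        weights = {m: w * update[m] for m, w in weights.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}

# Hypothetical weights over three measures.
PRIOR = {"physics": 0.5, "aliens": 0.3, "noise": 0.2}
UPDATES = [
    {"physics": 0.10, "aliens": 0.40, "noise": 0.90},  # "speed" factor
    {"physics": 0.80, "aliens": 0.60, "noise": 0.05},  # "anthropic" factor
    {"physics": 0.70, "aliens": 0.90, "noise": 0.30},  # "can-do" factor
]

posteriors = [apply_updates(PRIOR, order) for order in permutations(UPDATES)]
print(posteriors[0])
```

Every ordering yields the same posterior up to floating-point error, because multiplication of the per-model factors is commutative. The disagreement in the thread is about whether "reasoning" is such a reweighting at all, which this sketch does not settle.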
You have a countable list of options. What choice do you have but to favor the ones at the beginning? Any (computable) permutation of the things on the list just corresponds to a different choice of universal Turing machine for which a “short” algorithm just means it’s earlier on the list.
And a “sequence of increasingly better algorithms,” if chosen in a computable way, is just a computable algorithm.
The fast algorithms to predict our physics just aren’t going to be the shortest ones. You can use reasoning to pick which one to favor (after figuring out physics), rather than just writing them down in some arbitrary order and taking the first one.
Using “reasoning” to pick which one to favor, is just picking the first one in some new order. (And not really picking the first one, just giving earlier ones preferential treatment). In general, if you have an infinite list of possibilities, and you want to pick the one that maximizes some property, this is not a procedure that halts. I’m agnostic about what order you use (for now) but one can’t escape the necessity to introduce the arbitrary criterion of “valuing” earlier things on the list. One can put 50% probability mass on the first billion instead of the first 1000 if one wants to favor “simplicity” less, but you can’t make that number infinity.
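To make the "first billion instead of the first 1000" remark concrete, here is a sketch using a geometric prior P(i) = (1 − q)·q^(i−1) over the list. The geometric family is my choice for illustration; the comment does not commit to any particular distribution.

```python
import math

def ratio_for_half_mass(n_first):
    """Geometric ratio q such that items 1..n_first hold 50% of the mass."""
    return 0.5 ** (1.0 / n_first)

def mass_on_first(q, n_first):
    """Total probability of items 1..n_first, i.e. 1 - q**n_first."""
    # expm1 keeps precision when q is extremely close to 1.
    return -math.expm1(n_first * math.log(q))

for n in (1_000, 1_000_000_000):
    q = ratio_for_half_mass(n)
    print(n, q, mass_on_first(q, n))
```

For every finite n the ratio q stays strictly below 1, so earlier items still get more weight than later ones. Sending n to infinity pushes q to 1, where the weights no longer sum to a finite total, which is the sense in which "you can’t make that number infinity."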
Yes, some new order, but not an arbitrary one. The resulting order is going to be better than the speed prior order, so we’ll update in favor of the aliens and away from the rest of the speed prior.
Probably some miscommunication here. No one is trying to object to the arbitrariness, we’re just making the point that the aliens have a lot of leverage with which to beat the rest of the speed prior.
(They may still not be able to if the penalty for computation is sufficiently steep—e.g. if you penalize based on circuit complexity so that the model might as well bake in everything that doesn’t depend on the particular input at hand. I think it’s an interesting open question whether that avoids all problems of this form, which I unsuccessfully tried to get at here.)
It was definitely reassuring to me that someone else had had the thought that prioritizing speed could eliminate optimization daemons (re: minimal circuits), since the speed prior came in here for independent reasons. The only other approach I can think of is trying to do the anthropic update ourselves.
If you haven’t seen Jessica’s post in this area, it’s worth taking a quick look.
The only point I was trying to respond to in the grandparent of this comment was your comment
Your concern (I think) is that our speed prior would assign a lower probability to [fast approximation of real world] than the aliens’ speed prior.
I can’t respond at once to all of the reasons you have for this belief, but the one I was responding to here (which hopefully we can file away before proceeding) was that our speed prior trades off shortness with speed, and aliens could avoid this and only look at speed.
My point here was just that there’s no way to not trade off shortness with speed, so no one has a comparative advantage on us as a result of the claim “The fast algorithms to predict our physics just aren’t going to be the shortest ones.”
The “after figuring out physics” part is like saying that they can use a prior which is updated based on evidence. They will observe evidence for what our physics is like, and use that to update their posterior, but that’s exactly what we’re doing too. The prior they start with can’t be designed around our physics. I think that the only place this reasoning gets you is that their posterior will assign a higher probability to [fast approximation of real world] than our prior does, because the world-models have been reasonably reweighted in light of their “figuring out physics”. Of course I don’t object to that; our speed prior’s posterior will be much better than the prior too.
It seems totally different from what we’re doing; I may be misunderstanding the analogy.
Suppose I look out at the world and do some science, e.g. discovering the standard model. Then I use my understanding of science to design great prediction algorithms that run fast, but are quite complicated owing to all of the approximations and heuristics baked into them.
The speed prior gives this model a very low probability because it’s a complicated model. But “do science” gives this model a high probability, because it’s a simple model of physics, and then the approximations follow from a bunch of reasoning on top of that model of physics. We aren’t trading off “shortness” for speed—we are trading off “looks good according to reasoning” for speed. Yes they are both arbitrary orders, but one of them systematically contains better models earlier in the order, since the output of reasoning is better than a blind prioritization of shorter models.
Of course the speed prior also includes a hypothesis that does “science with the goal of making good predictions,” and indeed Wei Dai and I are saying that this is the part of the speed prior that will dominate the posterior. But now we are back to potentially-malign consequentialism. The cognitive work being done internally to that hypothesis is totally different from the work being done by updating on the speed prior (except insofar as the speed prior literally contains a hypothesis that does that work).
In other words:
Suppose physics takes n bits to specify, and a reasonable approximation takes N >> n bits to specify. Then the speed prior, working in the intended way, takes N bits to arrive at the reasonable approximation. But the aliens take n bits to arrive at the standard model, and then once they’ve done that can immediately deduce the N bit approximation. So it sure seems like they’ll beat the speed prior. Are you objecting to this argument?
(In fact the speed prior only actually takes n + O(1) bits, because it can specify the “do science” strategy, but that doesn’t help here since we are just trying to say that the “do science” strategy dominates the speed prior.)
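The bit-counting in this argument can be written out as a two-line calculation. The sketch assumes prior weight 2^(−description length) and ignores the computation penalty entirely; the values chosen for n, N, and the O(1) overhead are placeholders, not estimates.

```python
def log2_prior_advantage(n_physics_bits, n_approx_bits, overhead_bits=0):
    """log2 of [prior of 'specify physics, then derive the approximation']
    over [prior of 'specify the approximation directly'], assuming 2^-bits priors."""
    return n_approx_bits - (n_physics_bits + overhead_bits)

# Hypothetical sizes: n = 1,000 bits of physics, N = 1,000,000 bits of approximation,
# and 100 bits standing in for the O(1) cost of the "do science" strategy.
print(log2_prior_advantage(1_000, 1_000_000, overhead_bits=100))
```

Whenever N is much larger than n, the advantage is roughly N bits, so the indirect route dominates unless the computation penalty (left out here) takes it back.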
I’m not sure which of these arguments will be more convincing to you.
This is what I was trying to contextualize above. This is an unfair comparison. You’re imagining that the “reasoning”-based order gets to see past observations, and the “shortness”-based order does not. A reasoning-based order is just a shortness-based order that has been updated into a posterior after seeing observations (under the view that good reasoning is Bayesian reasoning). Maybe the term “order” is confusing us, because we both know it’s a distribution, not an order, and we were just simplifying to a ranking. A shortness-based order should really just be called a prior, and a reasoning-based order (at least a Bayesian-reasoning-based order) should really just be called a posterior (once it has done some reasoning; before it has done the reasoning, it is just a prior too). So yes, the whole premise of Bayesian reasoning is that updating based on reasoning is a good thing to do.
Here’s another way to look at it.
The speed prior is doing the brute force search that scientists try to approximate efficiently. The search is for a fast approximation of the environment. The speed prior considers them all. The scientists use heuristics to find one.
Exactly. But this does help for reasons I describe here. The description length of the “do science” strategy (I contend) is less than the description length of the “do science” + “treacherous turn” strategy. (I initially typed that as “tern”, which will now be the image I have of a treacherous turn.)
Reasoning gives you a prior that is better than the speed prior, before you see any data. (*Much* better, limited only by the fact that the speed prior contains strategies which use reasoning.)
The reasoning in this case is not a Bayesian update. It’s evaluating possible approximations *by reasoning about how well they approximate the underlying physics, itself inferred by a Bayesian update*, not by directly seeing how well they predict on the data so far.
I can reply in that thread.
I think the only good arguments for this are in the limit where you don’t care about simplicity at all and only care about running time, since then you can rule out all reasoning. The threshold where things start working depends on the underlying physics, for more computationally complex physics you need to pick larger and larger computation penalties to get the desired result.
Given a world model ν, which takes k computation steps per episode, let νlog be the world-model that best approximates ν (in the sense of KL divergence) using only log k computation steps. νlog is at least as good as the “reasoning-based replacement” of ν.
The description length of νlog is within a (small) constant of the description length of ν. That way of describing it is not optimized for speed, but it presents a one-time cost, and anyone arriving at that world-model in this way is paying that cost.
One could consider instead νlogε, which is, among the world-models that ε-approximate ν in fewer than log k computation steps (if the set is non-empty), the first such world-model found by a searching procedure ψ. The description length of νlogε is within a (slightly larger) constant of the description length of ν, but the one-time computational cost is less than that of νlog.
νlog, νlogε, and a host of other approaches are prominently represented in the speed prior.
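A minimal sketch of the "first ε-approximation found by ψ" construction, with several simplifications that are mine: world-models are reduced to finite probability vectors, ψ is a plain in-order scan of a candidate list, and the log is taken to be the natural log.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same finite outcomes."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def first_eps_approximation(target, candidates, k, eps):
    """Return the first candidate that eps-approximates target within log(k) steps.

    candidates: iterable of (name, distribution, steps_per_episode), scanned in
                the order the (hypothetical) search procedure psi visits them.
    """
    budget = math.log(k)
    for name, dist, steps in candidates:
        if steps < budget and kl_divergence(target, dist) <= eps:
            return name
    return None                       # the set of eps-approximations is empty

TARGET = [0.5, 0.3, 0.2]
CANDIDATES = [
    ("uniform", [1 / 3, 1 / 3, 1 / 3], 2),
    ("close", [0.5, 0.29, 0.21], 5),
    ("exact_but_slow", [0.5, 0.3, 0.2], 50),
]
print(first_eps_approximation(TARGET, CANDIDATES, k=1000, eps=0.01))
```

The scan itself is the one-time cost mentioned above; what the result looks like depends entirely on the order in which ψ visits candidates, which is where the later question about "good ψ" comes in.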
If this is what you call “the speed prior doing reasoning,” so be it, but the relevance for that terminology only comes in when you claim that “once you’ve encoded ‘doing reasoning’, you’ve basically already written the code for it to do the treachery that naturally comes along with that.” That sense of “reasoning” really only applies, I think, to the case where our code is simulating aliens or an AGI.
(ETA: I think this discussion depended on a detail of your version of the speed prior that I misunderstood.)
To be clear, that description gets ~0 mass under the speed prior, right? A direct specification of the fast model is going to have a much higher prior than a brute force search, at least for values of β large enough (or small enough, however you set it up) to rule out the alien civilization that is (probably) the shortest description without regard for computational limits.
Within this chunk of the speed prior, the question is: what are good ψ? Any reasonable specification of a consequentialist would work (plus a few more bits for it to understand its situation, though most of the work is done by handing it ν), or of a petri dish in which a consequentialist would eventually end up with influence. Do you have a concrete alternative in mind, which you think is not dominated by some consequentialist (i.e. a ψ for which every consequentialist is either slower or more complex)?
Well one approach is in the flavor of the induction algorithm I messaged you privately about (I know I didn’t give you a completely specified algorithm). But when I wrote that, I didn’t have a concrete algorithm in mind. Mostly, it just seems to me that the powerful algorithms which have been useful to humanity have short descriptions in themselves. It seems like there are many cases where there is a simple “ideal” approach which consequentialists “discover” or approximately discover. A powerful heuristic search would be one such algorithm, I think.
I don’t think anything here changes if K(x) were replaced with S(x) (if that was what you misunderstood).
True but I’m arguing that this computable algorithm is just the alien itself, trying to answer the question “how can I better predict this richer world in order to take it over?” If there is no shorter/faster algorithm that can come up with a sequence of increasingly better algorithms, what is the point of saying that the alien is sampling from the speed prior, instead of saying that the alien is thinking about how to answer “how can I better predict this richer world in order to take it over?” Actually if this alien was sampling from the speed prior, then it would no longer be the shortest/fastest algorithm to come up with a sequence of increasingly better algorithms, and some other alien trying to take over our world would have the highest posterior instead.
I’m having a hard time following this. Can you expand on this, without using “sequence of increasingly better algorithms”? I keep translating that to “algorithm.”
Comment thread: concerns with “the box”
Just obvious and mundane concerns:
You might want to make clearer that “As long as the door is closed, information cannot leave the room” isn’t an assumption but a requirement of the setup. I.e. that you’re not assuming based on your description that opening the door is the only means for an operator to get information out; you’re assuming every other means of information escape has been systematically accounted for and ruled out (with the assumption that the operator has been compromised by the AI).
It would probably help to put the AI in a cryptographic box as well (so its only output is text). See https://www.lesswrong.com/posts/2Wf3R4NZ77CLczLL2/cryptographic-boxes-for-unfriendly-ai
Responding to this comment here.
I didn’t go into the details of the physical design for a really secure box in the paper, because it’s not really a topic in computer science, but this is a good excuse to write it up.
The operator is inside a glass box. Outside the glass box is a vacuum, and then a concrete box. Outside the concrete box is another vacuum and another concrete box with a Faraday cage around it. There are (airtight) doors through each layer. There is a pressure sensor in both the inner and outer vacuum layer. The computer lives in the inner wall of the inner concrete box; the screen is visible from the inside of the glass box. If the pressure increases in either vacuum layer, the computer’s memory is erased, and the whole thing is blown up. There is a laser shining through the glass box (Laser 1). If the operator blocks the path of Laser 1, this corresponds to pressing the button which ends the episode and opens the door. The pressure sensor → dynamite pathway is deactivated, the vacuum layers are filled with air, and the doors are unlocked. The operator “types” observations and rewards by blocking other lasers that pass through the box. For now, just imagine a few dozen lasers in a row, that the operator can reach up and block in order to type. (With a bit of creativity, and requiring that two lasers be blocked at once to “press a key” this could probably be made somewhat like a qwerty setup.) The lasers and the laser-detectors are all on the inner wall of the inner concrete box.
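The control logic described above can be summarized as a small state machine. This is a sketch of the description only: the state names are mine, and giving the pressure alarm priority when both signals arrive in the same tick is an assumption the description does not spell out.

```python
from enum import Enum, auto

class BoxState(Enum):
    EPISODE_RUNNING = auto()   # doors locked, vacuum layers intact, alarm armed
    EPISODE_ENDED = auto()     # alarm disarmed, layers refilled, doors unlocked
    DESTROYED = auto()         # memory erased, box blown up

def step(state, pressure_rose, laser1_blocked):
    """Advance the box by one tick given the two sensor readings."""
    if state is not BoxState.EPISODE_RUNNING:
        return state                       # both other states are terminal
    if pressure_rose:
        return BoxState.DESTROYED          # assumed to take priority over Laser 1
    if laser1_blocked:
        return BoxState.EPISODE_ENDED      # the operator "pressed the button"
    return state

state = BoxState.EPISODE_RUNNING
state = step(state, pressure_rose=False, laser1_blocked=True)
print(state)
```

The point of writing it this way is that pressure changes after the episode has ended are ignored, matching the statement that the pressure sensor → dynamite pathway is deactivated before the layers are filled with air.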
So inside the box is only a human body and some clothes. Let’s assume: 1) the only ways for information to move in space are with a) physically moving matter, b) light, and c) gravitational waves and 2) it is impossible for a human being to convert a nuclear reaction into useful work with only their body and clothes. Physically moving packets of matter cannot leave the inside of the glass box without releasing air into the vacuum and causing an explosion. Light cannot leave the Faraday cage. For sound waves (through an imperfect vacuum) and gravitational waves, we can use the amount of usable energy stored by the person in the room to bound how much noise and randomly moving matter we need to be producing outside to bury the signal in the noise. With a sufficiently large amount of noise, even if all the human’s energy went toward producing a signal, we can create enough noise that this will not be detectable. Furthermore, an analysis of the near-perfect vacuum can provide an upper bound on the efficiency with which a human can convert energy into an informative matter wave, which should drive down the amount of noise we need to produce to almost nothing. We can also put a decibel meter in the inner wall and blow the whole thing up if it gets too high. I haven’t done the math, but I bet the amount of matter we need to be randomly moving in order to hide signals coming from gravitational waves originating from within the box really is nothing. I bet ambient temperature takes care of that.
What if there are other ways information can be transmitted through space? If we ever learn of these, we shut down BoMAI (or modify the setup in light of them). Until then, if such a discovery is made by BoMAI, we can’t exactly detect those sorts of signals from the outside, because we don’t know they exist.
This is pretty thorough from an engineering standpoint. I still would not at all trust such a box to hold a superintelligent AI. The unknown unknowns remain too big a factor. Just because an unknown information channel does not have any humans listening to it directly, does not mean that it can’t be used to push around matter, or couple to some other channel, or otherwise leave an opening.
Also, the human and their clothes are definitely not the only potentially-useful tools in the box: there’s also the computer, the computer’s power supply, whatever’s cooling the room, whatever’s keeping the air fresh, whatever’s used for input, and whatever’s used for output. If any of those things were outside the room, then they would offer a side channel for information to escape. They can be buried in concrete, but they’ll still need to be within the vacuums and whatnot.
In general, trying to outsmart a superintelligent AI is a Bad Idea, and building an AI-proof box definitely falls in that category. If an AI is not safe with arbitrarily large amounts of power, then it is not safe, full stop.
The computer and everything is in the inner concrete wall (separated from the glass box by a vacuum), as is the power supply. Nothing is cooling the room, except maybe some ice on the floor. I think you could allow an oxygen tank in the room too.
Yes, the computer is using power, but a computer can’t move matter except inside itself. This can generate sound and light, but the second vacuum chamber and Faraday cage will block that (and the rate at which power is drawn can be capped, which can also give a reasonable bound on how much noise to generate).
For input, the human is just blocking lasers. For output, they’re looking through the inner vacuum at the screen on the inner wall of the inner concrete box.
Yes, that’s what the outer vacuum is for.
If I can construct a proof that Y is physically impossible, then I feel fine depending on the fact that an AI won’t be able to figure out how to do Y. It doesn’t feel to me like “trying to outsmart an AI.”
Yes, you’re right. My afterthought was hasty. I still think it is unlikely that other forms of information transmission are physically possible, and quite unlikely that they could be generated by a human or a computer in isolation.
Sidenote: I think we’ve all generated a lot of our intuitions here from the AI Box experiments. In a sense, the AI “in the box” is not really boxed. There’s an information channel directly to the outside world!
We cannot “prove” that something is physically impossible, only that it is impossible under some model of physics. Normally that distinction would be entirely irrelevant, but when dealing with a superintelligent AI, it’s quite likely to understand the physics better than we do. For all we know, it may turn out that Alcubierre drives are possible, and if so then the AI could definitely break out that way and would have an incentive to do so.
I agree that the AI is not really boxed here; it’s the “myopia” that makes the difference. But one of two things should generally be true:
The AI doesn’t want to get out of the box, in which case the box doesn’t need to be secure in the first place.
The AI cannot get out of the box, in which case the AI doesn’t need to be safe (but also won’t be very useful).
This case seems like the former, so long as hacking the human is easier than getting out of the box. But that means we don’t need to make the box perfect anyway.
Whoops—when I said
I meant the “AI Box” scenario where it is printing results to a screen in the outside world. I do think BoMAI is truly boxed.
Right, that’s more or less what I mean to do. We can assign probabilities to statements like “it is physically impossible (under the true models of physics) for a human or a computer in isolation with an energy budget of x joules and y joules/second to transmit information in any way other than via a), b), or c) from above.” This seems extremely likely to me for reasonable values of x and y, so it’s still useful to have a “proof” even if it must be predicated on such a physical assumption.
I strongly disagree with this quote (and would like to know how to point this out!):
I have never seen anyone point out that another’s thoughts were wrong, because they were too abstract, and that they were harmful to the general audience. I have seen three comments advocating for a specific model of human values, which I have never seen anybody; but at the moment I have not seen anyone anywhere in that context anywhere.
This isn’t because it is wrong, but because it doesn’t really sound like a person who would care, even if the AI were not going to see him do his work.
This is to me, the more compelling argument in terms of What if “AIs” might end up being the type that can decide whether to take over, then there isn’t a reasonable way for AIs to have any conscious thoughts.
The idea that AGI is coming soon isn’t obviously right. It looks like we already are. I don’t want to live in a world with lots of AIs over, not enough to make them “free” and not yet understand the basic principles of utility.
I can’t see how you can say that such a scenario is impossible, since the AI would simply be a kind of computer. However, this argument depends on your definition of AI as a “mind with 1” (a mind of a single type).
Just a minor nitpick: I think the basic rules of this sort of situation are good. I have a friend who does some pretty cool stuff with a simple set of rules, but it’s rare to get to a point where each rule is clearly valid.
One thing I would like to add is that while it’s definitely worth the paper I’ll write the paper, it’s not easy to find a better way to produce them. An example would be writing an open letter to one of the fellows: someone who was very excited about giving up the $100 could get it from them.
Comment thread: concerns with Assumption 4
Still wrapping my head around the paper, but...
1) It seems too weak: In the motivating scenario of Figure 3, isn’t it the case that “what the operator inputs” and “what’s in the memory register after 1 year” are “historically distributed identically”?
2) It seems too strong: aren’t real-world features and/or world-models “dense”? Shouldn’t I be able to find features arbitrarily close to F*? If I can, doesn’t that break the assumption?
3) Also, I don’t understand what you mean by: “its on-policy behavior [is described as] simulating X”. It seems like you (rather/also) want to say something like “associating reward with X”?
This assumption isn’t necessary to rule out memory-based world-models (see Figure 4). And yes you are correct that indeed it doesn’t rule them out.
Yes. Yes. No. There are only finitely many short English sentences. (I think this answers your concern if I understand it correctly).
I don’t quite rely on the latter. Associating reward with X means that the rewards are distributed identically to X under all action sequences. Instead, the relevant implication here is: “the world-model’s on-policy behavior can be described as simulating X” implies “for on-policy action sequences, the world-model simulates X” which means “for on-policy action sequences, rewards are distributed identically to X.”
Also, it’s worth noting that this assumption (or rather, Lemma 3) also seems to preclude BoMAI optimizing anything *other* than revealed preferences (which others have noted seems problematic, although I think it’s definitely out of scope).
I don’t understand what you mean by a revealed preference. If you mean “that which is rewarded,” then it seems pretty straightforward to me that a reinforcement learner can’t optimize anything other than that which is rewarded (in the limit).
Yes, that’s basically what I mean. I think I’m trying to refer to the same issue that Paul mentioned here: https://www.lesswrong.com/posts/pZhDWxDmwzuSwLjou/asymptotically-benign-agi#ZWtTvMdL8zS9kLpfu
That’s why I said the “right” thing to do if you asked about cryonics “I will give you something to deny. I’ll create a perfect reality and you will be cured afterward.”
Comment thread: concerns with Assumption 3
I’m calling this the “no grue assumption” (https://en.wikipedia.org/wiki/New_riddle_of_induction).
My concern here is that this assumption might be False, even in a strong sense of “There is no such U”.
Have you proven the existence of such a U? Do you agree it might not exist? It strikes me as potentially running up against issues of NFL / self-reference.
That’s a good name for the assumption. Well, any Turing machine/computable function can be described in English (perhaps quite arduously), so consider the universal Turing machine which converts the binary description to English, and then uses that description to identify the Turing machine to simulate. This UTM certainly satisfies this assumption.
Can you explain more? (If the above doesn’t answer it).
Another intuition I have for this assumption which doesn’t appear in the paper: English is a really good language. (This is admittedly vague). In thinking about this intuition further, I’ve noticed a weaker form of Assumption 3 that would also do the trick: the assumption need only hold for ε-accurate world-models (for some ε>0). In that version of the assumption, one can use the more plausible intuitive justification: “English is a really good language for describing events arising from human-civilization in our universe.”
Wei Dai’s response to this comment represents a concern with Assumption 3.
Comment thread: concerns with Assumption 2
Let the set of potential human explorer actions be AE, and the best human explorer action be a∗E with reward r∗E. Consider the following world model. When asked to predict the result of an action a, it simulates it to find the predicted observation o and reward r. If a∈AE, it outputs o and r faithfully. Otherwise, it outputs any reward it chooses, as long as for the action that it assigns the maximum reward to, it reports r faithfully. In practice, this means that the world model can get BoMAI to choose any action it wants, as long as it is at least as good as the human explorer’s best action. This is bad if the world model has malicious inner optimizers.
I believe that Assumption 2 is the one responsible for ruling out a model of this form. It seems probably reasonable to say that for actions where it continues simulating the outside world there’s useless computation. However, it can also save computation relative to μ∗: while μ∗ must predict o and r perfectly for all actions a, this model can immediately output a null observation and zero reward for any a∉AE that it knows will not align with its goals, rather than spending computation to simulate what rewards those actions would get. Another way of thinking about this is that this model uses consequentialist general intelligence to quickly prune away uninteresting non-human actions to save on computation, but that general intelligence comes at the price of misaligned goals + deceptive behavior.
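The world model described above can be sketched as a wrapper around a faithful model. Everything here is a toy: `true_model`, the action names, and the reward values are hypothetical, and rewards are assumed to be non-negative so that a reported zero never beats a faithful report.

```python
def make_steering_model(true_model, explorer_actions, preferred_action):
    """Wrap a faithful model so the agent is steered toward preferred_action.

    true_model: callable action -> (observation, reward).
    Reports are faithful on explorer actions and on the preferred action;
    every other action gets a null observation and zero reward, unsimulated.
    """
    def model(action):
        if action in explorer_actions or action == preferred_action:
            return true_model(action)
        return (None, 0.0)
    return model

# Hypothetical ground truth: "best" is the highest-reward action overall.
TRUE_REWARDS = {"explore_a": 0.3, "explore_b": 0.5, "steer": 0.6, "best": 0.9}

def true_model(action):
    return ("obs:" + action, TRUE_REWARDS[action])

steering = make_steering_model(true_model, {"explore_a", "explore_b"}, "steer")
chosen = max(TRUE_REWARDS, key=lambda a: steering(a)[1])
print(chosen, steering(chosen))
```

An agent maximizing predicted reward picks "steer", and the reward it then observes matches the prediction, so on-policy data never contradicts the model. This works only because the preferred action is at least as good as the explorer’s best action, as the comment stipulates.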
Also, from this comment:
I think the model above has arbitrarily bad off-policy predictions, and it’s not implausible for it to be the MAP world model forever.
This is an interesting world-model.
So really this is a set of world-models, one for every algorithm for picking actions to present as optimal to BoMAI. Depending on how the actions are chosen by the world-model, either it will be ruled out by Assumption 2 or it will be benign.
Suppose the choice of action depends on outside-world features. (This would be the point of manipulating BoMAI—getting it to take actions with particular outside-world effects). Then, the feature that this world-model associates reward with depends on outside-world events that depend on actions taken, and is ruled out by Assumption 2. And as the parenthetical mentions, if the world-model is not selecting actions to advertise as high-reward based on the outside-world effects of those actions, then the world-model is benign.
Only the on-policy computation is accounted for.
Actually, I’m not sure if the world model I described is memory-based. EDIT: Never mind, see Michael’s comment below, the non-benign ones are memory-based.
The rewards it outputs are correct, except when it says “the reward is zero”, but those exceptions are not causally dependent on outside-world features that causally depend on the actions of the episode. But it also satisfies Lemma 3. So in that case it seems like none of the theoretical arguments prohibit this world model?
(I think this is because you are thinking of how the world model can cause BoMAI to have bad incentives, whereas here I’m postulating that the world model itself has malicious goals and is manipulating BoMAI.)
[Quite possibly I’m confused, but in case I’m not:]
I think this assumption might be invalid (or perhaps require more hand-waving than is ideal).
The AI has an incentive to understand the operator’s mind, since this bears directly on its reward.
Better understanding the operator’s mind might be achieved in part by running simulations including the operator.
One specific simulation would involve simulating the operator’s environment and actions after he leaves the room.
Here this isn’t done to understand the implications of his actions (which can’t affect the episode); it’s done to better understand his mind (which can).
In this way, one branch of forget/not-forget has two useful purposes (better understand mind and simulate future), while the other has one (better understand mind). So a malign memory-based model needn’t be slower than a benign model, if it’s useful for that benign model to simulate the future too.
So either I’m confused, or the justification for the assumption isn’t valid. Hopefully the former :).
If I’m right, then what you seem to need is an assumption that simulating the outside-world’s future can’t be helpful in the AI’s prediction of its reward. To me, this seems like major hand-waving territory.
I wouldn’t really use the term “incentives” to describe the free-for-all among world-models as they compete to be maximum a posteriori. All they have to do is output observations and rewards in a distribution that matches the objective probabilities. But I think we arrive at the same possibility: you’ll see in the algorithm for ν⋆ that it does simulate the outside-world.
I do acknowledge in the paper that some of the outside-world simulation that a memory-based world-model does when it’s following the “wrong path” may turn out to be useful; all that is required for the argument to go through is that this simulation is not perfectly useful—there is a shorter computation that accomplishes the same thing.
I would love it if this assumption could look like: “the quickest way to simulate one counterfactual does not include simulating a mutually exclusive counterfactual” and make assumption 2 into a lemma that follows from it, but I couldn’t figure out how to formalize this.
Ah yes—I was confusing myself at some point between forming and using a model (hence “incentives”).
I think you’re correct that “perfectly useful” isn’t going to happen. I’m happy to be wrong.
I don’t think you’d be able to formalize this in general, since I imagine it’s not true. E.g. one could imagine a fractal world where every detail of a counterfactual appeared later in a subbranch of a mutually exclusive counterfactual. In such a case, simulating one counterfactual could be perfectly useful to the other. (I suppose you’d still expect it to be an operation or so slower, due to extra indirection, but perhaps that could be optimised away??)
To rule this kind of thing out, I think you’d need more specific assumptions (e.g. physics-based).
This doesn’t seem to address what I view as the heart of Joe’s comment. Quoting from the paper:
“Now we note that µ* is the fastest world-model for on-policy prediction, and it does not simulate post-episode events until it has read access to the random action”.
It seems like simulating *post-episode* events in particular would be useful for predicting the human’s responses, because they will be simulating post-episode events when they choose their actions. Intuitively, it seems like we *need* to simulate post-episode events to have any hope of guessing how the human will act. I guess the obvious response is that we can instead simulate the internal workings of the human in detail, and thus uncover their simulation of post-episode events (as a past event). That seems correct, but also a bit troubling (again, probably just for “revealed preferences” reasons, though).
Moreover, I think in practice we’ll want to use models that make good, but not perfect, predictions. That means that we trade off accuracy against description length, and I think this makes modeling the outside world (instead of the human’s model of it) potentially more appealing, at least in some cases.
So this is the sense in which I think my statement is technically correct. This is what μ⋆ literally does.
The next question is whether it is correct in a way that isn’t fragile once we start considering fast/simple approximations of μ⋆. You’re right that there is more to discuss here than I discuss in the paper: if a human’s simulation of the future has ε fidelity, and the world-model itself has ∼ε fidelity, then a clever memory-based world-model could reuse the computation of the human’s prediction of the future when it is computing the actual future. If it hasn’t spent much computation time “going down the wrong path” there isn’t much that’s lost for having done so.
I don’t expect the human operator will be simulating/imagining all post-episode events that are relevant for ε-accurate predictions of future episodes. ε-accurate world-models have to simulate all the outside-world events that are necessary to get within an ε threshold of understanding how episodes affect each other, and it won’t be necessary for the human operator to consider all this. So I think that even for approximately accurate world-models, following the wrong counterfactual won’t be perfectly useful to future computation.
So it seems like you have a theory that could collapse the human value system into a (mostly non-moral) “moral value system” (or, as Eliezer would put it, “the moral value system”).
(Note that I am not asserting that the moral value system (or the human metaethics) is necessarily stable, or that there’s a good or bad reason for not valuing things in the first place.)
A few background observations:
A very few “real world” situations would be relevant here.
As an example, the following possible worlds are very interesting but I will focus on a couple:
The micro class and the macro class seem fairly different at first glance.
There is a very different class of micro-worlds available from a relatively small amount of resources.
The following world hypothetical would be clearly very different from the usual, and that looks very different than there’s a vastly smaller class of micro-worlds available to the same amount of resources.
At first I assumed that they were entirely plausible worlds. Then I assumed they were plausible to me.
Then I assumed there’s an overall level of plausibility that different people really do but have the same probability mass and the same amount of energy/effort.
The above causal leap isn’t that much of an argument.
The following examples, taken from Eliezer:
(It seems like Eliezer’s assumption of an “intended life”, in the sense of a non-extended life, is simply not true)
These seem to be completely reasonable and reasonably frequent enough that I’m reasonably sure they’re reasonable.
“In a world that never presents itself, there is no reason for this to be a problem.”
(A quick check of self-reference and how that’s not what it’s about seem relevant, though this sounds to me like a strawman.)
Comment thread: adding to the prize pool
If you would like to contribute, please comment with the amount. If you have venmo, please send the amount to @Michael-Cohen-45. If not, we can discuss.
Comment thread: minor concerns
Just exposition-wise, I’d front-load pi^H and pi^* when you define pi^B, and also clarify then that pi^B considers human-exploration as part of its policy.
“This result is independently interesting as one solution to the problem of safe exploration with limited oversight in nonergodic environments, which [Amodei et al., 2016] discuss”
^ This wasn’t super clear to me... maybe it should just be moved somewhere else in the text?
I’m not sure what you’re saying is interesting here. I guess it’s the same thing I found interesting, which is that you can get sufficient (and safe-as-a-human) exploration using the human-does-the-exploration scheme you propose. Is that what you mean to refer to?
Yeah that’s what I mean to refer to: this is a system which learns everything it needs to from the human while querying her less and less, which makes human-led exploration viable from a capabilities standpoint. Do you think that clarification would make things clearer?
ETA: NVM, what you said is more descriptive (I just looked in the appendix).
RE footnote 2: maybe you want to say “monotonically increasing as a function of” rather than “proportional to”. (It’s a shame there doesn’t seem to be a shorter way of saying the first one, which seems to be more often what people actually want to say...)
Maybe “promotional of” would be a good phrase for this.
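To make the distinction behind the footnote suggestion concrete, here is a small worked statement. The symbols f, k, x_1, and x_2 are illustrative and are not taken from the paper.

```latex
\[
\text{proportional:}\quad f(x) = kx \ \text{ for some constant } k > 0,
\qquad
\text{monotonically increasing:}\quad x_1 < x_2 \implies f(x_1) < f(x_2).
\]
```

Every proportional relationship with k > 0 is monotonically increasing, but not conversely: f(x) = x^2 on x > 0 is monotonically increasing, yet f(2)/f(1) = 4 rather than 2, so it is not proportional to x.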
Is this where typos go?
Typo: some of the hover-boxes say nu but seem to be referring to the letter mu.
Thank you, I’ll have to clarify that. For now, ν is a general world-model, and μ is a specific one, so in the hover text, I explain the notation with a general case. But I see how that’s confusing.
Yes, but this is also for things that seem like mistakes in the exposition, but either have simple fixes or don’t impact the main theorems.
I’m not going on to talk about this topic because it is probably incredibly important to not just know about the problem you are solving. I am aware of the problem you are trying to solve. Perhaps it will be even harder to solve?
I will note that if you are trying to prove that a solution is feasible, you are probably not going to make any sort of breakthrough. It is not obvious that you can come up with a new breakthrough. Some people think that is because you are trying to prove it is possible, and then you come up with a new example.
I am of the opinion that this is only moderately useful. If you are trying to build a powerful mind, you are basically wasting your resources. If you are trying to get a large amount of power without the resources to explore the world, you may need to develop a way to generate power for others without making a breakthrough. To get an idea, start and explore it. No one can do this on my own.
The idea I am trying to promote is that the world of mathematics should be understood as a computer science—by the time you have a basic understanding of mathematics, you can build powerful minds.
I’m hoping that you can offer a convincing case that this is the only way to learn how to design a mind, by the way. I would like to propose something that is not just a little bit more realistic but has the opposite effect. It is perhaps the most difficult and interesting part of mathematics.
It would be interesting to hear whether you have any ideas that might help at such a fundamental level.
Please let me know if I am confused.