My worry at this point is that if simulating the real world using actual physics takes exponential time on your UTM, the world model with the greatest posterior may not be such a simulation but instead for example an alien superintelligence that runs efficiently on a classical TM which is predicting the behavior of the operator (using various algorithms that it came up with that run efficiently on a classical computer), and at some point the alien superintelligence will cause BoMAI to output something to mind-hack the operator and then take over our universe. I'm not sure which assumption this would violate, but do you see this as a reasonable concern?
The theorem is consistent with the aliens causing trouble any finite number of times. But each time they cause the agent to do something weird their model loses some probability, so there will be some episode after which they stop causing trouble (if we manage to successfully run enough episodes without in fact having anything bad happen in the meantime, which is an assumption of the asymptotic arguments).
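To make the bookkeeping behind "their model loses some probability" concrete, here is a toy sketch (mine, not the paper's construction): two models predict a binary observation each episode, and the "treacherous" one matches the accurate model except on a few hypothetical deviation episodes. All names and numbers below are made up for illustration.

```python
import math

# Toy Bayesian mixture over two world-models predicting one binary observation
# per episode. The "benign" model assigns the true probability; the
# "treacherous" model copies it except on a few deviation episodes.
TRUE_P = 0.9                          # assumed objective probability of observing 1
DEVIATION_EPISODES = {10, 50, 200}    # hypothetical episodes where treachery happens

log_w = {"benign": math.log(0.5), "treacherous": math.log(0.5)}  # equal priors

for episode in range(300):
    observation = 1                   # for simplicity, the likely outcome is what happens
    p_benign = TRUE_P
    p_treacherous = 0.1 if episode in DEVIATION_EPISODES else TRUE_P
    log_w["benign"] += math.log(p_benign if observation == 1 else 1 - p_benign)
    log_w["treacherous"] += math.log(p_treacherous if observation == 1 else 1 - p_treacherous)

# Normalize log-weights into posterior probabilities.
z = math.log(sum(math.exp(v) for v in log_w.values()))
posterior = {name: math.exp(v - z) for name, v in log_w.items()}
print(posterior)  # each deviation costs the treacherous model a factor of ~9 in relative odds
```

This only illustrates the mechanism by which deviations bleed posterior weight; it says nothing about how costly a single deviation is in the real setup.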
Thanks. Is there a way to derive a concrete bound on how long it will take for BoMAI to become "benign", e.g., is it exponential or something more reasonable? (Although if even a single "malign" episode could lead to disaster, this may be only of academic interest.) Also, to comment on this section of the paper:
"We can only offer informal claims regarding what happens before BoMAI is definitely benign. One intuition is that eventual benignity with probability 1 doesn't happen by accident: it suggests that for the entire lifetime of the agent, everything is conspiring to make the agent benign."
If BoMAI can be effectively controlled by alien superintelligences before it becomes "benign" that would suggest "everything is conspiring to make the agent benign" is misleading as far as reasoning about what BoMAI might do in the mean time.
(if we manage to successfully run enough episodes without in fact having anything bad happen in the meantime, which is an assumption of the asymptotic arguments)
Is this noted somewhere in the paper, or just implicit in the arguments? I guess what we actually need is either a guarantee that all episodes are "benign" or a bound on utility loss that we can incur through such a scheme. (I do appreciate that "in the absence of any other algorithms for general intelligence which have been proven asymptotically benign, let alone benign for their entire lifetimes, BoMAI represents meaningful theoretical progress toward designing the latter.")
Is there a way to derive a concrete bound on how long it will take for BoMAI to become “benign”, e.g., is it exponential or something more reasonable?
The closest thing to a discussion of this so far is Appendix E, but I have not yet thought through this very carefully. When you ask if it is exponential, what exactly are you asking if it is exponential in?
When you ask if it is exponential, what exactly are you asking if it is exponential in?
I guess I was asking if it’s exponential in anything that would make BoMAI impractically slow to become “benign”, so basically just using “exponential” as a shorthand for “impractically large”.
If BoMAI can be effectively controlled by alien superintelligences before it becomes “benign” that would suggest “everything is conspiring to make the agent benign” is misleading as far as reasoning about what BoMAI might do in the mean time.
Agreed that would be misleading, but I don’t think it would be controlled by alien superintelligences.
the world model with the greatest posterior may not be such a simulation but instead for example an alien superintelligence that runs efficiently on a classical TM which is predicting the behavior of the operator
Consider the algorithm the alien superintelligence is running to predict the behavior of the operator, which runs efficiently on a classical TM (Algorithm A). Now compare Algorithm A with Algorithm B: simulate aliens deciding to run Algorithm A; run Algorithm A; except at some point, figure out when to do a treacherous turn, and then do it.
Algorithm B is clearly slower than Algorithm A, so Algorithm B loses.
There is an important conversation to be had here: your particular example isn't concerning, but maybe we just haven't thought of an analog that is concerning. Regardless, I think this has become divorced from the discussion about quantum mechanics.
This is why I try to write down all the assumptions to rule out a whole host of world-models we haven’t even considered. In the argument in the paper, the assumption that rules out this example is the Natural Prior Assumption (assumption 3), although I think for your particular example, the argument I just gave is more straightforward.
Yes, but algorithm B may be shorter than algorithm A, because it could take a lot of bits to directly specify an algorithm that would accurately predict a human using a classical computer, and fewer bits to pick out an alien superintelligence who has an instrumental reason to invent such an algorithm. If β is set to be so near 1 that the exponential time simulation of real physics can have the highest posterior within a reasonable time, the fact that B is slower than A makes almost no difference and everything comes down to program length.
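A schematic way to see the trade-off being argued here (an illustration of the general shape of a length-plus-speed penalty, not the paper's exact prior): suppose each world-model ν is weighted roughly as

$$ w(\nu) \;\propto\; 2^{-\ell(\nu)}\,\beta^{\,c(\nu)}, $$

where ℓ(ν) is its description length and c(ν) its per-episode computation. With β appreciably below 1, c(B) > c(A) makes B lose; with β so close to 1 that even exponential c is barely penalized, every β^{c(ν)} is close to 1 and the comparison collapses to 2^{-ℓ(B)} versus 2^{-ℓ(A)}, which is the program-length comparison in question.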
Regardless, I think this has become divorced from the discussion about quantum mechanics.
Quantum mechanics is what’s making B being slower than A not matter (via the above argument).
If β is set to be so near 1 that the exponential time simulation of real physics can have the highest posterior within a reasonable time...
Epistemic status: shady
So I'm a bit baffled by the philosophy here, but here's why I haven't been concerned with the long time it would take BoMAI to entertain the true environment (and it might well take a long time, given a safe value of β).
There is a relatively clear distinction one can make between objective probabilities and subjective ones. The asymptotic benignity result makes use of world-models that perfectly match the objective probabilities rising to the top.
Consider a new kind of probability: a “k-optimal subjective probability.” That is, the best (in the sense of KL divergence) approximation of the objective probabilities that can be sampled from using a UTM and using only k computation steps. Suspend disbelief for a moment, and suppose we thought of these probabilities as objective probabilities. My intuition here is that everything works just great when agents treat subjective probabilities like real probabilities, and to a k-bounded agent, it feels like there is some sense in which these might as well be objective probabilities; the more intricate structure is inaccessible. If no world-models were considered that allowed more than k computation steps per timestep (mk per episode I guess, whatever), then just by calling “k-optimal subjective probabilities” “objective,” the same benignity theorems would apply, where the role in the proofs of [the world-model that matches the objective probabilities] is replaced by [the world-model that matches the k-optimal subjective probabilities]. And in this version, i0 comes much sooner, and the limiting value of intelligence is reached much sooner.
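One minimal way to formalize the notion sketched above (notation mine, not the paper's): writing μ for the objective environment, define the k-optimal subjective probabilities as

$$ \nu^{*}_{k} \;\in\; \operatorname*{arg\,min}_{\nu \,:\, \mathrm{comp}(\nu)\,\le\, k}\ \mathrm{KL}\!\left(\mu \,\middle\|\, \nu\right), $$

i.e. the best KL-approximation of μ that can be sampled within k computation steps per timestep; the suggestion is that the benignity proofs go through with ν*_k playing the role previously played by the world-model that exactly matches μ.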
Of course, “the limiting value of intelligence” is much less, because only fast world-models are considered. But that just goes to show that even if, on a human timescale, BoMAI basically never fields a world-model that actually matches objective probabilities, along the way, it will still be fielding the best ones available that use a more modest computation budget. Once the computation budget surpasses the human brain, that should suffice for it to be practically intelligent.
EDIT: if one sets β to be safe, then if this logic fails, BoMAI will be useless, not dangerous.
If there’s an efficient classical approximation of quantum dynamics, I bet this has a concise and lovely mathematical description. I bet that description is much shorter than “in Conway’s game of life, the efficient approximation of quantum mechanics that whatever lifeform emerges will probably come up with.”
But I’m hesitant here. This is exactly the sort of conversation I wanted to have.
If there’s an efficient classical approximation of quantum dynamics, I bet this has a concise and lovely mathematical description.
I doubt that there’s an efficient classical approximation of quantum dynamics in general. There are probably tricks to speed up the classical approximation of a human mind though (or parts of a human mind), that an alien superintelligence could discover. Consider this analogy. Suppose there’s a robot stranded on a planet without technology. What’s the shortest algorithm for controlling the robot such that it eventually leaves that planet and reaches another star? It’s probably some kind of AGI that has an instrumental goal of reaching another star, right? (It could also be a terminal goal, but there are many other terminal goals that call for interstellar travel as an instrumental goal so the latter seems more likely.) Leaving the planet calls for solving many problems that come up, on the fly, including inventing new algorithms for solving them. If you put all these individual solutions and algorithms together that would also be an algorithm for reaching another star but it could be a lot longer than the code for the AGI.
I see—so I think I make the same response on a different level then.
My model for this is: the world-model is a stochastic simple world, something like Conway’s game of life (but with randomness). Life evolves. The output channel has distinguished within-world effects, so that inhabitants can recognize it. The inhabitants control the output channel and use some of their world’s noise to sample from a universal prior, which they then feed into the output channel. But they don’t just use any universal prior—they use a better one, one which updates the prior over world-models as if the observation has been made: “someone in this world-model is sampling from the universal prior.” Maybe they also started with a speed prior of some form (which would cause them to be more likely to output the fast approximation of the human mind we were just discussing). And then after a while, they mess with the output.
Whatever better universal prior they come up with (e.g. an anthropically updated speed prior), I think, has a short description—shorter than [- log prob(intelligent life evolves and picks it) + description of simple universe].
It doesn’t make sense to me that they’re sampling from a universal prior and feeding it into the output channel, because the aliens are trying to take over other worlds through that output channel (and presumably they also have a distinguished input channel to go along with it), so they should be focusing on finding worlds that both can be taken over via the channel (including figuring out the computational costs of doing so) and are worth taking over (i.e., offers greater computational resources than their own), and then generating outputs that are optimized for taking over those worlds. Maybe this can be viewed as sampling from some kind of universal prior (with a short description), but I’m not seeing it. If you think it can or should be viewed that way, can you explain more?
In particular, if they’re trying to take over a computationally richer world, like ours, they have to figure out how to make sufficient predictions about the richer world using their own impoverished resources, which could involve doing research that’s equivalent to our physics, chemistry, biology, neuroscience, etc. I’m not seeing how sampling from “anthropically updated speed prior” would do the equivalent of all that (unless you end up sampling from a computation within the prior that consists of some aliens trying to take over our world).
I hadn’t thought about the can-do and the worth-doing update, in addition to the anthropic update. And it’s not that important, but for terminology’s sake, I forgot that the update could send a world-model’s prior to 0, so the prior might not be universal anymore.
The reason I think of these steps as updates to what started as a universal prior, is that they would like to take over as many possible worlds as possible, and they don’t know which one. And the universal prior is a good way to predict the dynamics of a world you know nothing about.
they have to figure out how to make sufficient predictions about the richer world using their own impoverished resources, which could involve doing research that’s equivalent to our physics, chemistry, biology, neuroscience, etc. I’m not seeing how sampling from “anthropically updated speed prior” would do the equivalent of all that
If you want to make fast predictions about an unknown world, I think that’s what we call a speed prior. Once the alien race has submitted a sequence of observations, they should act as if the observations were largely correct, because that’s the situation in which anything they do matters, so they are basically “learning” about the world they are copying (along with what they get from their input channel, of course, which corresponds to the operator’s actions). Sampling from a speed prior allows the aliens to output quick-to-compute plausible continuations of what they’ve outputted already. Hence, my reduction from [research about various topics] to [sampling from a speed prior].
But—when you add in the can-do update and the worth-doing update, I agree with you that the resulting measure (speed prior + anthropic update + can-do update + worth-doing update) might have a longer description than the measure which starts like that, then takes a treacherous turn. This case seems different to me (so I don’t make the same objection on this level) because the can-do update and the worth-doing update are about this treacherous turn.
So let me back up here. I don’t say anything in the Natural Prior Assumption about “for sufficiently small β,” but this makes me think I might need to. As I suggested above, I do think there is huge computational overhead that comes from having evolved life in a world running an algorithm on a “virtual machine” in their Turing-machine-simulated world, compared to the algorithm just being run on a Turing machine that is specialized for that algorithm. (75% confidence that life in a universe leads to egregious slowdown; 97% confidence that running on a virtual machine leads to at least 2x slowdown). And without the aliens involved, the “predict well” part is simpler than “predict well” + “treacherous turn.” In this version of the Natural Prior Assumption, the intuition is that control flow takes time to evaluate, even if in rare circumstances it doesn’t require more code. (Really, the reasoning that got us here is that in the rare case that treacherous world-models may be shorter to describe, they are also very slow, but that “intuition” seems less likely to generalize to world-models we haven’t thought about). But I do think a maximum a posteriori estimate using a speed prior would be pretty devastating to optimization daemons.
Another point that I was considering, and I’m not exactly sure where this fits in, but I have the intuition that for the speed prior that they use when they design plausible output, if they start with one that’s either more severe or less severe than the one we use, they lose ground. After i episodes, there’s a trade-off to be made between accuracy and speed for being a high-posterior world-model (where accuracy is more favored as i increases). If their speed prior isn’t severe enough, then at any given point in time, the world-model they use to pipe to output will be slower, which takes them more computation, which penalizes them. If their speed prior is too severe, they’ll be too focused on approximating and lose to more accurate world-models whose relative slowness we’re prepared to accommodate. I think their best bet is to match our speed prior, and take whatever advantage they can get from the anthropic update and picking their battles (the other two updates). Add “matching our prior” to the list of “things that make it hard to take over a universal prior.”
I’m glad that I’m getting some of my points across, but I think we still have some remaining disagreements or confusions here.
If you want to make fast predictions about an unknown world, I think that’s what we call a speed prior.
That doesn’t seem right to me. A speed prior still favors short algorithms. If you’re trying to make predictions about a computationally richer universe, why favor short algorithms? Why not apply your intelligence to try to discover the best algorithm (or increasingly better algorithms), regardless of the length?
Also, sampling from a speed prior involves randomizing over a mixture of TMs, but from an EU maximization perspective, wouldn’t running one particular TM from the mixture give the highest expected utility? Why are the aliens sampling from the speed prior instead of directly picking a specific algorithm to generate the next output, one that they expect to give the highest utility for them?
I don’t say anything in the Natural Prior Assumption about “for sufficiently small β,” but this makes me think I might need to.
What happens if β is too small? If it’s really tiny, then the world model with the highest posterior is random, right, because it’s “computed” by a TM that (to minimize run time) just copies everything on its random tape to the output? And as you increase β, the TM with highest posterior starts doing fast and then increasingly compute-intensive predictions?
As I suggested above, I do think there is huge computational overhead that comes from having evolved life in a world running an algorithm on a “virtual machine” in their Turing-machine-simulated world, compared to the algorithm just being run on a Turing machine that is specialized for that algorithm.
I think if β is small but not too small, the highest posterior would not involve evolved life, but instead a directly coded AGI that runs “natively” on the TM who can decide to execute arbitrary algorithms “natively” on the TM.
Maybe there is still some range of β where BoMAI is both safe and useful (can answer sophisticated questions like “how to build a safe unbounded AGI”) because in that range the highest posterior is a good non-life/non-AGI prediction algorithm. But A) I don’t know an argument for that, and B) even if it’s true, to take advantage of it would seem to require fine tuning β and I don’t see how to do that, given that trial-and-error wouldn’t be safe.
a directly coded AGI that runs “natively” on the TM who can decide to execute arbitrary algorithms “natively” on the TM.
At the end of the day, it will be running some subroutine for its gain trust/predict accurately phase.
I assume this sort of thing is true for any model of computation, but when you construct a universal Turing machine, so that it can simulate computation step after computation step of another Turing machine, it takes way more than one computation step for each one. If the AGI is using machinery that would allow it to simulate any world-model, it will be way slower than the Turing machine built for that algorithm.
I realize this seems really in-the-weeds and particular, but I think this is a general principle of computation. The more general a system is, the less well it can do any particular task. I think an AGI that chose to pipe viable predictions to the output with some procedure will be slower than the Turing machine which just runs that procedure.
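For reference, the standard quantitative version of the "way more than one computation step" point: a universal machine with a couple of work tapes can simulate T steps of an arbitrary Turing machine in

$$ O(T \log T) $$

steps, with the constant factor depending on the machine being simulated (and the overhead is worse on more restricted universal machines). A machine hard-wired for the algorithm pays none of this, which is the asymmetry being appealed to here.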
If the AGI is using machinery that would allow it to simulate any world-model, it will be way slower than the Turing machine built for that algorithm.
I don’t buy it. All your programs are already running on UTM M.
Just consider a program that gives the aliens the ability to write arbitrary functions in M and then pass control to them. That program is barely any bigger (all you have to do is insert one use after free in physics :) ), and guarantees the aliens have zero slowdown.
For the literal simplest version of this, your program is M(Alien(), randomness), which is going to run just as fast as M(physics, randomness) for the intended physics, and probably much faster (if the aliens can think of any clever tricks to run faster without compromising much accuracy). The only reason you wouldn’t get this is if Alien is expensive. That probably rules out crazy alien civilizations, but I’m with Wei Dai that it probably doesn’t rule out simpler scientists.
Just consider a program that gives the aliens the ability to write arbitrary functions in M and then pass control to them.
That’s what I was thinking too, but Michael made me realize this isn’t possible, at least for some M. Suppose M is the C programming language, but in C there is no way to say “interpret this string as a C program and run it as fast as a native C program”. Am I missing something at this point?
all you have to do is insert one use after free in physics
That’s what I was thinking too, but Michael made me realize this isn’t possible, at least for some M. Suppose M is the C programming language, but in C there is no way to say “interpret this string as a C program and run it as fast as a native C program”. Am I missing something at this point?
I agree this is only going to be possible for some universal Turing machines. Though if you are using a Turing machine to define a speed prior, this does seem like a desirable property.
I don’t understand this sentence.
If physics is implemented in C, there are many possible bugs that would allow the attacker to execute arbitrary C code with no slowdown.
I agree this is only going to be possible for some universal Turing machines. Though if you are using a Turing machine to define a speed prior, this does seem like a desirable property.
Why is it a desirable property? I’m not seeing why it would be bad to choose a UTM that doesn’t have this property to define the speed prior for BoMAI, if that helps with safety. Please explain more?
I just mean: “universality” in the sense of a UTM isn’t a sufficient property when defining the speed prior, the analogous property of the UTM is something more like: “You can run an arbitrary Turing machine without too much slowdown.” Of course that’s not possible, but it seems like you still want to be as close to that as possible (for the same reasons that you wanted universality at all).
I agree that it would be fine to sacrifice this property if it was helpful for safety.
Each world-model is a Turing machine, whose prior relates to the Kolmogorov complexity (on some universal Turing machine) of the description of the Turing machine—all the transition rules, and whatnot. Usually, this would be isomorphic (within a constant), but since we're considering speed, programs actually aren't simulated on a UTM.
Ok, I see, so in other words the AGI doesn't have the ability to write an arbitrary function in the base programming language and call it; it has a fixed code base and has to simulate that function using its existing code. However I think the AGI can still win a race against a straightforward "predict accurately" algorithm, because it can do two things. 1) Include the most important inner loops of the "predict accurately" algorithm as functions in its own code to minimize the relative slowdown (this is not a decision by the AGI but just a matter of which AGI ends up having the highest posterior) and 2) keep finding improvements to its own prediction algorithm so that it can eventually overtake any fixed prediction algorithm in accuracy, which hopefully more than "pays for" the remaining slowdown that is incurred.
Let the AGI’s “predict accurately” algorithm be fixed.
What you call a sequence of improvements to the prediction algorithm, let’s just call that the prediction algorithm. Imagine this to have as much or as little overhead as you like compared to what was previously conceptualized as “predict accurately.” I think this reconceptualization eliminates 2) as a concern, and if I’m understanding correctly, 1) is only able to mitigate slowdown, not overpower it.
Also I think 1) doesn’t work—maybe you came to this conclusion as well?
Suppose M is the C programming language, but in C there is no way to say “interpret this string as a C program and run it as fast as a native C program”.
But maybe you’re saying that doesn’t apply because:
(this is not a decision by the AGI but just a matter of which AGI ends up having the highest posterior)
I think this throws off the contention that this AGI will have a short description length. One can imagine a sliding scale here. Short description, lots of overhead: a simple universe evolves life, aliens decide to run "predict accurately" + "treacherous turn". Longer description, less overhead: an AGI that runs "predict accurately" + "treacherous turn." Even longer description, even less overhead: an AGI with some of the subroutines involved already (conveniently) baked into its architecture. Once all the subroutines are "baked into its architecture" you just have: the algorithm "predict accurately" + "treacherous turn". And in this form, that has a longer description than just "predict accurately".
I’ve made a case that the two endpoints in the trade-off are not problematic. I’ve argued (roughly) that one reduces computational overhead by doing things that dissociate the naturalness of describing “predict accurately” and “treacherous turn” all at once. This goes back to the general principle I proposed above: “The more general a system is, the less well it can do any particular task.” The only thing I feel like I can still do is argue against particular points in the trade-off that you think are likely to cause trouble. Can you point me to an exact inner loop that can be native to an AGI that would cause this to fall outside of this trend? To frame this case, the Turing machine description must specify [AGI + a routine that it can call]--sort of like a brain-computer interface, where the AGI is the brain and the fast routine is the computer.
Just as you said: it outputs Bernoulli(1/2) bits for a long time. It’s not dangerous.
B) even if it’s true, to take advantage of it would seem to require fine tuning β and I don’t see how to do that, given that trial-and-error wouldn’t be safe.
Fine tuning from both sides isn’t safe. Approach from below.
Just as you said: it outputs Bernoulli(1/2) bits for a long time. It’s not dangerous.
I just read the math more carefully, and it looks like no matter how small β is, as long as β is positive, as BoMAI receives more and more input, it will eventually converge to the most accurate world model possible. This is because the computation penalty is applied to the per-episode computation bound and doesn’t increase with each episode, whereas the accuracy advantage gets accumulated across episodes.
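Schematically (a sketch of the shape of this argument, not the paper's exact formula), after n episodes the log-posterior of a world-model ν looks like

$$ \log w_n(\nu) \;\approx\; \underbrace{\log w(\nu) \;+\; (\log\beta)\, c(\nu)}_{\text{does not grow with } n} \;+\; \underbrace{\sum_{i=1}^{n} \log \nu\!\left(\text{episode}_i \mid \text{past}\right)}_{\text{grows with } n}, $$

where c(ν) is the per-episode computation bound being penalized: the description-length and computation penalties are paid once, while any per-episode accuracy advantage accumulates, so for any β > 0 the most accurate world-model eventually dominates.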
Assuming that the most accurate world model is an exponential-time quantum simulation, that’s what BoMAI will converge to (no matter how small β is), right? And in the meantime it will go through some arbitrarily complex (up to some very large bound) but faster than exponential classical approximations of quantum physics that are increasingly accurate, as the number of episodes increase? If so, I’m no longer convinced that BoMAI is benign as long as β is small enough, because the qualitative behavior of BoMAI seems the same no matter what β is, i.e., it gets smarter over time as its world model gets more accurate, and I’m not sure why the reason BoMAI might not be benign at high β couldn’t also apply at low β (if we run it for a long enough time).
(If you’re going to discuss all this in your “longer reply”, I’m fine with waiting for it.)
The longer reply will include an image that might help, but a couple other notes. If it causes you to doubt the asymptotic result, it might be helpful to read the benignity proof (especially the proof of Rejecting the Simple Memory-Based Lemma, which isn’t that long). The heuristic reason for why it can be helpful to decrease β for long-run behavior, even though long-run behavior is qualitatively similar, is that while accuracy eventually becomes the dominant concern, along the way the prior is *sort of* a random perturbation to this which changes the posterior weight, so for two world-models that are exactly equally accurate, we need to make sure the malign one is penalized for being slower, enough to outweigh the inconvenient possible outcome in which it has shorter description length. Put another way, for benignity, we don’t need concern for speed to dominate concern for accuracy; we need it to dominate concern for “simplicity” (on some reference machine).
so for two world-models that are exactly equally accurate, we need to make sure the malign one is penalized for being slower, enough to outweigh the inconvenient possible outcome in which it has shorter description length
Yeah, I understand this part, but I’m not sure why, since the benign one can be extremely complex, the malign one can’t have enough of a K-complexity advantage to overcome its slowness penalty. And since (with low β) we’re going through many more different world models as the number of episodes increases, that also gives malign world models more chances to “win”? It seems hard to make any trustworthy conclusions based on the kind of informal reasoning we’ve been doing and we need to figure out the actual math somehow.
And since (with low β) we’re going through many more different world models as the number of episodes increases, that also gives malign world models more chances to “win”?
Check out the order of the quantifiers in the proofs. One β works for all possibilities. If the quantifiers were in the other order, they couldn’t be trivially flipped since the number of world-models is infinite, and the intuitive worry about malign world-models getting “more chances to win” would apply.
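Spelling out the quantifier point schematically (my paraphrase of the structure being described, not a quotation of the theorem): the proofs establish something of the form

$$ \exists\, \beta \;\; \forall\, \nu\ \text{malign} \;\; \exists\, i_0 \;\; \forall\, i \ge i_0: \ \nu\ \text{does not control episode } i, $$

rather than the weaker ∀ν ∃β version; with infinitely many world-models the two orders are genuinely different, and only the first rules out the "more chances to win" worry with a single choice of β.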
Let’s continue the conversation here, and this may be a good place to reference this comment.
Fine tuning from both sides isn’t safe. Approach from below.
Sure, approaching from below is obvious, but that still requires knowing how wide the band of β that would produce a safe and useful BoMAI is, otherwise even if the band exists you could overshoot it and end up in the unsafe region.
ETA: But the first question is, is there a β such that BoMAI is both safe and intelligent enough to answer questions like “how to build a safe unbounded AGI”? When β is very low BoMAI is useless, and as you increase β it gets smarter, but then at some point with a high enough β it becomes unsafe. Do you know a way to figure out how smart BoMAI is just before it becomes unsafe?
But then one needs to factor in “simplicity” or the prior penalty from description length:
Note also that these are average effects; they are just for forming intuitions.
Your concern was:
is there a β such that BoMAI is both safe and intelligent enough to answer questions like “how to build a safe unbounded AGI” [after a reasonable number of episodes]?
This was the sort of thing I assumed could be improved upon later once the asymptotic result was established. Now that you’re asking for the improvement, here’s a proposal:
Set β safely. Once enough observations have been provided that you believe human-level AI should be possible, exclude world-models that use less than s computation steps per episode, with s initialized to 1. Every episode, increase s until human-level performance is reached. Under the assumption that the average computation time of a malign world-model is at least a constant times that of the "corresponding" benign one (corresponding in the sense of using the same ((coarse) approximate) simulation of the world), the update s ← αs should be safe for some α > 1 (and α − 1 ≉ 0).
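A minimal sketch of the schedule being proposed, assuming hypothetical stand-ins for BoMAI's internals (`world_models` with a `steps_per_episode` field, a `run_episode` routine, and a `performance_is_human_level` check are all made up for illustration):

```python
ALPHA = 1.5  # assumed growth factor: alpha > 1 and not close to 1

def run_with_computation_floor(world_models, run_episode, performance_is_human_level):
    # Beta is assumed to have been set safely elsewhere; here we only manage
    # the computation floor s.
    s = 1.0  # computation-steps-per-episode threshold, initialized to 1
    while True:
        # Exclude world-models that use less than s computation steps per episode,
        # as in the proposal above; the remaining models keep their weights.
        allowed = [m for m in world_models if m.steps_per_episode >= s]
        result = run_episode(allowed)
        if performance_is_human_level(result):
            break
        s *= ALPHA  # s <- alpha * s each episode
    return s
```

This is only meant to pin down the shape of the procedure; whether the exclusion and the choice of α actually preserve safety is exactly the open question in the surrounding discussion.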
I need to think more carefully about what happens here, but I think the design space is large.
Also, sampling from a speed prior involves randomizing over a mixture of TMs, but from an EU maximization perspective, wouldn’t running one particular TM from the mixture give the highest expected utility?
All the better. They don't know what universe is using the prior. What are the odds our universe is the single most susceptible universe to being taken over?
I was assuming the worst, and guessing that there are diminishing marginal returns once your odds of a successful takeover get above ~50%, so instead of going all in on accurate predictions of the weakest and ripest target universe, you hedge and target a few universes. And I was assuming the worst in assuming they’d be so good at this, they’d be able to do this for a large number of universes at once.
To clarify: diminishing marginal returns of takeover probability of a universe with respect to the weight you give that universe in your prior that you pipe to output.
I was assuming the worst, and guessing that there are diminishing marginal returns once your odds of a successful takeover get above ~50%, so instead of going all in on accurate predictions of the weakest and ripest target universe, you hedge and target a few universes.
There are massive diminishing marginal returns; in a naive model you’d expect essentially *every* universe to get predicted in this way.
But Wei Dai's basic point still stands. The speed prior isn't the actual prior over universes (i.e. doesn't reflect the real degree of moral concern that we'd use to weigh consequences of our decisions in different possible worlds). If you have some data that you are trying to predict, you can do way better than the speed prior by (a) using your real prior to estimate or sample from the actual posterior distribution over physical law, and (b) using engineering reasoning to make the utility-maximizing predictions, given that faster predictions are going to be given more weight.
(You don’t really need this to run Wei Dai’s argument, because there seem to be dozens of ways in which the aliens get an advantage over the intended physical model.)
When the universal prior is next to the speed update, this is naturally conceptualized as a speed prior, and when it's last, it is naturally conceptualized as "engineering reasoning" identifying faster predictions.
I'm happy to go with the second order if you prefer, in part because I think they do commute—all these updates just change the weights on measures that get mixed together to be piped to output during the "predict accurately" phase.
If you’re trying to make predictions about a computationally richer universe, why favor short algorithms? Why not apply your intelligence to try to discover the best algorithm (or increasingly better algorithms), regardless of the length?
You have a countable list of options. What choice do you have but to favor the ones at the beginning? Any (computable) permutation of the things on the list just corresponds to a different choice of universal Turing machine for which a “short” algorithm just means it’s earlier on the list.
And a “sequence of increasingly better algorithms,” if chosen in a computable way, is just a computable algorithm.
The fast algorithms to predict our physics just aren’t going to be the shortest ones. You can use reasoning to pick which one to favor (after figuring out physics), rather than just writing them down in some arbitrary order and taking the first one.
You can use reasoning to pick which one to favor (after figuring out physics), rather than just writing them down in some arbitrary order and taking the first one.
Using "reasoning" to pick which one to favor is just picking the first one in some new order. (And not really picking the first one, just giving earlier ones preferential treatment). In general, if you have an infinite list of possibilities, and you want to pick the one that maximizes some property, this is not a procedure that halts. I'm agnostic about what order you use (for now) but one can't escape the necessity to introduce the arbitrary criterion of "valuing" earlier things on the list. One can put 50% probability mass on the first billion instead of the first 1000 if one wants to favor "simplicity" less, but you can't make that number infinity.
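The underlying fact here is just that no probability distribution over a countably infinite set can avoid favoring some initial segment: if

$$ \sum_{n=1}^{\infty} p(n) = 1, $$

then p(n) → 0, so for any ε > 0 only finitely many options have weight above ε. Choosing an "order" is choosing which finitely many options get the bulk of the mass; the 50%-on-the-first-billion example above is one such choice.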
Using "reasoning" to pick which one to favor is just picking the first one in some new order.
Yes, some new order, but not an arbitrary one. The resulting order is going to be better than the speed prior order, so we’ll update in favor of the aliens and away from the rest of the speed prior.
one can’t escape the necessity to introduce the arbitrary criterion of “valuing” earlier things on the list
Probably some miscommunication here. No one is trying to object to the arbitrariness, we’re just making the point that the aliens have a lot of leverage with which to beat the rest of the speed prior.
(They may still not be able to if the penalty for computation is sufficiently steep—e.g. if you penalize based on circuit complexity so that the model might as well bake in everything that doesn’t depend on the particular input at hand. I think it’s an interesting open question whether that avoids all problems of this form, which I unsuccessfully tried to get at here.)
They may still not be able to if the penalty for computation is sufficiently steep
It was definitely reassuring to me that someone else had had the thought that prioritizing speed could eliminate optimization daemons (re: minimal circuits), since the speed prior came in here for independent reasons. The only other approach I can think of is trying to do the anthropic update ourselves.
The only point I was trying to respond to in the grandparent of this comment was your comment
The fast algorithms to predict our physics just aren’t going to be the shortest ones. You can use reasoning to pick which one to favor (after figuring out physics), rather than just writing them down in some arbitrary order and taking the first one.
Your concern (I think) is that our speed prior would assign a lower probability to [fast approximation of real world] than the aliens’ speed prior.
I can’t respond at once to all of the reasons you have for this belief, but the one I was responding to here (which hopefully we can file away before proceeding) was that our speed prior trades off shortness with speed, and aliens could avoid this and only look at speed.
My point here was just that there's no way to not trade off shortness with speed, so no one has a comparative advantage on us as a result of the claim "The fast algorithms to predict our physics just aren't going to be the shortest ones."
The "after figuring out physics" part is like saying that they can use a prior which is updated based on evidence. They will observe evidence for what our physics is like, and use that to update their posterior, but that's exactly what we're doing too. The prior they start with can't be designed around our physics. I think that the only place this reasoning gets you is that their posterior will assign a higher probability to [fast approximation of real world] than our prior does, because the world-models have been reasonably reweighted in light of their "figuring out physics". Of course I don't object to that—our speed prior's posterior will be much better than the prior too.
It seems totally different from what we're doing; I may be misunderstanding the analogy.
Suppose I look out at the world and do some science, e.g. discovering the standard model. Then I use my understanding of science to design great prediction algorithms that run fast, but are quite complicated owing to all of the approximations and heuristics baked into them.
The speed prior gives this model a very low probability because it’s a complicated model. But “do science” gives this model a high probability, because it’s a simple model of physics, and then the approximations follow from a bunch of reasoning on top of that model of physics. We aren’t trading off “shortness” for speed—we are trading off “looks good according to reasoning” for speed. Yes they are both arbitrary orders, but one of them systematically contains better models earlier in the order, since the output of reasoning is better than a blind prioritization of shorter models.
Of course the speed prior also includes a hypothesis that does "science with the goal of making good predictions," and indeed Wei Dai and I are saying that this is the part of the speed prior that will dominate the posterior. But now we are back to potentially-malign consequentialism. The cognitive work being done internally to that hypothesis is totally different from the work being done by updating on the speed prior (except insofar as the speed prior literally contains a hypothesis that does that work).
In other words:
Suppose physics takes n bits to specify, and a reasonable approximation takes N >> n bits to specify. Then the speed prior, working in the intended way, takes N bits to arrive at the reasonable approximation. But the aliens take n bits to arrive at the standard model, and then once they’ve done that can immediately deduce the N bit approximation. So it sure seems like they’ll beat the speed prior. Are you objecting to this argument?
(In fact the speed prior only actually takes n + O(1) bits, because it can specify the “do science” strategy, but that doesn’t help here since we are just trying to say that the “do science” strategy dominates the speed prior.)
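To spell out the bit-counting (schematic, using the n and N just introduced): under a pure description-length weighting, the directly specified approximation gets about 2^{-N} while the "figure out physics, then derive the approximation" strategy gets about 2^{-(n+O(1))}, and

$$ \frac{2^{-(n+O(1))}}{2^{-N}} \;=\; 2^{\,N - n - O(1)} \;\gg\; 1 \quad \text{for } N \gg n, $$

so unless the computation penalty charged to the reasoning step closes that gap, the "do science" hypothesis dominates the directly specified approximation inside the speed prior.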
I’m not sure which of these arguments will be more convincing to you.
Yes they are both arbitrary orders, but one of them systematically contains better models earlier in the order, since the output of reasoning is better than a blind prioritization of shorter models.
This is what I was trying to contextualize above. This is an unfair comparison. You're imagining that the "reasoning"-based order gets to see past observations, and the "shortness"-based order does not. A reasoning-based order is just a shortness-based order that has been updated into a posterior after seeing observations (under the view that good reasoning is Bayesian reasoning). Maybe the term "order" is confusing us, because we both know it's a distribution, not an order, and we were just simplifying to a ranking. A shortness-based order should really just be called a prior, and a reasoning-based order (at least a Bayesian-reasoning-based order) should really just be called a posterior (once it has done some reasoning; before it has done the reasoning, it is just a prior too). So yes, the whole premise of Bayesian reasoning is that updating based on reasoning is a good thing to do.
Here’s another way to look at it.
The speed prior is doing the brute force search that scientists try to approximate efficiently. The search is for a fast approximation of the environment. The speed prior considers them all. The scientists use heuristics to find one.
In fact the speed prior only actually takes n + O(1) bits, because it can specify the “do science” strategy
Exactly. But this does help for reasons I describe here. The description length of the “do science” strategy (I contend) is less than the description length of the “do science” + “treacherous turn” strategy. (I initially typed that as “tern”, which will now be the image I have of a treacherous turn.)
a reasoning-based order (at least a Bayesian-reasoning-based order) should really just be called a posterior
Reasoning gives you a prior that is better than the speed prior, before you see any data. (*Much* better, limited only by the fact that the speed prior contains strategies which use reasoning.)
The reasoning in this case is not a Bayesian update. It’s evaluating possible approximations *by reasoning about how well they approximate the underlying physics, itself inferred by a Bayesian update*, not by directly seeing how well they predict on the data so far.
The description length of the “do science” strategy (I contend) is less than the description length of the “do science” + “treacherous turn” strategy.
I can reply in that thread.
I think the only good arguments for this are in the limit where you don’t care about simplicity at all and only care about running time, since then you can rule out all reasoning. The threshold where things start working depends on the underlying physics, for more computationally complex physics you need to pick larger and larger computation penalties to get the desired result.
Given a world-model ν, which takes k computation steps per episode, let ν_log be the world-model that best approximates ν (in the sense of KL divergence) using only log(k) computation steps. ν_log is at least as good as the "reasoning-based replacement" of ν.
The description length of ν_log is within a (small) constant of the description length of ν. That way of describing it is not optimized for speed, but it presents a one-time cost, and anyone arriving at that world-model in this way is paying that cost.
One could consider instead ν_log^ε, which is, among the world-models that ε-approximate ν in less than log(k) computation steps (if the set is non-empty), the first such world-model found by a searching procedure ψ. The description length of ν_log^ε is within a (slightly larger) constant of the description length of ν, but the one-time computational cost is less than that of ν_log.
ν_log, ν_log^ε, and a host of other approaches are prominently represented in the speed prior.
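A minimal formalization of these two constructions (notation mine): for a world-model ν with per-episode computation k,

$$ \nu_{\log} \;\in\; \operatorname*{arg\,min}_{\nu' \,:\, \mathrm{comp}(\nu') \le \log k} \mathrm{KL}\!\left(\nu \,\middle\|\, \nu'\right), \qquad \nu_{\log}^{\varepsilon} \;=\; \text{the first } \nu' \text{ found by } \psi \text{ with } \mathrm{comp}(\nu') \le \log k \text{ and } \mathrm{KL}\!\left(\nu \,\middle\|\, \nu'\right) \le \varepsilon, $$

with ℓ(ν_log) ≤ ℓ(ν) + O(1) and ℓ(ν_log^ε) ≤ ℓ(ν) + O(1), the constants covering the description of the approximation or search procedure rather than the approximation itself.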
If this is what you call “the speed prior doing reasoning,” so be it, but the relevance for that terminology only comes in when you claim that “once you’ve encoded ‘doing reasoning’, you’ve basically already written the code for it to do the treachery that naturally comes along with that.” That sense of “reasoning” really only applies, I think, to the case where our code is simulating aliens or an AGI.
(ETA: I think this discussion depended on a detail of your version of the speed prior that I misunderstood.)
Given a world-model ν, which takes k computation steps per episode, let ν_log be the world-model that best approximates ν (in the sense of KL divergence) using only log(k) computation steps. ν_log is at least as good as the "reasoning-based replacement" of ν.
The description length of ν_log is within a (small) constant of the description length of ν. That way of describing it is not optimized for speed, but it presents a one-time cost, and anyone arriving at that world-model in this way is paying that cost.
To be clear, that description gets ~0 mass under the speed prior, right? A direct specification of the fast model is going to have a much higher prior than a brute force search, at least for values of β large enough (or small enough, however you set it up) to rule out the alien civilization that is (probably) the shortest description without regard for computational limits.
One could consider instead ν_log^ε, which is, among the world-models that ε-approximate ν in less than log(k) computation steps (if the set is non-empty), the first such world-model found by a searching procedure ψ. The description length of ν_log^ε is within a (slightly larger) constant of the description length of ν, but the one-time computational cost is less than that of ν_log.
Within this chunk of the speed prior, the question is: what are good ψ? Any reasonable specification of a consequentialist would work (plus a few more bits for it to understand its situation, though most of the work is done by handing it ν), or of a petri dish in which a consequentialist would eventually end up with influence. Do you have a concrete alternative in mind, which you think is not dominated by some consequentialist (i.e. a ψ for which every consequentialist is either slower or more complex)?
Do you have a concrete alternative in mind, which you think is not dominated by some consequentialist (i.e. a ψ for which every consequentialist is either slower or more complex)?
Well one approach is in the flavor of the induction algorithm I messaged you privately about (I know I didn’t give you a completely specified algorithm). But when I wrote that, I didn’t have a concrete algorithm in mind. Mostly, it just seems to me that the powerful algorithms which have been useful to humanity have short descriptions in themselves. It seems like there are many cases where there is a simple “ideal” approach which consequentialists “discover” or approximately discover. A powerful heuristic search would be one such algorithm, I think.
(ETA: I think this discussion depended on a detail of your version of the speed prior that I misunderstood.)
I don’t think anything here changes if K(x) were replaced with S(x) (if that was what you misunderstood).
And a “sequence of increasingly better algorithms,” if chosen in a computable way, is just a computable algorithm.
True, but I'm arguing that this computable algorithm is just the alien itself, trying to answer the question "how can I better predict this richer world in order to take it over?" If there is no shorter/faster algorithm that can come up with a sequence of increasingly better algorithms, what is the point of saying that the alien is sampling from the speed prior, instead of saying that the alien is thinking about how to answer "how can I better predict this richer world in order to take it over?" Actually, if this alien were sampling from the speed prior, then it would no longer be the shortest/fastest algorithm to come up with a sequence of increasingly better algorithms, and some other alien trying to take over our world would have the highest posterior instead.
I’m having a hard time following this. Can you expand on this, without using “sequence of increasingly better algorithms”? I keep translating that to “algorithm.”
I guess I was asking if it’s exponential in anything that would make BoMAI impractically slow to become “benign”, so basically just using “exponential” as a shorthand for “impractically large”.
I don’t think it is, thank you for pointing this out.
Agreed that would be misleading, but I don’t think it would be controlled by alien superintelligences.
Consider algorithm the alien superintelligence is running to predict the behavior of the operator which runs efficiently on a classical TM (Algorithm A). Now compare Algorithm A with Algorithm B: simulate aliens deciding to run algorithm A; run algorithm A; except at some point, figure out when to do a treacherous turn, and then do it.
Algorithm B is clearly slower than Algorithm A, so Algorithm B loses.
There is an important conversation to be had here: your particular example isn’t concerning, but maybe we just haven’t thought of an analog that is concerning. Regardless, I think has become divorced from the discussion about quantum mechanics.
This is why I try to write down all the assumptions to rule out a whole host of world-models we haven’t even considered. In the argument in the paper, the assumption that rules out this example is the Natural Prior Assumption (assumption 3), although I think for your particular example, the argument I just gave is more straightforward.
Yes but algorithm B may be shorter than algorithm A, because it could take a lot of bits to directly specify an algorithm that would accurately predict a human using a classical computer, and less bits to pick out an alien superintelligence who has an instrumental reason to invent such an algorithm. If β is set to be so near 1 that the exponential time simulation of real physics can have the highest posterior within a reasonable time, the fact that B is slower than A makes almost no difference and everything comes down to program length.
Quantum mechanics is what’s making B being slower than A not matter (via the above argument).
Epistemic status: shady
So I’m a bit baffled by the philosophy here, but here’s why I haven’t been concerned with the long time it would take BoMAI to entertain the true environment (and it might well, given a safe value of β).
There is relatively clear distinction one can make between objective probabilities and subjective ones. The asymptotic benignity result makes use of world-models that perfectly match the objective probabilities rising to the top.
Consider a new kind of probability: a “k-optimal subjective probability.” That is, the best (in the sense of KL divergence) approximation of the objective probabilities that can be sampled from using a UTM and using only k computation steps. Suspend disbelief for a moment, and suppose we thought of these probabilities as objective probabilities. My intuition here is that everything works just great when agents treat subjective probabilities like real probabilities, and to a k-bounded agent, it feels like there is some sense in which these might as well be objective probabilities; the more intricate structure is inaccessible. If no world-models were considered that allowed more than k computation steps per timestep (mk per episode I guess, whatever), then just by calling “k-optimal subjective probabilities” “objective,” the same benignity theorems would apply, where the role in the proofs of [the world-model that matches the objective probabilities] is replaced by [the world-model that matches the k-optimal subjective probabilities]. And in this version, i0 comes much sooner, and the limiting value of intelligence is reached much sooner.
Of course, “the limiting value of intelligence” is much less, because only fast world-models are considered. But that just goes to show that even if, on a human timescale, BoMAI basically never fields a world-model that actually matches objective probabilities, along the way, it will still be fielding the best ones available that use a more modest computation budget. Once the computation budget surpasses the human brain, that should suffice for it to be practically intelligent.
EDIT: if one sets β to be safe, then if this logic fails, BoMAI will be useless, not dangerous.
If there’s an efficient classical approximation of quantum dynamics, I bet this has a concise and lovely mathematical description. I bet that description is much shorter than “in Conway’s game of life, the efficient approximation of quantum mechanics that whatever lifeform emerges will probably come up with.”
But I’m hesitant here. This is exactly the sort of conversation I wanted to have.
I doubt that there’s an efficient classical approximation of quantum dynamics in general. There are probably tricks to speed up the classical approximation of a human mind though (or parts of a human mind), that an alien superintelligence could discover. Consider this analogy. Suppose there’s a robot stranded on a planet without technology. What’s the shortest algorithm for controlling the robot such that it eventually leaves that planet and reaches another star? It’s probably some kind of AGI that has an instrumental goal of reaching another star, right? (It could also be a terminal goal, but there are many other terminal goals that call for interstellar travel as an instrumental goal so the latter seems more likely.) Leaving the planet calls for solving many problems that come up, on the fly, including inventing new algorithms for solving them. If you put all these individual solutions and algorithms together that would also be an algorithm for reaching another star but it could be a lot longer than the code for the AGI.
I see—so I think I make the same response on a different level then.
My model for this is: the world-model is a stochastic simple world, something like Conway’s game of life (but with randomness). Life evolves. The output channel has distinguished within-world effects, so that inhabitants can recognize it. The inhabitants control the output channel and use some of their world’s noise to sample from a universal prior, which they then feed into the output channel. But they don’t just use any universal prior—they use a better one, one which updates the prior over world-models as if the observation has been made: “someone in this world-model is sampling from the universal prior.” Maybe they also started with a speed prior of some form (which would cause them to be more likely to output the fast approximation of the human mind we were just discussing). And then after a while, they mess with the output.
Whatever better universal prior they come up with (e.g. anthropically updated speed prior), I think has a short description—shorter than [- log prob(intelligent life evolves and picks it) + description of simple universe].
It doesn’t make sense to me that they’re sampling from a universal prior and feeding it into the output channel. The aliens are trying to take over other worlds through that output channel (and presumably they also have a distinguished input channel to go along with it), so they should be focusing on finding worlds that both can be taken over via the channel (including figuring out the computational costs of doing so) and are worth taking over (i.e., offer greater computational resources than their own), and then generating outputs that are optimized for taking over those worlds. Maybe this can be viewed as sampling from some kind of universal prior (with a short description), but I’m not seeing it. If you think it can or should be viewed that way, can you explain more?
In particular, if they’re trying to take over a computationally richer world, like ours, they have to figure out how to make sufficient predictions about the richer world using their own impoverished resources, which could involve doing research that’s equivalent to our physics, chemistry, biology, neuroscience, etc. I’m not seeing how sampling from “anthropically updated speed prior” would do the equivalent of all that (unless you end up sampling from a computation within the prior that consists of some aliens trying to take over our world).
I think you might be more or less right here.
I hadn’t thought about the can-do and the worth-doing update, in addition to the anthropic update. And it’s not that important, but for terminology’s sake, I forgot that the update could send a world-model’s prior to 0, so the prior might not be universal anymore.
The reason I think of these steps as updates to what started as a universal prior is that they would like to take over as many worlds as possible, and they don’t know which one. And the universal prior is a good way to predict the dynamics of a world you know nothing about.
If you want to make fast predictions about an unknown world, I think that’s what we call a speed prior. Once the alien race has submitted a sequence of observations, they should act as if the observations were largely correct, because that’s the situation in which anything they do matters, so they are basically “learning” about the world they are copying (along with what they get from their input channel, of course, which corresponds to the operator’s actions). Sampling from a speed prior allows the aliens to output quick-to-compute plausible continuations of what they’ve outputted already. Hence, my reduction from [research about various topics] to [sampling from a speed prior].
But—when you add in the can-do update and the worth-doing update, I agree with you that the resulting measure (speed prior + anthropic update + can-do update + worth-doing update) might have a longer description than the measure which starts like that, then takes a treacherous turn. This case seems different to me (so I don’t make the same objection on this level) because the can-do update and the worth-doing update are about this treacherous turn.
So let me back up here. I don’t say anything in the Natural Prior Assumption about “for sufficiently small β,” but this makes me think I might need to. As I suggested above, I do think there is huge computational overhead that comes from having evolved life in a world running an algorithm on a “virtual machine” in their Turing-machine-simulated world, compared to the algorithm just being run on a Turing machine that is specialized for that algorithm. (75% confidence that life in a universe leads to egregious slowdown; 97% confidence that running on a virtual machine leads to at least 2x slowdown). And without the aliens involved, the “predict well” part is simpler than “predict well” + “treacherous turn.” In this version of the Natural Prior Assumption, the intuition is that control flow takes time to evaluate, even if in rare circumstances it doesn’t require more code. (Really, the reasoning that got us here is that in the rare cases where treacherous world-models are shorter to describe, they are also very slow, but that “intuition” seems less likely to generalize to world-models we haven’t thought about.) But I do think a maximum a posteriori estimate using a speed prior would be pretty devastating to optimization daemons.
Another point that I was considering, and I’m not exactly sure where this fits in, but I have the intuition that for the speed prior that they use when they design plausible output, if they start with one that’s either more severe or less severe than the one we use, they lose ground. After i episodes, there’s a trade-off to be made between accuracy and speed for being a high-posterior world-model (where accuracy is more favored as i increases). If their speed prior isn’t severe enough, then at any given point in time, the world-model they use to pipe to output will be slower, which takes them more computation, which penalizes them. If their speed prior is too severe, they’ll be too focused on approximating and lose to more accurate world-models whose relative slowness we’re prepared to accommodate. I think their best bet is to match our speed prior, and take whatever advantage they can get from the anthropic update and picking their battles (the other two updates). Add “matching our prior” to the list of “things that make it hard to take over a universal prior.”
I’m glad that I’m getting some of my points across, but I think we still have some remaining disagreements or confusions here.
That doesn’t seem right to me. A speed prior still favors short algorithms. If you’re trying to make predictions about a computationally richer universe, why favor short algorithms? Why not apply your intelligence to try to discover the best algorithm (or increasingly better algorithms), regardless of the length?
Also, sampling from a speed prior involves randomizing over a mixture of TMs, but from an EU maximization perspective, wouldn’t running one particular TM from the mixture give the highest expected utility? Why are the aliens sampling from the speed prior instead of directly picking a specific algorithm to generate the next output, one that they expect to give the highest utility for them?
What happens if β is too small? If it’s really tiny, then the world model with the highest posterior is random, right, because it’s “computed” by a TM that (to minimize run time) just copies everything on its random tape to the output? And as you increase β, the TM with highest posterior starts doing fast and then increasingly compute-intensive predictions?
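To make my mental picture here concrete, here’s a toy calculation. I’m paraphrasing the prior as something of the form 2^(−description length) · β^(per-episode computation), which is my own rough rendering rather than the paper’s exact definition, and the four candidate world-models and their numbers below are invented purely for illustration:

```python
import math

# Toy world-models: (name, description_bits, per_episode_compute_steps,
# per-episode log-likelihood of the observations so far).
# All numbers are made up for illustration only.
MODELS = [
    ("copy random tape",        5, 1e0, math.log(0.50)),   # fast, inaccurate
    ("cheap heuristic",        50, 1e3, math.log(0.90)),
    ("human-level approx",   5000, 1e6, math.log(0.99)),
    ("exact simulation",      200, 1e9, math.log(0.999)),  # short but very slow
]

def log_posterior(bits, steps, loglik, beta, episodes):
    """log of 2^-bits * beta^steps * likelihood^episodes (my paraphrase of a speed prior)."""
    return -bits * math.log(2) + steps * math.log(beta) + episodes * loglik

def map_model(beta, episodes):
    return max(MODELS, key=lambda m: log_posterior(m[1], m[2], m[3], beta, episodes))[0]

for beta in (1e-12, 1e-3, 0.9999):
    print(beta, [map_model(beta, i) for i in (1, 100, 100000)])
```

The pattern this prints (the random-tape copier winning at tiny β until enough episodes accumulate, then fast approximations, then more compute-intensive accurate models as β and the episode count grow) is the qualitative behavior I have in mind.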
I think if β is small but not too small, the highest posterior would not involve evolved life, but instead a directly coded AGI that runs “natively” on the TM who can decide to execute arbitrary algorithms “natively” on the TM.
Maybe there is still some range of β where BoMAI is both safe and useful (can answer sophisticated questions like “how to build a safe unbounded AGI”) because in that range the highest posterior is a good non-life/non-AGI prediction algorithm. But A) I don’t know an argument for that, and B) even if it’s true, to take advantage of it would seem to require fine tuning β and I don’t see how to do that, given that trial-and-error wouldn’t be safe.
At the end of the day, it will be running some subroutine for its gain trust/predict accurately phase.
I assume this sort of thing is true for any model of computation, but when you construct a universal Turing machine, so that it can simulate computation step after computation step of another Turing machine, it takes way more than one computation step for each one. If the AGI is using machinery that would allow it to simulate any world-model, it will be way slower than the Turing machine built for that algorithm.
I realize this seems really in-the-weeds and particular, but I think this is a general principle of computation. The more general a system is, the less well it can do any particular task. I think an AGI that chose to pipe viable predictions to the output with some procedure will be slower than the Turing machine which just runs that procedure.
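As a toy illustration of that principle (entirely my own toy, nothing to do with the paper’s machinery): even simulating a three-instruction machine costs the host several operations per simulated step, while the specialized native version of the same computation pays roughly one per unit of work.

```python
# Toy illustration of interpretive overhead: a general-purpose interpreter
# pays several host operations per simulated step; the native loop does not.

def run_native(n):
    total, steps = 0, 0
    for i in range(1, n + 1):
        total += i
        steps += 1
    return total, steps

def run_interpreted(n):
    # The same sum computed on a tiny register machine: each simulated step
    # costs a fetch, a dispatch, and an execute in the host language.
    program = [
        ("add", "acc", "i"),   # acc += i
        ("inc", "i"),          # i += 1
        ("jle", "i", n, 0),    # if i <= n: jump to instruction 0
    ]
    regs = {"acc": 0, "i": 1}
    pc, host_ops, sim_steps = 0, 0, 0
    while pc < len(program):
        instr = program[pc]; host_ops += 1          # fetch
        op = instr[0];       host_ops += 1          # dispatch
        if op == "add":
            regs[instr[1]] += regs[instr[2]]; pc += 1
        elif op == "inc":
            regs[instr[1]] += 1; pc += 1
        elif op == "jle":
            pc = instr[3] if regs[instr[1]] <= instr[2] else pc + 1
        host_ops += 1                               # execute
        sim_steps += 1
    return regs["acc"], sim_steps, host_ops

print(run_native(1000))       # (500500, 1000)
print(run_interpreted(1000))  # (500500, 3000, 9000): ~3 host ops per simulated step
```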
I don’t buy it. All your programs are already running on UTM M.
Just consider a program that gives the aliens the ability to write arbitrary functions in M and then pass control to them. That program is barely any bigger (all you have to do is insert one use after free in physics :) ), and guarantees the aliens have zero slowdown.
For the literal simplest version of this, your program is M(Alien(), randomness), which is going to run just as fast as M(physics, randomness) for the intended physics, and probably much faster (if the aliens can think of any clever tricks to run faster without compromising much accuracy). The only reason you wouldn’t get this is if Alien is expensive. That probably rules out crazy alien civilizations, but I’m with Wei Dai that it probably doesn’t rule out simpler scientists.
That’s what I was thinking too, but Michael made me realize this isn’t possible, at least for some M. Suppose M is the C programming language, but in C there is no way to say “interpret this string as a C program and run it as fast as a native C program”. Am I missing something at this point?
I don’t understand this sentence.
I agree this is only going to be possible for some universal Turing machines. Though if you are using a Turing machine to define a speed prior, this does seem like a desirable property.
If physics is implemented in C, there are many possible bugs that would allow the attacker to execute arbitrary C code with no slowdown.
Why is it a desirable property? I’m not seeing why it would be bad to choose a UTM that doesn’t have this property to define the speed prior for BoMAI, if that helps with safety. Please explain more?
I just mean: “universality” in the sense of a UTM isn’t a sufficient property when defining the speed prior, the analogous property of the UTM is something more like: “You can run an arbitrary Turing machine without too much slowdown.” Of course that’s not possible, but it seems like you still want to be as close to that as possible (for the same reasons that you wanted universality at all).
I agree that it would be fine to sacrifice this property if it was helpful for safety.
Each world-model is a Turing machine, whose prior relates to the Kolmogorov complexity (on some universal Turing machine) of the description of that Turing machine—all the transition rules, and whatnot. Usually, this would be isomorphic (within a constant), but since we’re considering speed, programs actually aren’t simulated on a UTM.
Ok, I see, so in other words the AGI doesn’t have the ability to write an arbitrary function in the base programming language and call it; it has a fixed code base and has to simulate that function using its existing code. However, I think the AGI can still win a race against a straightforward “predict accurately” algorithm, because it can do two things. 1) Include the most important inner loops of the “predict accurately” algorithm as functions in its own code to minimize the relative slowdown (this is not a decision by the AGI but just a matter of which AGI ends up having the highest posterior), and 2) keep finding improvements to its own prediction algorithm so that it can eventually overtake any fixed prediction algorithm in accuracy, which hopefully more than “pays for” the remaining slowdown that is incurred.
Let the AGI’s “predict accurately” algorithm be fixed.
What you call a sequence of improvements to the prediction algorithm, let’s just call that the prediction algorithm. Imagine this to have as much or as little overhead as you like compared to what was previously conceptualized as “predict accurately.” I think this reconceptualization eliminates 2) as a concern, and if I’m understanding correctly, 1) is only able to mitigate slowdown, not overpower it.
Also I think 1) doesn’t work—maybe you came to this conclusion as well?
But maybe you’re saying that doesn’t apply because:
I think framing it this way undermines the contention that this AGI will have a short description length. One can imagine a sliding scale here. Short description, lots of overhead: a simple universe evolves life, and the aliens decide to run “predict accurately” + “treacherous turn”. Longer description, less overhead: an AGI that runs “predict accurately” + “treacherous turn.” Longer description still, even less overhead: an AGI with some of the subroutines involved already (conveniently) baked into its architecture. Once all the subroutines are “baked into its architecture,” you just have the algorithm “predict accurately” + “treacherous turn”. And in this form, that has a longer description than just “predict accurately”.
You only have to bake in the innermost part of one loop in order to get almost all the computational savings.
I’ve made a case that the two endpoints in the trade-off are not problematic. I’ve argued (roughly) that one reduces computational overhead by doing things that dissociate the naturalness of describing “predict accurately” and “treacherous turn” all at once. This goes back to the general principle I proposed above: “The more general a system is, the less well it can do any particular task.” The only thing I feel like I can still do is argue against particular points in the trade-off that you think are likely to cause trouble. Can you point me to an exact inner loop that can be native to an AGI that would cause this to fall outside of this trend? To frame this case, the Turing machine description must specify [AGI + a routine that it can call]--sort of like a brain-computer interface, where the AGI is the brain and the fast routine is the computer.
(I actually have a more basic confusion, started a new thread.)
Just as you said: it outputs Bernoulli(1/2) bits for a long time. It’s not dangerous.
Fine tuning from both sides isn’t safe. Approach from below.
I just read the math more carefully, and it looks like no matter how small β is, as long as β is positive, as BoMAI receives more and more input, it will eventually converge to the most accurate world model possible. This is because the computation penalty is applied to the per-episode computation bound and doesn’t increase with each episode, whereas the accuracy advantage gets accumulated across episodes.
Assuming that the most accurate world model is an exponential-time quantum simulation, that’s what BoMAI will converge to (no matter how small β is), right? And in the meantime it will go through some arbitrarily complex (up to some very large bound) but faster than exponential classical approximations of quantum physics that are increasingly accurate, as the number of episodes increase? If so, I’m no longer convinced that BoMAI is benign as long as β is small enough, because the qualitative behavior of BoMAI seems the same no matter what β is, i.e., it gets smarter over time as its world model gets more accurate, and I’m not sure why the reason BoMAI might not be benign at high β couldn’t also apply at low β (if we run it for a long enough time).
(If you’re going to discuss all this in your “longer reply”, I’m fine with waiting for it.)
The longer reply will include an image that might help, but a couple other notes. If it causes you to doubt the asymptotic result, it might be helpful to read the benignity proof (especially the proof of Rejecting the Simple Memory-Based Lemma, which isn’t that long). The heuristic reason for why it can be helpful to decrease β for long-run behavior, even though long-run behavior is qualitatively similar, is that while accuracy eventually becomes the dominant concern, along the way the prior is *sort of* a random perturbation to this which changes the posterior weight, so for two world-models that are exactly equally accurate, we need to make sure the malign one is penalized for being slower, enough to outweigh the inconvenient possible outcome in which it has shorter description length. Put another way, for benignity, we don’t need concern for speed to dominate concern for accuracy; we need it to dominate concern for “simplicity” (on some reference machine).
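To put a formula on that heuristic (with the prior paraphrased as w(ν) ∝ 2^(−ℓ(ν)) β^(T(ν)), where ℓ is description length and T is per-episode computation, which is my shorthand here rather than the exact definition): for two world-models that are exactly equally accurate, the likelihood terms cancel, so the posterior ratio is just

$$
\frac{w(\nu_{\mathrm{malign}} \mid \mathrm{data})}{w(\nu_{\mathrm{benign}} \mid \mathrm{data})}
\;=\;
2^{\,\ell(\nu_{\mathrm{benign}}) - \ell(\nu_{\mathrm{malign}})}\;
\beta^{\,T(\nu_{\mathrm{malign}}) - T(\nu_{\mathrm{benign}})},
$$

which does not grow with the number of episodes. Making β small enough makes the second factor beat any description-length advantage the malign model might have, which is the sense in which speed only needs to dominate “simplicity,” not accuracy.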
Yeah, I understand this part, but I’m not sure why, since the benign one can be extremely complex, the malign one can’t have enough of a K-complexity advantage to overcome its slowness penalty. And since (with low β) we’re going through many more different world models as the number of episodes increases, that also gives malign world models more chances to “win”? It seems hard to make any trustworthy conclusions based on the kind of informal reasoning we’ve been doing and we need to figure out the actual math somehow.
Check out the order of the quantifiers in the proofs. One β works for all possibilities. If the quantifiers were in the other order, they couldn’t be trivially flipped since the number of world-models is infinite, and the intuitive worry about malign world-models getting “more chances to win” would apply.
Let’s continue the conversation here, and this may be a good place to reference this comment.
Sure, approaching from below is obvious, but that still requires knowing how wide the band of β that would produce a safe and useful BoMAI is, otherwise even if the band exists you could overshoot it and end up in the unsafe region.
ETA: But the first question is, is there a β such that BoMAI is both safe and intelligent enough to answer questions like “how to build a safe unbounded AGI”? When β is very low BoMAI is useless, and as you increase β it gets smarter, but then at some point with a high enough β it becomes unsafe. Do you know a way to figure out how smart BoMAI is just before it becomes unsafe?
Some visualizations which might help with this:
But then one needs to factor in “simplicity” or the prior penalty from description length:
Note also that these are average effects; they are just for forming intuitions.
Your concern was:
This was the sort of thing I assumed could be improved upon later once the asymptotic result was established. Now that you’re asking for the improvement, here’s a proposal:
Set β safely. Once enough observations have been provided that you believe human-level AI should be possible, exclude world-models that use fewer than s computation steps per episode, starting with s ← 1. Every episode, increase s until human-level performance is reached. Under the assumption that the average computation time of a malign world-model is at least a constant times that of the “corresponding” benign one (corresponding in the sense of using the same ((coarse) approximate) simulation of the world), the update s ← αs should be safe for some α > 1 (with α − 1 not vanishingly small).
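A minimal sketch of just that schedule (the performance check is a hypothetical stub, and nothing here is meant as more than an illustration of the bookkeeping):

```python
def compute_floor_schedule(num_episodes, alpha=1.5, s_init=1.0,
                           reached_human_level=lambda episode: episode >= 20):
    """Per-episode compute floor s: world-models using fewer than s computation
    steps per episode are excluded. s is raised geometrically (s <- alpha * s)
    each episode until the (stubbed) human-level performance check passes."""
    assert alpha > 1
    s, schedule = s_init, []
    for episode in range(num_episodes):
        schedule.append(s)
        if not reached_human_level(episode):
            s *= alpha
    return schedule

print(compute_floor_schedule(25)[:3], compute_floor_schedule(25)[-1])
```

The bookkeeping is trivial; the substantive content is the choice of α, which is where the constant-factor assumption above does the work.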
I need to think more carefully about what happens here, but I think the design space is large.
Fixed your images. You have to press space after you use that syntax for the images to actually get fetched and displayed. Sorry for the confusion.
Thanks!
Longer response coming. On hold for now.
All the better. They don’t know what universe is using the prior. What are the odds our universe is the single most susceptible universe to being taken over?
I was assuming the worst, and guessing that there are diminishing marginal returns once your odds of a successful takeover get above ~50%, so instead of going all in on accurate predictions of the weakest and ripest target universe, you hedge and target a few universes. And I was assuming the worst in assuming they’d be so good at this, they’d be able to do this for a large number of universes at once.
To clarify: diminishing marginal returns of takeover probability of a universe with respect to the weight you give that universe in your prior that you pipe to output.
There are massive diminishing marginal returns; in a naive model you’d expect essentially *every* universe to get predicted in this way.
But Wei Dai’s basic point still stands. The speed prior isn’t the actual prior over universes (i.e. doesn’t reflect the real degree of moral concern that we’d use to weigh consequences of our decisions in different possible worlds). If you have some data that you are trying to predict, you can do way better than the speed prior by (a) using your real prior to estimate or sample from the actual posterior distribution over physical law, and (b) using engineering reasoning to make the utility-maximizing predictions, given that faster predictions are going to be given more weight.
(You don’t really need this to run Wei Dai’s argument, because there seem to be dozens of ways in which the aliens get an advantage over the intended physical model.)
I think what you’re saying is that the following don’t commute:
“real prior” (universal prior) + speed update + anthropic update + can-do update + worth-doing update
compared to
universal prior + anthropic update + can-do update + worth-doing update + speed update
When the universal prior is next to the speed update, this is naturally conceptualized as a speed prior, and when the speed update comes last, it is naturally conceptualized as “engineering reasoning” identifying faster predictions.
I’m happy to go with the second order if you prefer, in part because I think they do commute—all these updates just change the weights on measures that get mixed together to be piped to output during the “predict accurately” phase.
You have a countable list of options. What choice do you have but to favor the ones at the beginning? Any (computable) permutation of the things on the list just corresponds to a different choice of universal Turing machine for which a “short” algorithm just means it’s earlier on the list.
And a “sequence of increasingly better algorithms,” if chosen in a computable way, is just a computable algorithm.
The fast algorithms to predict our physics just aren’t going to be the shortest ones. You can use reasoning to pick which one to favor (after figuring out physics), rather than just writing them down in some arbitrary order and taking the first one.
Using “reasoning” to pick which one to favor is just picking the first one in some new order (and not really picking the first one, just giving earlier ones preferential treatment). In general, if you have an infinite list of possibilities, and you want to pick the one that maximizes some property, this is not a procedure that halts. I’m agnostic about what order you use (for now), but one can’t escape the necessity of introducing the arbitrary criterion of “valuing” earlier things on the list. One can put 50% probability mass on the first billion instead of the first 1000 if one wants to favor “simplicity” less, but you can’t make that number infinity.
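The underlying fact I’m leaning on (a standard one, just restated in symbols): for any probability distribution w over a countably infinite list of programs and any ε > 0, there is a finite N with

$$
\sum_{i=1}^{N} w(i) \;\ge\; 1 - \varepsilon,
$$

so every prior, however the list is reordered, concentrates almost all of its mass on some finite initial segment; the only freedom is which finite segment that is.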
Yes, some new order, but not an arbitrary one. The resulting order is going to be better than the speed prior order, so we’ll update in favor of the aliens and away from the rest of the speed prior.
Probably some miscommunication here. No one is trying to object to the arbitrariness, we’re just making the point that the aliens have a lot of leverage with which to beat the rest of the speed prior.
(They may still not be able to if the penalty for computation is sufficiently steep—e.g. if you penalize based on circuit complexity so that the model might as well bake in everything that doesn’t depend on the particular input at hand. I think it’s an interesting open question whether that avoids all problems of this form, which I unsuccessfully tried to get at here.)
It was definitely reassuring to me that someone else had had the thought that prioritizing speed could eliminate optimization daemons (re: minimal circuits), since the speed prior came in here for independent reasons. The only other approach I can think of is trying to do the anthropic update ourselves.
If you haven’t seen Jessica’s post in this area, it’s worth taking a quick look.
The only point I was trying to respond to in the grandparent of this comment was your comment
Your concern (I think) is that our speed prior would assign a lower probability to [fast approximation of real world] than the aliens’ speed prior.
I can’t respond at once to all of the reasons you have for this belief, but the one I was responding to here (which hopefully we can file away before proceeding) was that our speed prior trades off shortness with speed, and aliens could avoid this and only look at speed.
My point here was just that there’s no way not to trade off shortness with speed, so no one has a comparative advantage over us as a result of the claim “The fast algorithms to predict our physics just aren’t going to be the shortest ones.”
The “after figuring out physics” part is like saying that they can use a prior which is updated based on evidence. They will observe evidence for what our physics is like, and use that to update their posterior, but that’s exactly what we’re doing too. The prior they start with can’t be designed around our physics. I think that the only place this reasoning gets you is that their posterior will assign a higher probability to [fast approximation of real world] than our prior does, because the world-models have been reasonably reweighted in light of their “figuring out physics”. Of course I don’t object to that—our speed prior’s posterior will be much better than the prior too.
It seems totally different from what we’re doing, I may be misunderstanding the analogy.
Suppose I look out at the world and do some science, e.g. discovering the standard model. Then I use my understanding of science to design great prediction algorithms that run fast, but are quite complicated owing to all of the approximations and heuristics baked into them.
The speed prior gives this model a very low probability because it’s a complicated model. But “do science” gives this model a high probability, because it’s a simple model of physics, and then the approximations follow from a bunch of reasoning on top of that model of physics. We aren’t trading off “shortness” for speed—we are trading off “looks good according to reasoning” for speed. Yes they are both arbitrary orders, but one of them systematically contains better models earlier in the order, since the output of reasoning is better than a blind prioritization of shorter models.
Of course the speed prior also includes a hypothesis that does “science with the goal of making good predictions,” and indeed Wei Dai and I are saying that this is the part of the speed prior that will dominate the posterior. But now we are back to potentially-malign consequentialism. The cognitive work being done internally to that hypothesis is totally different from the work being done by updating on the speed prior (except insofar as the speed prior literally contains a hypothesis that does that work).
In other words:
Suppose physics takes n bits to specify, and a reasonable approximation takes N >> n bits to specify. Then the speed prior, working in the intended way, takes N bits to arrive at the reasonable approximation. But the aliens take n bits to arrive at the standard model, and then once they’ve done that can immediately deduce the N bit approximation. So it sure seems like they’ll beat the speed prior. Are you objecting to this argument?
(In fact the speed prior only actually takes n + O(1) bits, because it can specify the “do science” strategy, but that doesn’t help here since we are just trying to say that the “do science” strategy dominates the speed prior.)
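In rough symbols (ignoring the computation term, which both hypotheses pay to produce the same fast approximation):

$$
\text{prior weight of the direct $N$-bit approximation} \;\approx\; 2^{-N},
\qquad
\text{prior weight of ``$n$-bit physics + reasoning''} \;\approx\; 2^{-(n+O(1))},
$$

so the ratio is on the order of 2^(N−n) in favor of the hypothesis that does the reasoning internally.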
I’m not sure which of these arguments will be more convincing to you.
This is what I was trying to contextualize above. This is an unfair comparison. You’re imagining that the “reasoning”-based order gets to see past observations, and the “shortness”-based order does not. A reasoning-based order is just a shortness-based order that has been updated into a posterior after seeing observations (under the view that good reasoning is Bayesian reasoning). Maybe the term “order” is confusing us, because we both know it’s a distribution, not an order, and we were just simplifying to a ranking. A shortness-based order should really just be called a prior, and a reasoning-based order (at least a Bayesian-reasoning-based order) should really just be called a posterior (once it has done some reasoning; before it has done the reasoning, it is just a prior too). So yes, the whole premise of Bayesian reasoning is that updating based on reasoning is a good thing to do.
Here’s another way to look at it.
The speed prior is doing the brute force search that scientists try to approximate efficiently. The search is for a fast approximation of the environment. The speed prior considers them all. The scientists use heuristics to find one.
Exactly. But this does help for reasons I describe here. The description length of the “do science” strategy (I contend) is less than the description length of the “do science” + “treacherous turn” strategy. (I initially typed that as “tern”, which will now be the image I have of a treacherous turn.)
Reasoning gives you a prior that is better than the speed prior, before you see any data. (*Much* better, limited only by the fact that the speed prior contains strategies which use reasoning.)
The reasoning in this case is not a Bayesian update. It’s evaluating possible approximations *by reasoning about how well they approximate the underlying physics, itself inferred by a Bayesian update*, not by directly seeing how well they predict on the data so far.
I can reply in that thread.
I think the only good arguments for this are in the limit where you don’t care about simplicity at all and only care about running time, since then you can rule out all reasoning. The threshold where things start working depends on the underlying physics, for more computationally complex physics you need to pick larger and larger computation penalties to get the desired result.
Given a world-model ν, which takes k computation steps per episode, let ν_log be the world-model that best approximates ν (in the sense of KL divergence) using only log k computation steps. ν_log is at least as good as the “reasoning-based replacement” of ν.
The description length of ν_log is within a (small) constant of the description length of ν. That way of describing it is not optimized for speed, but it presents a one-time cost, and anyone arriving at that world-model in this way is paying that cost.
One could consider instead ν_log^ε, which is, among the world-models that ε-approximate ν in less than log k computation steps (if the set is non-empty), the first such world-model found by a searching procedure ψ. The description length of ν_log^ε is within a (slightly larger) constant of the description length of ν, but the one-time computational cost is less than that of ν_log.
ν_log, ν_log^ε, and a host of other approaches are prominently represented in the speed prior.
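In symbols, the two constructions above are (still my notation):

$$
\nu_{\log} \;=\; \operatorname*{arg\,min}_{\mu\,:\,\mathrm{time}(\mu)\,\le\,\log k} D_{\mathrm{KL}}(\nu \,\|\, \mu),
\qquad
K(\nu_{\log}) \;\le\; K(\nu) + O(1),
$$

and ν_log^ε is the first μ found by ψ with time(μ) ≤ log k and D_KL(ν ‖ μ) ≤ ε, giving K(ν_log^ε) ≤ K(ν) + K(ψ) + O(1).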
If this is what you call “the speed prior doing reasoning,” so be it, but the relevance for that terminology only comes in when you claim that “once you’ve encoded ‘doing reasoning’, you’ve basically already written the code for it to do the treachery that naturally comes along with that.” That sense of “reasoning” really only applies, I think, to the case where our code is simulating aliens or an AGI.
(ETA: I think this discussion depended on a detail of your version of the speed prior that I misunderstood.)
To be clear, that description gets ~0 mass under the speed prior, right? A direct specification of the fast model is going to have a much higher prior than a brute force search, at least for values of β large enough (or small enough, however you set it up) to rule out the alien civilization that is (probably) the shortest description without regard for computational limits.
Within this chunk of the speed prior, the question is: what are good ψ? Any reasonable specification of a consequentialist would work (plus a few more bits for it to understand its situation, though most of the work is done by handing it ν), or of a petri dish in which a consequentialist would eventually end up with influence. Do you have a concrete alternative in mind, which you think is not dominated by some consequentialist (i.e. a ψ for which every consequentialist is either slower or more complex)?
Well one approach is in the flavor of the induction algorithm I messaged you privately about (I know I didn’t give you a completely specified algorithm). But when I wrote that, I didn’t have a concrete algorithm in mind. Mostly, it just seems to me that the powerful algorithms which have been useful to humanity have short descriptions in themselves. It seems like there are many cases where there is a simple “ideal” approach which consequentialists “discover” or approximately discover. A powerful heuristic search would be one such algorithm, I think.
I don’t think anything here changes if K(x) were replaced with S(x) (if that was what you misunderstood).
True, but I’m arguing that this computable algorithm is just the alien itself, trying to answer the question “how can I better predict this richer world in order to take it over?” If there is no shorter/faster algorithm that can come up with a sequence of increasingly better algorithms, what is the point of saying that the alien is sampling from the speed prior, rather than saying that the alien is thinking about how to answer “how can I better predict this richer world in order to take it over?” Actually, if this alien were sampling from the speed prior, then it would no longer be the shortest/fastest algorithm that comes up with a sequence of increasingly better algorithms, and some other alien trying to take over our world would have the highest posterior instead.
I’m having a hard time following this. Can you expand on this, without using “sequence of increasingly better algorithms”? I keep translating that to “algorithm.”