Just as you said: it outputs Bernoulli(1/2) bits for a long time. It’s not dangerous.
B) Even if it’s true, taking advantage of it would seem to require fine-tuning β, and I don’t see how to do that, given that trial and error wouldn’t be safe.
Fine-tuning from both sides isn’t safe. Approach from below.
Just as you said: it outputs Bernoulli(1/2) bits for a long time. It’s not dangerous.
I just read the math more carefully, and it looks like, no matter how small β is (as long as it’s positive), BoMAI will eventually converge to the most accurate world-model possible as it receives more and more input. This is because the computation penalty applies to the per-episode computation bound and doesn’t grow with each episode, whereas the accuracy advantage accumulates across episodes.
Assuming that the most accurate world-model is an exponential-time quantum simulation, that’s what BoMAI will converge to (no matter how small β is), right? And in the meantime it will go through some arbitrarily complex (up to some very large bound) but faster-than-exponential classical approximations of quantum physics that are increasingly accurate as the number of episodes increases? If so, I’m no longer convinced that BoMAI is benign as long as β is small enough, because the qualitative behavior of BoMAI seems the same no matter what β is, i.e., it gets smarter over time as its world-model gets more accurate, and I’m not sure why the reason BoMAI might not be benign at high β couldn’t also apply at low β (if we run it for a long enough time).
(If you’re going to discuss all this in your “longer reply”, I’m fine with waiting for it.)
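The convergence argument above can be made concrete with a schematic tally of the posterior weight. This assumes a prior of roughly the form 2^{−ℓ(ν)}β^{c(ν)}, where ℓ(ν) is ν’s description length and c(ν) its per-episode computation bound; the notation is assumed here for illustration, not taken from the thread.

```latex
% Schematic log-posterior weight of world-model \nu after t episodes,
% assuming a prior of roughly the form 2^{-\ell(\nu)} \beta^{c(\nu)}:
\log w_t(\nu) \;\approx\;
  \underbrace{-\,\ell(\nu)\log 2}_{\text{simplicity (fixed)}}
  \;+\; \underbrace{c(\nu)\log\beta}_{\text{speed penalty (fixed)}}
  \;+\; \underbrace{\sum_{i=1}^{t}\log \nu(x_i \mid x_{<i})}_{\text{accuracy, grows with } t}
```

The first two terms are paid once and do not grow with the number of episodes, while the log-likelihood term accumulates, which is why for any fixed β > 0 the most accurate world-model eventually dominates.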
The longer reply will include an image that might help, but a couple of other notes in the meantime. If this causes you to doubt the asymptotic result, it might be helpful to read the benignity proof (especially the proof of the Rejecting the Simple Memory-Based Lemma, which isn’t that long). The heuristic reason why decreasing β can help long-run behavior, even though long-run behavior is qualitatively similar, is this: while accuracy eventually becomes the dominant concern, along the way the prior is *sort of* a random perturbation to this which changes the posterior weight, so for two world-models that are exactly equally accurate, we need to make sure the malign one is penalized for being slower, enough to outweigh the inconvenient possible outcome in which it has shorter description length. Put another way, for benignity, we don’t need concern for speed to dominate concern for accuracy; we need it to dominate concern for “simplicity” (on some reference machine).
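The “speed must dominate simplicity” point can be written as an inequality, again with assumed notation (ℓ for description length, c for per-episode computation, neither from the thread): for two exactly equally accurate world-models, a malign one m keeps a smaller posterior weight than a benign one b precisely when

```latex
2^{-\ell(m)}\,\beta^{c(m)} \;<\; 2^{-\ell(b)}\,\beta^{c(b)}
\quad\Longleftrightarrow\quad
\bigl(c(m)-c(b)\bigr)\log\tfrac{1}{\beta} \;>\; \bigl(\ell(b)-\ell(m)\bigr)\log 2
```

so the computation gap, scaled by log(1/β), must outweigh whatever description-length advantage the malign model happens to have; shrinking β strengthens the left-hand side without touching accuracy at all.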
so for two world-models that are exactly equally accurate, we need to make sure the malign one is penalized for being slower, enough to outweigh the inconvenient possible outcome in which it has shorter description length
Yeah, I understand this part, but I’m not sure why, since the benign one can be extremely complex, the malign one can’t have enough of a K-complexity advantage to overcome its slowness penalty. And since (with low β) we’re going through many more different world models as the number of episodes increases, that also gives malign world models more chances to “win”? It seems hard to draw any trustworthy conclusions based on the kind of informal reasoning we’ve been doing, and we need to figure out the actual math somehow.
And since (with low β) we’re going through many more different world models as the number of episodes increases, that also gives malign world models more chances to “win”?
Check out the order of the quantifiers in the proofs. One β works for all possibilities. If the quantifiers were in the other order, they couldn’t be trivially flipped since the number of world-models is infinite, and the intuitive worry about malign world-models getting “more chances to win” would apply.
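Schematically, the quantifier-order point is the difference between the following two claim shapes (notation assumed for illustration, with the world-models ranging over an infinite class):

```latex
% The shape the proofs establish:
\exists\, \beta > 0 \;\; \forall\, \text{world-models } \nu :
  \quad \text{malign } \nu \text{ is eventually outweighed}
% as opposed to the weaker, insufficient shape:
\forall\, \text{world-models } \nu \;\; \exists\, \beta > 0 : \quad \ldots
```

Because the class of world-models is infinite, the second shape would leave no single safe β, which is exactly the “more chances to win” worry; the proofs give the first shape.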
Let’s continue the conversation here, and this may be a good place to reference this comment.
Fine-tuning from both sides isn’t safe. Approach from below.
Sure, approaching from below is obvious, but that still requires knowing how wide the band of β that would produce a safe and useful BoMAI is; otherwise, even if the band exists, you could overshoot it and end up in the unsafe region.
ETA: But the first question is: is there a β such that BoMAI is both safe and intelligent enough to answer questions like “how to build a safe unbounded AGI”? When β is very low, BoMAI is useless; as you increase β, it gets smarter; but at some point, with a high enough β, it becomes unsafe. Do you know a way to figure out how smart BoMAI is just before it becomes unsafe?
Some visualizations which might help with this:

But then one needs to factor in “simplicity”, i.e., the prior penalty from description length:
Note also that these are average effects; they are just for forming intuitions.
Your concern was:
is there a β such that BoMAI is both safe and intelligent enough to answer questions like “how to build a safe unbounded AGI” [after a reasonable number of episodes]?
This was the sort of thing I assumed could be improved upon later once the asymptotic result was established. Now that you’re asking for the improvement, here’s a proposal:
Set β safely. Once enough observations have been provided that you believe human-level AI should be possible, exclude world-models that use fewer than s computation steps per episode, starting from s←1. Every episode, increase s until human-level performance is reached. Under the assumption that the average computation time of a malign world-model is at least a constant times that of the “corresponding” benign one (corresponding in the sense of using the same, possibly coarse and approximate, simulation of the world), the update s←αs should be safe for some α>1 (with α−1 not vanishingly small).
I need to think more carefully about what happens here, but I think the design space is large.
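A minimal sketch of the proposed schedule, as a toy illustration rather than anything from the thread: `performance_of` and `human_level` are hypothetical stand-ins for however one would measure episode performance, and only the growth rule s ← αs comes from the proposal above.

```python
import math

def anneal_computation_floor(performance_of, human_level,
                             alpha=1.5, s0=1.0, max_episodes=10_000):
    """Grow the per-episode computation floor s geometrically (s <- alpha * s)
    each episode until measured performance reaches human_level.

    `performance_of` is a hypothetical stand-in for a per-episode
    performance measurement at a given computation floor s.
    """
    assert alpha > 1, "alpha must exceed 1 so the floor actually grows"
    s = s0
    for episode in range(max_episodes):
        if performance_of(s) >= human_level:
            # Stop raising the floor once performance suffices.
            return s, episode
        s *= alpha
    raise RuntimeError("human-level performance not reached within max_episodes")

# Toy usage: pretend performance grows with the log of the computation floor.
s_final, n_episodes = anneal_computation_floor(
    performance_of=lambda s: math.log2(s + 1) / 20,
    human_level=0.5,
)
```

The “approach from below” character is visible here: the floor only ever moves past a value after that value has been tried and found insufficient, so under the stated assumption that malign world-models are slower by a constant factor, the first s yielding adequate performance should still be on the safe side for a suitable α.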
Fixed your images. You have to press space after you use that syntax for the images to actually get fetched and displayed. Sorry for the confusion.
Thanks!
Longer response coming. On hold for now.