You are not the first person to have thought of that.
My conclusion after thinking about it is that it probably does not do much to make the superintelligence friendlier to the humans, for the following reason.
Reality has a terrible consistency to it, and a superintelligence sees and knows that terrible consistency in greater detail than you and I do. So for the creator of the superintelligence to successfully fool it (by feeding it false sensory data) requires quite a bit of optimization power, i.e., intelligence. That intelligence is needed to ensure that the sensory data is consistent enough in all its details that the superintelligence remains uncertain as to whether its sensory data represents basement reality.
But any creator intelligent enough to create a false reality good enough to keep a superintelligence guessing has better ways to ascertain or ensure the alignment of the superintelligence. I strongly suspect that running simulations is something humans do because they’re just barely intelligent enough to program computers at all. Simulations make up a smaller fraction of the computations humans run than they did in the early decades of human computing because humans have gotten a little better at programming computers over the decades.
But let us taboo the word “simulation”. The word is more of a hindrance than a help. The crisp way of seeing it is that a civilization or an intelligence significantly better at computers than we are would not need to run a program to learn everything it wants to learn about the program—would not need to give the program any computational resources or ability to affect reality.
If I had a printout of the source code for a superintelligence in front of me, I could stare at it all day at no risk to myself or anyone around me. Of course I might be too dumb to tell what it would do, but an entity much better at analyzing source code than I am is similarly safe.
Some would reply here that it is my brain’s lack of computational resources that makes it safe for me to stare at the source code and to try to understand it. If an entity with vastly greater compute resources did the same thing, the source code might co-opt the staring entity’s mental processes (similar to how a virus co-opts a host, maybe). Some might say that the entity would analyze the source code in such detail—detail that you or I would be incapable of—that an inner computation would arise inside the entity that would tend not to be aligned with the entity’s values.
Well, I spent a lot of time in my youth writing programs to analyze other programs. (E.g., Boyer and Moore have an approach to theorem proving that essentially views theorems as programs.) And I can see that that is not going to happen: it’s just not how the analysis of a program by a sufficiently intelligent agent works.
Anyway, the superintelligence, wondering for the first time (maybe because it saw your comment!) whether its sensory data has been faked by its creator, will tend to undergo the same line of reasoning we have just undergone here, which will likely convince it that it doesn’t have to assign much probability to the possibility that its sensory data is being faked: again, if the creator is smart enough to keep the superintelligence guessing as to the fidelity of its sensory data, then the creator is smart enough to find out everything it wants to know about the superintelligence before ever giving the superintelligence any computational resources. Note that it can be hard even for a superintelligence to determine whether an arbitrary program has some property (because of Rice’s theorem and all that), but that is not the situation we are in: a sufficiently good programmer writing a superintelligent program can severely restrict the space through which the programmer searches for a satisfactory program (or design). Ordinary professional programmers today do this routinely: they sometimes mistakenly leave serious bugs in their programs, but Rice’s theorem or related results have no bearing on why that happens.
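To make the point about restricting the search space concrete, here is a minimal toy sketch of my own (the constructor names are made up for illustration; nothing like what an SI would actually build): a designer who only assembles programs from constructors that are each known to be total gets termination by construction, even though deciding halting for an arbitrary program is impossible.

```python
# Toy sketch (my illustration, hypothetical names): deciding halting for an
# *arbitrary* program is impossible, but a designer who only ever assembles
# programs from constructors that are each known to be total gets termination
# "by construction" -- no run is needed to establish it.

from typing import Callable, List

Prog = Callable[[List[int]], List[int]]  # a "program": finite int list -> int list

def const(xs: List[int]) -> Prog:
    """A program that ignores its input and returns a fixed list."""
    return lambda _: xs

def map_add(k: int) -> Prog:
    """A program that adds k to every element; total on finite lists."""
    return lambda xs: [x + k for x in xs]

def compose(f: Prog, g: Prog) -> Prog:
    """Run g, then f; the composition of total programs is total."""
    return lambda xs: f(g(xs))

# Every term built from const / map_add / compose halts on every input,
# and that is known before it ever runs. Running it below is only a demo.
prog = compose(map_add(3), map_add(4))
assert prog([1, 2]) == [8, 9]
```

The properties an SI would care about are of course much richer than termination, but the shape of the argument is the same: you never search the space of arbitrary programs in the first place.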
In summary, the possibility that its sensory data might be synthetic is not much of a restraint on the superintelligence (SI)’s treating the humans callously because, again, the only agents who would be motivated to put an SI to such a low-tech test are agents who can just barely design programs well enough to create any SI at all, and no such agents would be capable of keeping the sensory data consistent enough to fool the SI—and the SI will know that.
Nice try, though. Not everyone thinks of that argument.
Rolf Nelson suggested that we should make a precommitment now that our future friendly superintelligence, if it ever appears, will test all possible evil superintelligences in multilevel simulations. Therefore any future evil superintelligence will doubt whether it is in a simulation or not.
hmm spending a bunch of compute simulating agents that try to break out by screwing you up seems like a way to end up with catastrophic inner misalignment to me
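You can start simulating them when you become a Galactic-size AI and there is no risk. For acausal timeless deals, time doesn’t matter.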
I don’t follow. Why are you assuming that we could adequately evaluate the alignment of an AI system without running it if we were also able to create a simulation accurate enough to make the AI question what’s real? This doesn’t seem like it would be true necessarily.
I will try to explain (probably via a top-level post, probably not today). For now, I will restate my position.
No superintelligence (SI) that can create programs at all will run any program it has created to get evidence about whether the program is aligned with the SI’s values or interests: the SI already knows that before the program runs for the first time.
The nature of the programming task is such that if you can program well enough, there’s essentially no uncertainty about the matter (barring pathological cases that do not come up in practice unless the SI is in a truly dire situation in which an adversary is messing with core pieces of its mind), similar to how (barring pathological cases) there’s no uncertainty about whether a theorem is true if you have a formal proof of the theorem.
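As a toy illustration of the formal-proof analogy (just a sketch in Lean, using only standard lemmas): the checker either accepts a proof term or it doesn’t, and nothing is ever “run” on test cases.

```lean
-- Toy illustration of the analogy: a proof term is checked, not tested.
-- Once the checker accepts it, there is no residual uncertainty left.
example : 2 + 2 = 4 := rfl
example (m n : Nat) : m + n = n + m := Nat.add_comm m n
```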
The qualifier “it has created” above is there only because an SI might find itself in a very unusual situation in which it is in its interests to run a program deliberately crafted (by someone else) to have the property that the only practical way for anyone to learn what the SI wants to learn about the program is to run it. Although I acknowledge that such programs definitely exist, the vast majority of programs created by SIs will not have that property.
Are you curious about this position mostly for its own sake or mostly because it might shed light on the question of how much hope there is for us in an SI’s being uncertain about whether it is in a simulation?
Again, there seems to be an assumption in your argument which I don’t understand. Namely, that a society/superintelligence which is intelligent enough to create a convincing simulation for an AGI would necessarily possess the tools (or be intelligent enough) to assess its alignment without running it. Superintelligence does not imply omniscience.
Maybe showing the alignment of an AI without running it is vastly more difficult than creating a good simulation. This feels unlikely, but I genuinely do not see any reason why this can’t be the case. If we create a simulation which is “correct” up to the nth digit of pi, beyond which the simpler explanation for the observed behavior becomes the simulation theory rather than a complex physics theory, then no matter how intelligent you are, you’d need to calculate n digits of pi to figure this out. And if n is huge, this will take a while.
Are you curious about this position mostly for its own sake or mostly because it might shed light on the question of how much hope there is for us in an SI’s being uncertain about whether it is in a simulation?
The latter, but I believe there are simply too many maybes for your or OP’s arguments to be made.
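trial and error is sometimes needed internal to learning, there are always holes in knowledge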
Luckily I don’t need to show that sufficiently smart AIs don’t engage in trial and error. All I need to show is that they almost certainly do not engage in the particular kind of trial of running a computer program without already knowing whether the program is satisfactory.
you have thereby defined sufficiently smart as AIs that satisfy this requirement. this is not the case. many likely designs for AIs well above human level will have need to actually run parts of programs to get their results. perhaps usually fairly small ones.
If I had a printout of the source code for a superintelligence in front of me, I could stare at it all day at no risk to myself or anyone around me. Of course I might be too dumb to tell what it would do, but an entity much better at analyzing source code than I am is similarly safe.
Remember Rice’s theorem? It doesn’t matter how smart you are; undecidable is undecidable.
A better way of making your argument might be to suggest that an entity that was better at programming would have intentionally constructed a program that it knew was safe to begin with, and therefore had no need of simulation, rather than that it could just inspect any arbitrary program and know that it was safe.
That would, I think, also be a much safer approach for humans than building an uninterpretable ML system trained in some ad hoc way, and then trying to “test in correctness” by simulation...
I anticipated and addressed the objection around Rice’s theorem (without calling it that) in a child to my first comment, which was published 16 min before your comment, but maybe it took you 16 min to write yours.
A better way of making your argument might be to suggest that an entity that was better at programming would have intentionally constructed a program that it knew was safe to begin with, and therefore had no need of simulation, rather than that it could just inspect any arbitrary program and know that it was safe.
I was assuming the reader would be charitable enough to me to interpret my words as including that possibility (since verifying that a program has property X is so similar to constructing a program with property X).
I’m sorry to have misjudged you. Possibly the reason is that, in my mind, constructing a program that provably has property X, and in the process generating a proof, feels like an almost totally different activity from trying to generate a proof given a program from an external source, especially if the property is nontrivial.
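I agree with that, for sure. I didn’t point it out because the reader does not need to consider that distinction to follow my argument.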
I’m sympathetic to your argument, but I don’t see how we can be certain that verifying / constructing benevolent AGI is just as easy as creating high-fidelity simulations. Certainly, proficiency in these tasks might be orthogonal, and it is not impossible to imagine that it is computationally intractable to create a superintelligence that we know is benevolent, so instead we opt to just run vast quantities of simulations—kind of like what is happening with empirical AI research right now.
IMO reasoning about what will be easy or not for a far advanced civilization is always mostly speculation.
Then there is the question of fidelity. If you imagine that our current world is a simulation, it might just be a vastly simplified simulation which runs on the equivalent of a calculator in the base reality; however, because we only know our own frame of reference, it seems to us like the highest fidelity we can imagine. I think the most important part of creating such a simulation would be to keep it truly isolated: we can’t introduce any inputs from our own world that are not internally consistent with the simulated world. E.g., if we were to include texts from our world in a lower-fidelity simulation, it would most likely be easy to find out that something doesn’t add up.
There are probably certain programs and certain ways of writing programs that do have the property that to tell almost anything worthwhile about them, you have no choice but to run them. Sufficiently intelligent agents will simply avoid creating such programs and will avoid any reliance on agents that persist in writing such programs. In fact, most professional human programmers meet this bar.
It’s worse than that. Programs you can examine without running are measure zero. Just like most numbers are non-computable. Constructing a program that can only be fully examined by running it is trivial. Constructing a program that does what you want 99% of the time and fails horribly 1% of the time is the default for a decent programmer really trying to hit that “measure zero”, and the whole discipline of software engineering is devoted to minimizing the odds of failing horribly.
The set of all programs is countable, and so unlike the set of real numbers there is no uniform probability measure over them. We can’t conclude that the set of predictable programs has measure zero, except by choosing a measure in which they have measure zero.
Then we’re left with the task of trying to convince people why that choice of measure is better or more natural than any other of the infinitely many choices we could have made. This may be difficult since some quite natural measures result in the set of predictable programs having nonzero measure, such as Chaitin’s measure for programs written in a prefix-free language. This is strongly related to the types of prior we consider in Solomonoff induction.
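To spell out the kind of measure I have in mind (a sketch of the standard construction, nothing original): give each program $p$ in a prefix-free language the weight $2^{-|p|}$, where $|p|$ is its length in bits, and define

$$\mu(S) = \sum_{p \in S} 2^{-|p|}, \qquad \sum_{p} 2^{-|p|} \le 1 \quad \text{(Kraft's inequality)}.$$

Any single predictable program $p_0$ (say, one that halts immediately with a fixed output) already contributes $\mu(\{p_0\}) = 2^{-|p_0|} > 0$, so the set of predictable programs cannot have measure zero under $\mu$.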
We could consider things that are related to probability but aren’t actually measures, such as asymptotic density. However, this too runs into problems, since in all current programming languages the fraction of programs of every length with predictable results is bounded below by a nonzero constant.
I do agree that in practice humans suck at completely bounding the behaviour of their programs, but there’s no fundamental theorem in computer science that requires this. It is true that any given predictor program must fail to predict the runtime behaviour of some programs, but it is also true that given any particular program, there exists a predictor that works.
Programs you can examine without running are measure zero.
If you know of a proof of that, then I believe it, but it has no relevance to my argument because programmers do not choose programs at random from the space of possible programs: they very tightly limit their attention to those prospective programs that make their job (of ensuring that the program has the properties they want it to have) as easy as possible.
I am not a mathematician, but a sketch of a proof would be like this: a program can be mapped into a string of symbols, and a random string of symbols is known to be incompressible. A syntactically valid program in a given language ought to be mappable to a string, one valid syntactic statement at a time. Thus a random syntactically valid program is mappable to a random string and so is incompressible.
programmers do not choose programs at random from the space of possible programs: they very tightly limit their attention to those prospective programs that make their job (of ensuring that the program has the properties they want it to have) as easy as possible.
Indeed we do. However, hitting a measure-zero set is not easy, and any deviation from it lands you back in the poorly compressible or incompressible space, hence the pervasive bugs in all code, without exception, bugs you can only find by actually running the code. An ambitious program of only writing correct code (e.g., https://dl.acm.org/doi/10.1145/800027.808459) remains an elusive goal, probably because the aim is not achievable, though one can certainly reduce the odds of a program taking off into unintended and incompressible directions quite a lot by employing good software development techniques.
Often a comment thread will wander to a topic that has no bearing on the OP. Has that happened here?
Does your most recent comment have any relevance to how much hope we humans should put in the fact that an AI cannot know for sure whether its sensory data has been faked?