Question about a self-modifying AI getting “stuck” in religion
Hey. I’m relatively new around here. I’ve read the Singularity Institute’s core reading, quite a few Less Wrong articles, and Eliezer Yudkowsky’s essay on Timeless Decision Theory. This question is phrased in terms of Christianity, because that’s where I thought of it, but I think it applies to lots of other religions and non-religious beliefs too.
According to Christianity, belief makes you stronger and better. The Bible claims that people who believe are substantially better off both while living and after death. So if a self-modifying decision maker concludes, even for a moment, that the Christian faith is accurate, won’t he modify his decision-making algorithm to never doubt the truth of Christianity? Given what he knows, that is the best decision.
And so, if we built a self-modifying AI, switched it on, and the first ten milliseconds caused it to believe in the Christian god, wouldn’t that permanently cripple it, as well as probably cause it to fail most definitions of Friendly AI?
When designing an AI, how do you counter this problem? Have I missed something?
Thanks, GSE
EDIT: Yep, I had misunderstood what TDT was. I just meant self-modifying systems. Also, I’m wrong.
This is the same problem as
“I convinced this arbitrary agent it was utility maximising to kill itself instantly”
which can take down any agent. The thing to do is to create an agent which does well in most situations (‘fair cases’, in the language of the TDT document). Any agent can be defeated by sufficiently contrived scenarios.
But this way, it’s more likely to lie and deceive you in the short term.
“I convinced this GAI, which was trying to be Friendly, that it could really maximize its utility by killing all the humans.”
Post upvoted for last comment, as lots of thinking with a “yep, I’m wrong” at the end is entirely to be encouraged :-D
Is there anything here that requires TDT in particular? It seems more like an application of Pascal’s Wager to self-modifying decision agents in general.
Anyway, the part I’d dispute is ”...and the first ten milliseconds caused it to believe in the Christian god”. How is that going to happen? What would convince a self-modifying decision agent, in ten milliseconds, that Christianity is correct and requires absolute certainty, and that it itself has a soul that it needs to worry about, with high enough probability that it actually self-modifies accordingly? (The only thing I can think of is deliberately presenting faked evidence to an AI that has been designed to accept that kind of evidence… which is altogether too intentional a failure to be blamed on the AI.)
The point of the ten milliseconds is that the AI doesn’t know much yet.
Yes, you have a point. My question is pretty much answered.
My main point is that if Christianity has 50% certainty, the rational decision is to modify yourself to view it with 100% certainty. Think of it as Pascal’s wager, but far more specific.
And yeah, it doesn’t need TDT, on second thoughts. However, that’s the first place I really thought about self-modifying decision agents.
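To spell out the arithmetic behind my main point, here’s a toy sketch (the utility numbers are invented, and the 50% figure is just the one I assumed above):

```python
# Toy version of the wager (all numbers invented for illustration).
p_christianity = 0.5          # the 50% credence assumed above
u_belief_if_true = 1e9        # enormous promised payoff for unwavering belief
u_belief_if_false = -10.0     # modest cost: wasted effort, distorted epistemology
u_doubt_if_true = -1e9        # enormous promised penalty for doubt
u_doubt_if_false = 0.0        # baseline: nothing gained, nothing lost

eu_self_modify = (p_christianity * u_belief_if_true
                  + (1 - p_christianity) * u_belief_if_false)
eu_keep_doubting = (p_christianity * u_doubt_if_true
                    + (1 - p_christianity) * u_doubt_if_false)

print(eu_self_modify, eu_keep_doubting)   # 499999995.0 vs -500000000.0
```

With payoffs that large, the self-modification side of the wager dominates for any probability that isn’t tiny.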
Christianity would not be assigned 50% probability, even in total ignorance; 50% is not the right ignorance prior for any event more complicated than a coin flip. An AI sane enough to learn much of anything would have to assign it a prior based on an estimation of its complexity. (See also: the technical explanation of Occam’s Razor.)
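A minimal sketch of the kind of complexity-penalised prior I mean (the bit counts are invented placeholders, not measurements of any actual hypothesis):

```python
# Length-penalised prior, Solomonoff/Occam flavoured: each hypothesis starts
# with weight 2**(-description length in bits). The bit counts are made up.
hypothesis_bits = {
    "this fair coin lands heads": 1,               # one binary detail: ~50% is fine here
    "a specific, detailed religion is true": 500,  # a long conjunction of specific claims
}

priors = {h: 2.0 ** -bits for h, bits in hypothesis_bits.items()}

for hypothesis, prior in priors.items():
    # The conjunctive, highly specific hypothesis starts out astronomically
    # below 50%, long before any evidence is considered.
    print(f"{hypothesis}: {prior:.3g}")
```

Evidence can move these numbers afterwards, but the starting point for a long conjunction of specific claims is nowhere near 50%.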
I suspect any agent can be taken down by sufficiently bad input. Human brains are of course horribly exploitable, and predatory memes are quite well evolved to eat people’s lives.
But I suspect that even a rational superintelligence (“perfectly spherical rationalist of uniform density”) will be susceptible to something, on a process like:
1. A mind is an operating system for ideas. All ideas run as root, as there are no other levels where real thinking can be done with them. There are no secure sandboxes.
2. New ideas come in all the time; some small, some big, some benign, some virulent.
3. A mind needs to process incoming ideas for security considerations.
4. Any finite idea can be completely examined for safety to run it in finite time.
5. But “finite time” does not necessarily imply feasible time (see the sketch after this list).
6. Thus, there will likely be a theoretical effective attack on any rational agent that has to operate in real time.
7. Thus, a superintelligent agent could catch a bad case of an evolved predatory meme.
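Here is the sketch promised in point 5, just to make “finite but not feasible” concrete (a toy model under invented assumptions, not a claim about how real minds vet ideas): treat an incoming idea as a black-box rule and check it exhaustively against every situation it might be applied to.

```python
from itertools import product

def looks_safe(idea, situation):
    # Hypothetical per-situation safety test; the details don't matter here.
    return idea(situation) >= 0

def exhaustively_vet(idea, n_features):
    # Finite: exactly 2**n_features cases, so this always terminates.
    # Infeasible: at n_features = 100 the loop outlives the universe.
    return all(looks_safe(idea, s) for s in product((0, 1), repeat=n_features))

# Example: a trivially benign "idea" over 20 binary features (~a million cases).
print(exhaustively_vet(lambda situation: sum(situation), n_features=20))
```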
I do not know that the analogy with current computer science holds; I just suspect it does. But I’d just like you to picture our personal weakly godlike superintelligence catching superintelligent Scientology.
(And I still hear humans who think they’re smart tell me that other people are susceptible but they don’t think they would be. I’d like to see reasoning to this effect that takes into account the above, however.)
Edit: I’ve just realised that what I’ve argued above is not that a given rational agent will necessarily have a susceptibility, but that it cannot know that it doesn’t have one. (I still think humans claiming that they know themselves not to be susceptible are fools, but I need to think more on whether they necessarily have a susceptibility at all.)
There’s no reason for this (that all ideas must run as root, with no secure sandboxes) to be true for an AI. However, I also don’t see why this assumption is necessary for the rest of your argument, which is basically that an agent can’t know in advance all the future ramifications of accepting any possible new idea or belief. (It can know them for some ideas; the challenge is presumably to build an AI good enough that it can accept enough formally analysable new ideas to be useful, while rejecting few useful ideas as not amenable to analysis.)
One question I’m not sure about—and remember, the comment above is just a sketch—is whether it can be formally shown that there is always a ’sploit.
(If so, then what you would need for security is to make such a ’sploit infeasible for practical purposes. The question in security is always “what’s the threat model?”)
For purposes of ’sploits on mere human minds, I think it’s enough to note that in security terms the human mind is somewhere around Windows 98, and that general intelligence is a fairly late addition that occasionally affects what the human does.
There isn’t always an exploit, at least for certain classes of exploit.
For instance, when we compile a statically checked language like Java and run the result in a VM, we can guarantee that it won’t take over the VM it’s executing in. Therefore, it won’t have exploits of some varieties: we can limit its CPU time and memory use, and we can inspect and filter all its communications with any other programs or data. This is essentially a formal proof of some properties of the program’s behavior.
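To make that concrete with something much smaller than the JVM, here’s a toy Python sketch (not how Java’s verifier actually works): run the untrusted “program” through a restricted evaluator that whitelists its operations and caps its step count, so its resource use is bounded and it has no channel to the outside world at all.

```python
import operator

# Only these operations exist inside the sandbox; anything else is rejected.
ALLOWED_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def run_sandboxed(program, max_steps=1000):
    """program: a list of (op, a, b) triples over integers."""
    results = []
    for step, (op, a, b) in enumerate(program):
        if step >= max_steps:
            raise RuntimeError("step budget exceeded")           # bounded CPU time
        if op not in ALLOWED_OPS:
            raise ValueError(f"operation {op!r} not permitted")  # no escape hatch
        results.append(ALLOWED_OPS[op](int(a), int(b)))
    return results

print(run_sandboxed([("+", 2, 3), ("*", 4, 5)]))   # [5, 20]
```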
The question is, can we prove enough interesting properties about something? This depends mostly on the design of the AI mind executing (or looking at) the new ideas.
As I’ve noted, my original comment isn’t arguing what I thought I was arguing—I thought I was arguing that there’s always some sort of ’sploit, in the sense of giving the mind a bad meme that takes it over, but I was actually arguing that it can’t know there isn’t. Which is also interesting (if my logic holds), but not nearly as strong.
I am very interested in the idea of whether there would always be a virulent poison meme ’sploit (even if building it would require infeasible time), but I suspect that requires a different line of argument.
I’m not aware of anything resembling a clear enough formalism of what people mean by mind or meme to answer either your original question or this one. I suspect we don’t have anywhere near the understanding of minds in general to hope to answer the question, but my intuition is that it is the sort of question that we should be trying to answer.
I don’t really see what this has to do with TDT in particular. SFAICT, this is just about self-modification, which is really independent of that.
I suspect that the belief in belief is more attractive to mammalian brains than to AI. We have a lizard brain wrapped in an emotional brain wrapped in a somewhat general purpose computer (neocortex). Intellectual humans are emotional beings trying hard to learn effective ways to program the neocortex. We have to do a lot of wacky stuff to make progress, and things we try that make progress in one direction, like a belief in belief, will generally cost us on progress in other directions, like maintaining an openness appropriate to the fact that our maps are astoundingly less complex than the territory being mapped.
But the AI is the rationalists’ baby, and we obsess over preserving as much of the baby while tossing as much of the bathwater as possible. Sure, we don’t want the baby to make the same mistakes we made, but we want it to be able to exceed us in many ways, so we necessarily leave open doors behind which we know not what lies. I imagine this is what the gods were thinking about when they decided to imbue their AIs with free will.
More than a belief-in-belief trap for our own constructions, I would worry about some new flaw appearing. If the groups around here have any say, there will probably be a bias toward Bayesian reasoning in any AI, which will likely leave it poorly suited to believing in belief. But the map is not the territory, and Bayesian probability is a mapping technique. Who can even imagine what part of the territory the Bayesian mappers will miss entirely or mismap systematically?
Finally, a Christian AI would hardly be a disaster. There’s been quite considerable intellectual progress made by self-described Christians. Sure, if the AI gets into an “I’ve got to devote all my cycles to converting people” loop, that sort of sucks, but even in bio intelligences that tends to just be a phase, especially in the better minds. The truth about maps and territories seems to be that no matter how badly you map Russia, it is your map of America that determines your ability to exploit that continent. That is, people who have a very different map of the supernatural than we do have shown no consistent deficit in developing their understanding of the natural world. Listen to a physicist tell you why he believes a Grand Unified Theory is so likely, or contemplate the very common belief that the best theories are beautiful. We rationalists believe in belief and have our own religions. We have to, to get anywhere, because that is how our brains are built.
Maybe our version of AIs will be quite defective, but better than our minds, and it will be a few generations after the singularity that truly effective intelligences are built. Maybe there is no endpoint until there is a single intelligence using all the resources of the universe in its operation, until the mapper is as complex as the universe, or at least as complex as it can be in this universe.
But that sounds like at least some aspects of the Christian god, doesn’t it?
Most non-hedonistic goals can be thought of as being like religions.
So: the problem in this area is not so much getting religion as losing it, as often happens in practice when intelligent agents get on the internet and encounter sources of unbiased factual information.
If we want the machines we build to keep their human-loving “religion”—and continue to celebrate their creators—then they had better not be too much like us in this respect.