Some thoughts on AI, Philosophy, and Safety
I’ve spent some time over the last two weeks thinking about problems around FAI. I’ve committed some of these thoughts to writing and put them up here.
There are about a dozen real posts and some scraps. I think some of this material will be interesting to certain LWers; there is a lot of discussion of how to write down concepts and instructions formally (which doesn’t seem so valuable in itself, but it seems like someone should do it at some point), some review and observations on decision theory, and some random remarks on complexity theory, entropy, and prediction markets.
I just read your latest post, Confronting Incompleteness, and don’t understand it.
1) It’s a simple corollary of Gödel’s theorems that there can be no machine that prints out all true statements about the “standard integers” but doesn’t print out any false statements. Therefore, for any agent A that prints only true statements, there is some agent A’ that prints the same statements plus one more hardcoded statement that A doesn’t print. (ETA: for example, it could be the statement “A doesn’t print this statement.” A sketch of this construction appears after point 2 below.) So your game doesn’t have a best strategy, and the best we can hope for is a pretty good strategy. Do you believe you have a strategy that’s better than just printing out all the logical consequences of some formal system?
2) I’m not sure humans can feel pretty confident that “anything that humans feel pretty confident about is true”. To believe such a statement, you must first understand it, and on closer inspection it turns out to be self-referential in a weird way. That weirdness can be bootstrapped into a contradiction via Löb’s theorem.
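Returning to point 1, here is a minimal Python sketch of the diagonal construction, only to make the shape of A’ explicit. The names are hypothetical (nothing here actually enumerates theorems), and in reality the extra statement would come from the diagonal lemma applied to A; the sketch just shows that A’ is “A plus one hardcoded statement”.

```python
from typing import Callable, Iterator

def a_prime(A: Callable[[], Iterator[str]],
            extra_statement: str) -> Iterator[str]:
    """A' prints one extra true statement that A provably never prints
    (e.g. a Goedel-style sentence about A), then copies A verbatim."""
    yield extra_statement   # the hardcoded statement A doesn't print
    yield from A()          # otherwise A' behaves exactly like A

# Hypothetical usage: if `peano_theorems` enumerated the theorems of PA,
# a_prime(peano_theorems, "Con(PA)") would print strictly more true statements.
```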
Fix A; say, make it an emulation of your brain.
What should A do? A doesn’t get to step outside of itself and print a statement “A doesn’t print this statement” without the statement becoming false.
But A can do better than just printing out the consequences of a formal system. It can print out the theorems in a formal system, and then add on as many Con(...) statements as it can think of, as Nesov remarks. However, it can do even better still by outputting Con(A), even though it cannot then output all of the logical consequences of Con(A). It will then go on to output some other statements, and in general the behavior will be extremely complicated.
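As a toy illustration of the “keep adding Con(...) statements” part, treating theories as opaque names with no proof search at all (so this is purely about the shape of the enumeration, not a real implementation):

```python
def consistency_tower(base: str, levels: int) -> list[str]:
    """Symbolically build Con(T), Con(T + Con(T)), ... for a named theory T."""
    theory, statements = base, []
    for _ in range(levels):
        con = f"Con({theory})"
        statements.append(con)          # output the consistency statement...
        theory = f"({theory} + {con})"  # ...then extend the theory and repeat
    return statements

print(consistency_tower("PA", 3))
# ['Con(PA)', 'Con((PA + Con(PA)))', 'Con(((PA + Con(PA)) + Con((PA + Con(PA)))))']
```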
A program is not a formal system, at least not trivially; you need to specify an interpretation that sees it, or some part of it, as one. If a program outputs a Con(A) statement, why not form a formal system B=A+Con(A) and have the agent use B? It’s better than using just A and outputting a single additional Con(A) without its consequences. What’s the point of not extending formal systems with the new things the agent could output, and instead generating them in some different ad-hoc way?
Sorry, in the post I introduced Con(A) [where A is a program which outputs statements] to mean:
“For no statement S does A output both S and not S.”
(This seems like the natural translation to the type “program.”) Hopefully now this is clear?
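If it helps, a finite-approximation sketch of that reading in Python; the real Con(A) quantifies over A’s entire, possibly infinite, output, and the syntactic handling of negation here is deliberately naive:

```python
from itertools import islice
from typing import Iterable

def negation(s: str) -> str:
    """Naive syntactic negation: 'S' <-> 'not S'."""
    return s[4:] if s.startswith("not ") else "not " + s

def consistent_so_far(outputs: Iterable[str], n: int) -> bool:
    """True iff no statement S appears together with not S among
    the first n statements the program outputs."""
    seen: set[str] = set()
    for s in islice(outputs, n):
        if negation(s) in seen:   # the program printed both S and not S
            return False
        seen.add(s)
    return True
```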
That part is now clear, though that means A is not a formal system (its output is not closed under logical inference rules), and instead of Con(A) you could output all kinds of other things. It’s not clear what motivates this construction as compared to countless others.
I am not really trying to endorse one construction or another.
The point is: if you are playing the game “say as many true things as you can without messing up” and you are good at it, you don’t end up saying a set of things which is closed under logical inference.
It is quite possible that this is consistent with your intuitions, but it wasn’t consistent with mine until making this observation (and I would guess that most mathematicians would assume that you should say all logical implications of anything you say).
Right. So your point is to antipredict the assumption that if an agent is generating statements, then it’s a formal system. The argument is to point out that there seems to be no clear sense in which being a formal system is optimal, since there are lots of strategies that break this property, so a good agent probably doesn’t just parrot a complete list of some formal system’s output. I agree, though seeing an agent’s output as logical statements at all still doesn’t seem like a natural or useful interpretation. I guess my next question would be about the motivation for considering that setting.
I tend to think about and talk about agents’ “beliefs,” and apply various intuitions about my own beliefs to decision theoretic problems; this setting is designed to better inform some of those intuitions (in particular, it shifts the border on some conflicts between my naive intuitions and incompleteness results).
This didn’t clarify the situation for me.
Also, an infinite sequence of Con(...) statements can be included in the theory; it’s not something requiring a new construction...
An interesting question suggested by the post is what true statements about itself can an agent produce in that setting, assuming it can control which statements to withhold (for how long or under what conditions), even if it knows them to be true.
(There seems to be a sense in which an agent builds the axioms of the formal system that it has access to by making decisions and observations. That is, we can reason about the agent in a fixed system, but the system that the agent itself knows is revealed to it only gradually, depending on what the agent decides and observes at each step, based on the formal system it has at the moment. In this sense, in some cases the agent won’t be withholding statements despite knowing them to be true; it just won’t know them to be true yet (in a stronger sense than not having found consequences of a formal system it has; instead it doesn’t yet have the formal system that has them as consequences), but will figure them out later.
This view seems to be connected with free will-as-incompleteness demonstrated by the chicken rule: actions are independent of the given formal system, but the agent should be able to learn what the action was after actually performing it, thus extending the original formal system.)
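For concreteness, a rough sketch of the chicken rule referred to above; `proves` stands for a hypothetical bounded proof search over the agent’s current theory and `best_action` for the ordinary proof-based choice, neither of which is implemented here:

```python
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")

def decide(actions: Sequence[A],
           proves: Callable[[str], bool],
           best_action: Callable[[], A]) -> A:
    """If the current theory proves the agent doesn't take some action,
    take that action immediately (the chicken rule); otherwise defer to
    the ordinary proof-based choice."""
    for a in actions:
        if proves(f"the agent does not take action {a!r}"):
            return a  # keeps each "I take a" undecided by the current theory
    return best_action()
```

The effect is the one described above: the action is left undecided by the theory the agent starts with, and the theory can be consistently extended with the actual action once it has been taken.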
Gödel’s theorems apply to effectively generated theories, which means precisely that there is a program printing out all the axioms (or, equivalently, all the provable theorems). So even if an agent is very crafty about this sort of control, it can’t become exempt from Gödel. And I don’t see why it would produce something much stronger than our existing axiomatic systems. So why bother?
To capture some aspects of theory-building and incompleteness associated with decision-making. I agree that all of the agent’s activity could just be captured in a stronger system that anticipates all of its decisions from the start, but the agent’s own decision-making seems to be modeled best as using weaker theories that can’t predict the next decisions; and after each decision is made, it should become known, so that further planning can build on it.
So the reason why the question is interesting is not because it leads to stronger theories, but because it’s concerned with capturing the effect of decisions on theories involved in making them.
If this approach doesn’t lead to stronger theories, then the game described in Paul’s post doesn’t seem to be strong motivation for studying it. Can you think of some other game where it would be useful?
Sure, it’s not a strong motivation; that’s why I said “question suggested by the post” (we are talking about two different questions now...). The second question seems to be mainly about interpreting the agent’s activity, not directly about the construction of agents, so I don’t expect a game that clearly requires this kind of analysis. (For example, all agents are “just programs” (perhaps with “just oracles”), so any talk of formal systems used by agents is something in addition, a method of understanding them, even if it results in useful designs.)
On the other hand, if we want the agent to remain ignorant of its own actions, adding axioms that decide the actions to the agent’s own theory from the very start is not an option, so describing a procedure for adding them later is a way of keeping both requirements: a strong theory and consistency of counterfactuals at each decision.
If the agent decides to print any statement at all, that’s probably because it follows from some formal system, right? Then there seems to be no harm in printing all statements that follow from that formal system.
I really like your post about hazards, which seems to be the most important one in the sequence. You examined an obvious-looking strategy of making an AI figure out what “humans” are, and found a crippling flaw in it. But I don’t understand some of your assumptions:
1) Why do you think that humans have a short description, or that our branch of the multiverse has a short pointer to it? I asked that question here. In one of the posts you give a figure of 10000 bits. I have no idea why that would be enough, considering the amount of quantum randomness that was involved in human evolution. I don’t even know of any argument for why the conditional K-complexity of one big human artifact, given a full description of another big human artifact, has to be low (the notation is spelled out after point 2 below).
2) Why do you think a speed prior is a good idea for protection against the attacks? It doesn’t seem likely to me that an actual human is the most computationally efficient predictor of a human’s decisions. If we allow a little bit of error, an enemy AI can probably create much more efficient predictors.
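(For reference, the conditional K-complexity in point 1 is the standard notion, written out here only to fix notation: the length of the shortest program that outputs one object when given the other as input.)

```latex
\[
  K(x \mid y) \;=\; \min \{\, |p| \;:\; U(p, y) = x \,\}
\]
```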
(10000 was an upper bound on the extra penalty imposed on the “global description” if we use Levin rather than Kolmogorov complexity; the point was that 10000 bits is probably quite small compared to the K-complexity, especially if we rule out global descriptions.)
I have an intuitive sense of how much surprise I have had across all of the observations I have made of the world, and this seems to be an upper bound for the negative log probability assigned by the distribution “things embedded in our universe.” If our universe has a short description, then this suggests that specifying “things embedded in the universe” and then drawing from among all of them contributes a lot to the universal prior mass on an MRI of your brain (it is much harder to say the same for Finnegans Wake, or any other small object, since my intuitive bound is only interesting when applied to a large fraction of all of a human’s experiences). I suspect this dominates, but am not sure.
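(For reference, the standard objects being invoked in this exchange, with U a fixed universal prefix machine and t(p) the running time of program p; these are the textbook definitions, not anything specific to the posts. The “extra penalty” from using Levin rather than Kolmogorov complexity for a given program is roughly the log of that program’s runtime.)

```latex
\[
  M(x) \;=\; \sum_{p \,:\, U(p)\ \text{extends}\ x} 2^{-|p|}, \qquad
  K(x) \;=\; \min_{p \,:\, U(p) = x} |p|, \qquad
  \mathrm{Kt}(x) \;=\; \min_{p \,:\, U(p) = x} \bigl( |p| + \log t(p) \bigr).
\]
```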
The question is whether an enemy AI is a better predictor than the best prediction algorithm of similar complexity; of course there are AIs that are good simulators, but in general they can’t hijack the universal prior because they need to be told what they should try to simulate, and it would be cheaper at that point just to cut out the AI (though there are exceptions; I discuss one, which obtains the necessary info about the human by simulating the universe and which is therefore significantly injured by the complexity penalty).
Also note that the speed prior doesn’t seem to help you very much, but I think a space-bounded version does rule out universe simulators and a few other difficulties.
Excellent. Thanks!
I sure hope you have a tendency to eventually converge to something that makes sense to me… Do you agree that what you post there is the product of an “initial exploration” phase that would get significantly revised and mostly discarded on the scale of months? (I had a blog just 1.5 years ago that I currently see this way, but didn’t at the time...)
Have you seen Paul’s latest post yet? It seems much better formed than his previous posts on the subject.
I left a comment there, but it’s still under moderation, so I’ll copy it here.
This seems like a problematic part of the argument. The reason we think torturing humans would be bad according to U is that we have an informal model of humans in our mind, and we know that U is actually a simulation of something that contains a human. Our “suspicion” does not come from studying U as a mathematical object, which is presumably all that a U-maximizer would do, since all it has is a formal definition of U and not our informal knowledge of it.
I agree, though it doesn’t go as far afield as many of the other posts. It’s actually another plausible winning scenario that I forgot about in the recent discussions: implement WBE via AGI (as opposed to the normal engineering route, thus winning the WBE race), and then solve the remaining problems from within. This might be possible even when the FAI puzzle is not yet completely solved.
Could you clarify your remarks? This seems to be a source of persistent mild disagreement, but I’m not really sure what it is. I am aware of some inferential distance between us on what seem to me to be technical aspects of decision theory, but your comments seem to require either some misunderstanding or some other not-yet-identified inferential chasm.
In the future I expect to feel basically the same way about this writing, particularly the stuff in the category “Formal Definitions,” as I do today about these posts: not safe for use, but important for someone to think about and describe, if only to see more exactly why they are dangerous approaches. I expect the formal assertions I’ve made, to the extent I’ve made formal assertions, to continue to look reasonable. I am open to the possibility that there may be surprises.
(For example, when making those old LW posts I didn’t yet understand exactly how weird TDT agents’ behavior might look to someone used to thinking in terms of CDT, so while the cryptographic boxing stuff still holds up fine, the manipulation of boxed AIs doesn’t; the new work can be expressed with much less wiggle room, but there may still be surprises.)
I have to tweak the questions I am asking academics about risks from AI. Instead of asking, “do you think risks from AI are to be taken seriously?”, I should ask, “do you think that possible interferences from alternative Everett branches in which a singularity went badly could enable an unfriendly AI to take over the universe?”. I might increase the odds of being taken seriously that way...but I’ll have to think that matter over first.