Death Note, Anonymity, and Information Theory
I don’t know if this is a little too far afield for even a Discussion post, but people seemed to enjoy my previous articles (Girl Scouts financial filings, video game console insurance, philosophy of identity/abortion, & prediction market fees), so...
I recently wrote up an idea that has been bouncing around my head ever since I watched Death Note years ago—can we quantify Light Yagami’s mistakes? Which mistake was the greatest? How could one do better? We can shed some light on the matter by examining DN with… basic information theory.
Presented for LessWrong’s consideration: Death Note & Anonymity.
Nice analysis! I really like the way you quantified it.
This is off-topic, but I think Death Note is practically begging to be re-written as rationalist fanfiction: Light’s manipulation skills could be used to discuss psychology and cognitive biases (much like Draco Malfoy in HP:MOR). L would of course be a Bayesian rationalist, and Soichiro and Aizawa could be Traditional Rationalist foils who would allow L to explain the ins and outs of high-level rationality. (As you’ve shown here, information theory could play a larger role in L’s investigation.) The cat-and-mouse games between L and Light could be turned into decision theory problems; the rules related to ownership of the notebook could be used to explore timeless reasoning (much like the film Memento). The story is already brimming with ethical questions, and both L and Light’s internal monologues could be used to discuss consequentialism and utilitarianism. I’m not sure what could be done with the rest of the characters or how the supernatural aspect would be handled, but it would probably be an interesting read.
Eliezer’s Timeless Decision Theory is interesting, but I don’t yet understand what real problem it is solving. He worked it into his Harry Potter fanfic when Harry was dealing with Azkaban (sp?), and given that he wrote the paper I can see why he made use of it, but here’s someone else saying it’s useful.
Does timeless reasoning have any application in situations where your opponents can’t read your mind?
(I haven’t watched Death Note, so my apologies if the answer to this question is obvious to people who have.)
We all read each other’s minds to some extent, and to the extent this happens, TDT will give better advice than CDT. See section 7 of the TDT paper:
One reason is that it seems like it might be helpful with friendliness proofs, particularly the part where you have to prove the AI’s goal will remain stable over millions of self-modifications (the harder, and all too frequently ignored, side of the problem). Basically, it takes dilemmas which might otherwise tempt an AI to self-modify, and shows that it need not.
I think with CDT you can prove an AI won’t need to modify its goal system on action-determined problems, while with TDT you can prove the same for the broader class of decision-determined problems. This leaves many issues, but it’s a step in the right direction.
Disclaimer: The above post should not be taken to speak for Eliezer Yudkowsky, SIAI, or anyone other than me. I am not in any way a member of SIAI or any other similar organization. There is a good chance that I am talking out of my arse.
What’s the easier side?
Figuring out what the goal should be (note, I said easier, not easy). You probably know more than I do, but the way I see it the whole thing breaks down into a philosophy problem and a maths problem. Most people find philosophy more fun than maths, so they spend all their time debating the former.
I’m not clear on the action-determined vs. decision-determined distinction. Can you give an example of a dilemma that might tempt an AI to self-modify if we didn’t build it around TDT?
In general, I’m nervous around arguments that mention self-modification. If self-modification is a risk, then engineering in general is a risk, and self-modification is a special case of engineering. So IMO an argument about Friendliness that mentions self-modification immediately needs to be generalized to talk about engineering instead. Self-modification as a fundamental concept is therefore a useless distraction.
The classic is Parfit’s hitch-hiker, where an agent capable of accurately predicting the AI’s actions offers to give it something if and only if the AI will perform some specific action in future. A causal AI might be tempted to modify itself to desire that specific action, while a timeless AI will simply do the thing anyway without needing to self-modify.
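A toy payoff version of Parfit’s hitch-hiker makes the contrast concrete (all the numbers here are illustrative assumptions, not anything from the TDT paper):

```python
# Parfit's hitch-hiker, stripped down to payoffs. A perfect predictor
# rescues the agent only if it predicts the agent will pay afterwards.
U_RESCUED_AND_PAY = 90   # saved from the desert, minus the payment
U_RESCUED_NO_PAY = 100   # what CDT would prefer *after* being rescued
U_LEFT_IN_DESERT = 0     # predicted refusal means no rescue at all

def outcome(pays_if_rescued: bool) -> float:
    """The predictor conditions the rescue on the agent's policy."""
    if pays_if_rescued:
        return U_RESCUED_AND_PAY
    return U_LEFT_IN_DESERT

# A timeless agent just pays (90). A causal agent reasons that, once
# rescued, paying causes nothing good, refuses, is predicted to refuse,
# and gets 0 -- unless it self-modifies in advance to want to pay.
print(outcome(True), outcome(False))
```

The point of the sketch is that the causal agent can only reach the 90-utility outcome by changing what it will want later, whereas the timeless agent’s unmodified policy already gets it.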
As for your second problem, Yudkowsky himself explains much better than I could why self-modification is important in the 3rd question of this interview.
Roughly, the importance is that there are only two kinds of truly catastrophic mistakes an AI could make: mistakes which manage to wipe out the whole planet in one shot, and errors in modifying its own code. Everything else can be recovered from.
That works if the AI knows that the other agent will keep its promise, and the other agent knows what the AI will do in the future. In particular the AI has to know the other agent is going to successfully anticipate what the AI will do in the future, even though the AI doesn’t know itself. And the AI has to be able to infer all this from actual sensory experience, not by divine revelation. Hmm, I suppose that’s possible.
Hmm, it’s really easy to specify a causal AI, along the lines of AIXI but you can skip the arguments about it being near-optimal. Is there a similar simple spec of a timeless AI?
When I think through what the causal AI would do, it would be in a situation where it didn’t know whether the actions it chooses are in the real world or in the other agent’s simulation of the AI when the other agent is predicting what the AI would do. If it reasons correctly about this uncertainty, the causal AI might do the right thing anyway. I’ll have to think about this. Thanks for the pointer.
It could build and deploy an unfriendly AI completely different from itself.
That’s the thing about mathematical proofs, you need to conclusively rule out every possibility. When dealing with something like a super-intelligence there will be unforeseen circumstances, and nothing short of full mathematical rigour will save you.
I don’t know of one off-hand, but I think AIXI can easily be made Timeless. Just modify the bit which says roughly “calculate a probability distribution over all possible outcomes for each possible action” and replace it with “calculate a probability distribution over all possible outcomes for each possible decision”.
This may be worth looking into further; I haven’t looked very deeply into the literature around AIXI.
This looks like you might be stumbling towards Updateless Decision Theory, which is IMHO even stronger than TDT and may solve an even wider range of problems.
I could come up with an argument for this falling into either category.
I’m claiming that the concept of self-modification is useless since it’s a special case of engineering. We have to get engineering right, and if we do that, we’ll get self-modification right. I’m struggling to interpret your statement so it bears on my claim. Perhaps you agree with me? Perhaps you’re ignoring my claim? You don’t seem to be arguing against it.
The scenario I proposed (creating a new UFAI from scratch) doesn’t fit well into the second category (self-modification) because I didn’t say the original AI goes away. After the misbegotten creation of the UFAI, you have two, the original failed FAI and the new UFAI.
Actually, the second category (bad self-modification) seems to fit well into the first category (destroying the planet in one go), so these two categories don’t support the idea that self-modification is a useful concept.
Okay, I think I see what you mean about engineering and self-modification, but I don’t think it’s particularly important. It appears you’re thinking in terms of two concepts:
Self-modification: Anything the AI does to itself, for a fairly strict definition of ‘itself’, as in ‘the same physical object’ or something like that.
Engineering: Building any kind of machine.
However, I think that when most FAI researchers talk about ‘self-modification’ they mean something broader than your definition, which would include building another AI of roughly equal or greater power but would not include building a toaster.
Any mathematical conclusions drawn about self-modification should apply just as well to any possible method of doing so, and one such method is to construct another AI. Therefore constructing a UFAI falls into the category of ‘self modification error’ in the sense that it is the sort of thing TDT is designed to help prevent.
Sorry, I don’t believe you. I’ve been paying attention to FAI people for some time and never heard “self-modification” used to include situations where the machine performing the “self-modification” does not modify itself. If someone actually took the initiative to define “self-modification” the way you say, I’d perceive them as being deliberately deceptive.
You’re being overly literal.
I have seen SIAI-affiliated people on Less Wrong arguing that self-modification is impossible to prevent, by pointing out that even an injunction against rewriting its own source code would not prevent the AI from building something else.
Self-modification as you describe it is a useless mathematical concept for Friendliness, as is engineering. Worse, it is not even well-defined: if an AI copies itself onto another computer and alters the copy, is that self-modification? If it modifies itself, but keeps a copy of its old code around, is that self-modification? Where do you draw the line between the two?
You are violating the principle of charity by assuming the interpretation that makes them look worse.
Mostly when SIAI people talk about self-modification they imagine a machine that just goes in and edits its own source code, because that is presumably the most efficient way to self-modify and the one that most AIs would use. This does not mean that ‘builds another AI’ is not included, but it seems like a very stupid and inefficient way to go about things, so you are wasting your time by worrying too much about it.
I’ll bet you £100 that whatever conclusions the SIAI eventually draws about self-modification will apply just as well to all kinds; I really cannot see how a silly distinction like the one you are making would find its way into a mathematical proof.
We’re certainly agreed on that. I’m willing to go further—I believe any mathematical conclusions that apply to self-modification (your definition) will apply to all possible actions. I don’t think your definition carves out a part of the world that has any usefully special properties.
Agreed.
I don’t think your definition is well-defined either. Where’s the important line between self-modification and making a toaster?
We appear to have no useful definition for the word. Time to stop using it, IMO.
I disagree. “An ideal CDT agent that anticipates facing only action-determined problems will always choose not to self-modify” is true, while “An ideal CDT agent that anticipates facing only action-determined problems will always choose not to do anything” is false.
I’m not a hundred percent clear on this, and I’ll be the first to admit that this is a problem and needs to be fixed before the larger problem can be solved. From a very brief period of thought, it seems to me a good line to draw is the point at which the new agent becomes more powerful, in the sense of optimization power, than the old one.
I think the word points to something, and I have a feeling that something is the heart of the problem. Interestingly, in terms of mathematical decision theory self-modification seems quite well defined.
After some heat, we’re starting to get light. This is good.
I’m not sure that’s true. Imagine I’m an ideal CDT. I am in North America. If I wish to react to something that happens in China, there will be some lag. If I could deal with the situation better when there is no lag, I would benefit from cloning myself and sending a copy to China. Would that be self-modification?
(This presupposes that I have access to materials sufficient to copy myself. That might not be true, depending on whether an ideal CDT is physically realizable.)
I should probably have specified that building another agent doesn’t really count as self modification if the other agent is identical to the original (or maybe it does count as self modification, but in a very vacuous sense, the same way ‘do nothing’ is technically an algorithm). So if the other agent is CDT this is not a counter-example.
If the other agent is a more primitive approximation to a CDT then I would view constructing it not as self-modification, but simply as making a choice in an action-determined problem.
If the other agent is TDT or UDT or something then this may count as self-modification, but there is no need to make it this way.
Suppose we use the rigorous definition where an action-determined problem is just a list of choices, each of which leads to a probability distribution across possible outcomes, each of which has a utility assigned to it. In this case I think it is clear that “An ideal CDT agent that anticipates facing only action-determined problems will always choose not to self modify” is true while “An ideal CDT agent that anticipates facing only action-determined problems will always choose not to do anything” is false.
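That rigorous definition is small enough to write down directly. A minimal sketch, where the two-action problem and its utilities are made-up illustrative assumptions:

```python
# An action-determined problem: each action maps to a probability
# distribution over outcomes, and each outcome has a utility.
# Represented here as action -> list of (probability, utility) pairs.
problem = {
    "push_button": [(0.5, 10.0), (0.5, -2.0)],  # risky action, EU = 4.0
    "wait":        [(1.0, 3.0)],                # safe action,  EU = 3.0
}

def expected_utility(dist):
    return sum(p * u for p, u in dist)

def cdt_choose(problem):
    """An ideal CDT agent on an action-determined problem simply
    maximizes expected utility over the listed actions."""
    return max(problem, key=lambda a: expected_utility(problem[a]))

print(cdt_choose(problem))  # "push_button": EU 4.0 beats 3.0
```

Nothing in this formalism gives the agent any leverage over how its own decisions are made, which is why self-modification can never pay off inside it, even though ordinary actions obviously can.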
That’s plausible, but my counterexample still holds, apparently. I’m sure the desired theorem is true under the right hypotheses, but I can’t quite guess what they are right now.
In the cloning scenario, Tim-in-China would have to be a modified version of Tim-in-US. Tim-in-US is optimizing for a utility function U of the environment which perhaps can only be evaluated based on information available to Tim-in-US. Tim-in-China would be constructed to optimize for the best estimate of U it can make, given that it’s in China. This best estimate will be different from U. If everything important happens in China and needs quick responses, and Tim-in-US can’t move, it might even be worthwhile for Tim-in-US to sacrifice himself to create Tim-in-China.
Tim-in-China is clearly a self-modification, since the utility function is different, right?
In general, we can contrive the circumstances so the agent is paid to self-modify. If the agent is rational and it’s paid enough to self-modify, it will.
There’s no reason for this. A true CDT agent doesn’t need to see the results of its actions, it just needs to predict them. Since it’s an ideal Bayesian, it should be quite good at this. Tim-in-China might acquire new information that Tim-in-US didn’t know, causing it to revise its probability distribution, but it would not change its utility function. Nor would it cease to be a CDT agent, which means in practice it would not self-modify.
Also, strictly speaking, prior to the point where Tim-in-China is created the problems are not fully action-determined, since the outcome is affected by things other than random chance and the choices made by Tim-in-US.
Heat but no light this time around. I won’t reply more unless it gets better.
The world in which Tim-in-US lives determines what options are available when creating Tim-in-China, not any property of CDT, so if I’m creating the scenario I can fill in the details so there is reason for Tim-in-China to be lame in any way I choose. It could be very simple—Tim-in-US has a button to push that will both destroy Tim-in-US and set Tim-in-China into action, where Tim-in-China existed at the beginning of the scenario and is therefore whatever I want it to be. Tim-in-US cannot take any direct action other than pushing the button. Pushing the button is self-modification. If we can contrive for it to be rational for Tim-in-US to push the button, Tim-in-US will self-modify.
In a more realistic scenario, Tim-in-China might be imperfect because it is built of whatever materials are at hand, rather than the mathematically perfect substrate Tim-in-US’s mind runs on. If you want Tim-in-China to be an ideal CDT for it to qualify as self-modification, then fine, Tim-in-China is an ideal CDT but the environment constrains things so that Tim-in-China’s utility function is not a particularly good approximation to that of Tim-in-US. If Tim-in-China’s utility function is good enough, and Tim-in-US’s ability to take direct action is impaired enough, then we can fill in the details so Tim-in-US will still benefit from self-modifying.
I can’t make sense of this. Please tell me the influence on the outcome that wasn’t random chance and wasn’t a choice made by Tim-in-US. (We don’t need any randomness in this scenario.) You’ll also have to choose something that leads to it not being action-determined, and something that’s consistent with a definition of action-determined that doesn’t lead to “action-determined” referring to a useless or empty set of possibilities.
You might be referring to actions taken by Tim-in-China. Tim-in-US chose to create Tim-in-China, so all actions taken by Tim-in-China are a consequence of choices made by Tim-in-US.
The thing is, there are two ways of looking at this problem. Either creating Tim-in-China is just one option available in an action-determined problem, and everything he does is just a consequence which Tim-in-US predicted; in this case it isn’t self-modification. Alternatively, he is an independent agent, in which case creating him is self-modification but the problem isn’t action-determined.
I think I’m beginning to see that you’re right: self-modification isn’t a strictly defined concept. On the other hand, very few things are strictly defined; ‘human’ and ‘AI’ are certainly not, but we wouldn’t be wise to ignore them when solving Friendliness.
It is possible to set up mathematical models in which self-modification is well defined (in the same way that atoms aren’t fundamental physical entities, but we can set up models in which they are, and those models are useful). The basic idea is: an agent is given a problem of some type, but prior to the problem we offer it the chance to have the problem faced by another agent instead of itself; if there is any other agent for which it would say yes, then it self-modifies on this problem.
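That delegation model can be sketched in a few lines. Everything here, the toy problem and the two candidate policies, is an illustrative assumption:

```python
def eu_of_action(problem, action):
    """Expected utility of one action in an action -> [(prob, utility)] problem."""
    return sum(p * u for p, u in problem[action])

def eu_of_policy(policy, problem):
    """EU, measured by the original agent's utilities, of letting
    `policy` face the problem in its place."""
    return eu_of_action(problem, policy(problem))

def would_self_modify(original, candidates, problem):
    """Self-modification, in this model: agreeing to delegate the problem
    to any candidate agent expected to do better than oneself."""
    baseline = eu_of_policy(original, problem)
    return any(eu_of_policy(c, problem) > baseline for c in candidates)

# Toy problem with two actions.
problem = {"a": [(1.0, 1.0)], "b": [(1.0, 5.0)]}

maximizer = lambda prob: max(prob, key=lambda a: eu_of_action(prob, a))
lazy      = lambda prob: "a"  # a deliberately worse policy

print(would_self_modify(lazy, [maximizer], problem))   # the lazy agent would delegate
print(would_self_modify(maximizer, [lazy], problem))   # the maximizer would not
```

On an action-determined problem like this one, no candidate can beat an ideal expected-utility maximizer, which is the sense in which such an agent never wants to self-modify.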
The set of real world strictly action-determined problems is empty, the concept is similar to that of an ideal straight line, it is a useful approximation not a real category.
The strict definition of action-determined problem is something like this:
agent comes into existence, out of nowhere, in a way that is completely uncaused within the universe and could not have been predicted by its contents
agent is presented with list of options
agent chooses one option
agent disappears
I think the last part may not be strictly necessary, but I’m unsure. The first is necessary; it is what separates action-determined problems from broader categories like decision-determined problems and identity-determined problems.
We seem to be agreed that it is possible to define mathematical situations in which self-modification has a well-defined meaning, and that it doesn’t have a well-defined meaning for an AI that exists in the real world and is planning actions in the real world. We don’t know how to generalize those mathematical situations so they are more relevant to the real world.
We differ in that I don’t want to generalize those mathematical situations to work with the real world. I’d rather discard them. You’d rather try to find a use for them.
I suppose clarifying all that is a useful outcome for the conversation.
Outside of ‘electron’, ‘quark’, and ‘neutrino’, almost none of the words we use are well-defined on the real world. All non-fundamental concepts break if you push them hard enough.
I think they are useful in that I have a pretty good idea of what I mean by ‘self-modification’ in the real world. For a simpler example, if I want to build a paperclipping AI, the sort of thing I’m looking to avoid is where for some reason my paperclipping AI starts making something pointless and stupid, like staples. I wish to study self-modification, because I want to stop it from modifying itself into a staple-maker. I may not know exactly what counts as self-modification, but the correct response is not to ignore it and say ‘oh, I’m sure it will all work out fine either way’.
Yes, making it rigorous will be difficult. Yudkowsky himself has said he thinks that 95% of the work will be in figuring out which theorem to prove. The correct response to a difficult problem is not to run away.
I’m not suggesting running away. I’m suggesting that the rigorous statement of the theorem will not include the notions of self-modification (my definition) or self-modification (your definition), since we don’t have rigorous definitions of those terms that apply outside of a counterfactual mathematical formalism.
You’re saying engineering is a special case of self-modification, and I’m saying that self-modification is a special case of engineering, so we seem to agree that they’re the same thing and we’re arguing about what to call it.
IMO “self-modification” is a misleading thing to call it, since you’ve defined the term to include constructing an entirely new AI. The AI doesn’t have a self, it’s just a collection of hardware.
However, I don’t like to debate definitions so I won’t belabor the point.
One individual used timeless reasoning to lose 100 pounds.
It does give different answers for problems like the Prisoner’s Dilemma when your opponent is similar enough to you that they will make the same decisions. As you mentioned, it makes an appearance in HP:MoR for similar reasons. There’s no obvious application to Death Note, but I think it could certainly be incorporated somehow. If you’ve seen the film Memento, you might have some idea of what I mean. (I don’t want to spoil Death Note because it really is an excellent anime series, so I’m not going to say exactly what I was thinking.) TDT is certainly not essential to rationality but it is very interesting, so it might be worth including in a Death Note re-write for that reason alone.
Even if they can’t read your mind with 100% accuracy, if they have some ability to predict your cognition, CDT will go astray—for example, in Parfit’s Hitchhiker.
Also, TDT allows co-operation in prisoners’ dilemmas with copies of yourself.
Good grief, please no. Don’t ruin yet another franchise.
Ruin? Or make better?
If this is an assertion that MoR made Harry Potter better, then I have to disagree with that.
Arguably, the biggest mistake Light made was one of abstract strategy: he started using the Death Note almost immediately after obtaining it. He should have spent many years testing the thing, pondering its implications, studying police work, etc, before putting his plan into action.
I can’t help but think that that represents a serious privileging of the hypothesis—given a little black notebook claiming such absurd powers, you shouldn’t carefully devise 20 different studies which try to falsify your various theories and inferences about its powers & limitations.
Unless you mean that after he verified that the Death Note did in fact kill supernaturally as claimed (after the biker and hostage-taker, I suppose), he should have gone into scientist mode?
In that case, my first thought is that from Light’s perspective, delay is massive waste (all those dead people murdered by people who should be dead, eg.) and he thought he could handle any challenges that came his way. Which he was almost right about, after all.
Not as big a waste as getting caught. Given the power to change the world, one should carefully think about how this power could be taken away before starting to do low-utility things like eliminating criminals.
Big DN fan, my thoughts:
1) Only a mistake if you consider his goal to be “kill as many people as possible” rather than “reduce crime as much as possible”, and for the latter the small loss of anonymity may well be a justified sacrifice for the deterrent effect he could achieve by exposing his own existence. Especially since, as you point out, he might well have been discovered anyway.
2) Yep, pretty big mistake there.
3) I think you slightly under-rate this one by not considering that L can’t always eliminate people with certainty. Prior to this, it would have been possible that Kira was not Japanese but was timing his kills to make it look like he was, to lead the police awry. This test made that hypothesis a lot less likely.
4) Agreed, this is the big screw-up, also probably the one that most of the viewers could have been expected to spot.
5) Bear in mind he was actually quite careful to prevent Penber from being singled out, although he could have done better by delaying all the killings for a week or so. Misora would have narrowed down his anonymity even more had she not been killed.
For his optimal strategy, might he not have been even better off by deliberately sending misleading information, by timing the killings to indicate he lived somewhere else for example? After all, applying your strategy might well narrow it down to ‘people who know information theory’ which probably costs quite a few bits.
I more or less agree with you on point 1. A rational person could have reasoned in that way. But I think we have to say that Light did not. He wanted people to recognize his work when it came to killing apparent criminals because he wanted admiration as a goal in itself. This led to the most obviously avoidable mistake, #3.
I disagree, even in the very first episode he specifically outlines that part of his plan is that when people notice criminals are dying they will be less inclined to become criminals.
I wouldn’t say #3 was that easily avoidable, I didn’t see it coming myself, while in #4 it was all I could do to restrain myself from yelling ‘idiot!’ at the screen.
Yes, I believe that on the level of explicit reasoning he wants to kill criminals with heart attacks to deter crime (and use deaths of other kinds to secretly dispose of people who he thinks don’t contribute). Then he gets agitated and kills Lind with a heart attack before verifying that he needs to kill Lind at all. This supports the theory that (like any other fascist dictator) he wants admiration and obedience more than a better world.
But… but… Light actually won, didn’t he? At least in the short run—he managed to defeat L. I was always under the impression that some of these “mistakes” were committed by Light deliberately in order to lure L.
You think Light won? Gosh, you need to read my other essay then, Death Note Ending and especially the final section, http://www.gwern.net/Death%20Note%20Ending#who-won
When you talk about the number of bits of anonymity he has once it’s been narrowed down to Kanto, shouldn’t that be the male population of Kanto?
Edit: The section about comparing mistakes also seems somewhat contradictory; first you talk about the number of people excluded (and so the first bit is, by definition, the most valuable) and then by the number of bits (and so the 11 bit mistake is more important than the 1.6 bit mistake). It may help to resolve the tension between the two approaches more explicitly.
Yes, you’re right—I used the total population of Kanto, not the total male population. I should probably rejigger those numbers.
EDIT: OK, I think I fixed that specific error. Fortunately, the mistake had only contaminated a few numbers… I think. Please tell me if I’ve accidentally introduced additional inconsistencies!
I believe I did do this before your comment, in mistake 3 where I discuss what the logarithmic scale buys us.
In general, it should take L about the same amount of work, in a Bayesian sense, to gather one more bit of information regardless of how many he currently has. Thus, quantifying Light’s mistakes in terms of bits conceded is probably the best way to do it.
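To make the bit-counting concrete, here is a minimal sketch; the population figures are rounded illustrative assumptions, not the essay’s exact numbers:

```python
import math

def anonymity_bits(population: int) -> float:
    """Bits of anonymity: log2 of the number of people you could be."""
    return math.log2(population)

# Rounded, illustrative figures:
world      = 7_000_000_000   # start: Kira could be anyone alive
kanto      = 43_000_000      # after the timing evidence points to Kanto
kanto_male = 21_500_000      # roughly half that, once Kira is known to be male

print(anonymity_bits(world))                            # ~32.7 bits to start
print(anonymity_bits(kanto))                            # ~25.4 bits remaining
print(anonymity_bits(world) - anonymity_bits(kanto))    # ~7.3 bits conceded
print(anonymity_bits(kanto) - anonymity_bits(kanto_male))  # exactly 1 more bit
```

The logarithmic scale is what makes mistakes comparable: halving the suspect pool always costs exactly one bit, whether the pool is the whole planet or a single city.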
Have you seen the live-action movie version of Death Note? The pair of two-hour movies cover roughly the first season of the anime, but they have a different ending, one inspired by a popular fan theory...
I watched one of the Death Note movies, but I really can’t remember anything about them except L killing himself with a delayed Death Note sentence, or something like that, and how horrible the CGI Ryuk looked.
Are they worth watching, quality-wise?
I liked them; I’ve never read the manga or watched the anime, so I can’t say which version is best.
I like Death Note, but I found “Liar Game” to be more realistic—at least I personally learned more psychology from it. What do you guys think?