Thoughts after the Wolfram and Yudkowsky discussion

I recently listened to the discussion between Wolfram and Yudkowsky about AI risk. In some ways this conversation was tailor-made for me, so I’m going to write some things about it and try to get it out in one day instead of letting it sit in my drafts for three weeks as I tend to do. Wolfram has lately been obsessed with fundamental physics, which is a special interest of mine. Yudkowsky is one of the people thinking most carefully about powerful AI, which I think will kill us all, and I’d like to firm up that intuition. Throw them on a podcast for a few hours, and you have my attention.
That said, for the first hour I was just incredibly frustrated. Wolfram keeps running down rabbit holes that are basically “aha! You haven’t thought about [thing Yud wrote ten thousand words on in 2008]!” But a miracle happens somewhere in the second hour and Wolfram starts asking actually relevant questions! His framework of small accidental quirks in machine learning algorithms leading to undesired behavior later was pointing at an actual issue. It was kind of a joy listening to two smart people trying to mutually get on the same page. Wolfram starts out bogged down in minutiae about what ‘wanting’ is and whether it constitutes anthropomorphism, but finally finds a more abstract framing about steering toward goals and tries to see Yudkowsky’s point in terms of the relative dangers of different regions of the space of goals under sufficient optimization. The abstraction was unfortunate in some ways, because I was interested in some of the minutiae once they were both nearly talking about the same thing. But if Wolfram kept running down rabbit holes like “actually quarks have different masses at different energy scales” whenever Yudkowsky said something like “the universe runs on quarks everywhere all at once no matter what we think the laws of physics are,” they were never going to get to any of the actual arguments. Then again, I don’t see how Wolfram would have gotten anywhere close to the actual point otherwise; maybe the rabbit holes were necessary to get there.
My impression was that Yudkowsky was frustrated that he couldn’t get Wolfram to say, “actually everyone dying is bad and we should figure out whether that happens from our point of view.” There was an interesting place where something like this played out during one of Wolfram’s physics detours. He said something I agree with, which is that the concept of space is largely one we construct, and even changing our perception by the small adjustment of “think a million times faster” could break that construct. He argued that an AI might have a conception of physics which is totally alien to us and also valid. However, he then said it would still look to us like it was following our physics, without making the (obvious to me) connection that we could just consider it in our reference frame if we want to know whether it kills us. This was emblematic of several rabbit holes. Yudkowsky would say something like “AI will do bad things” and Wolfram would respond with something like “well, what is ‘bad’ really?” It would have been, in my view, entirely legitimate to throw out disinterested empiricism and just say: from our point of view, we don’t want to all die, so let’s figure out whether that happens. We might mess up the fine details of the AI’s subjective experience or what its source code is aiming for, but we can just draw a circle around the things that, from our point of view, steer the universe toward certain configurations and ask whether we’ll like those configurations.
I was frustrated by how long they spent finding a framework they could both work in. At the risk of making a parody of myself, part of me wished that Yudkowsky had chosen to talk to someone who had read the Sequences. But aside from the selection issues inherent in only arguing with people who have already read a bunch of Yudkowsky, I don’t think it would help anyway. This conversation was in some ways less frustrating to me than the one Yudkowsky had with Ngo a few years ago, and Ngo has steeped himself in capital-R Bay Area Rationalism. As a particular example, it seemed to me like Ngo thought you could train an AI to make predictions about the world and then be free to use those predictions to do things in the world, because you only asked the AI to make a prediction instead of to do anything. I don’t see how that isn’t isomorphic to saying you can stop someone from ever making bad things happen by having them tell you what to do and then doing it yourself instead of letting them do it. Maybe this was a deficiency of security mindset, maybe it was experience-based intuition about the type of AI that would arise from current research trends, or who knows, but I kept thinking to myself that Ngo wasn’t thinking outside the box enough when he argued against doom. In that sense, Wolfram was more interesting to listen to, because he actually chased down the idea of where bizarre goals might come from in gradient descent, abstracted that out to “AI will likely end up with at least one subgoal, somewhere in the space of goals, that wasn’t really intended,” and then considered the question of whether an arbitrary goal is, on average, lethal. His intuition seemed to be that if you fill out every goal in goal space you end up with something like the set of every possible mollusk shell, each of which ends up serving some story in the environment. He didn’t have an intuition for goal+smart=omnicide, and he also got too hung up on what “goal” and “smart” actually “mean” rather than just running with the thing Yudkowsky is clearly aiming at, even if Yudkowsky uses anthropomorphism to point at it. At least he ended up with something that seemed to directionally resemble Yudkowsky’s actual concerns, even if it wasn’t what he wanted to talk about for some reason. Also, Wolfram gets to the end and says “hey man, you should firm up your back-of-envelope calculations because we don’t have shared intuition,” when the thing Yudkowsky had been trying to do with him for the past three hours was firm up those intuitions.
I keep listening to Yudkowsky argue with people about AI ruin because I have intuitions for why it is hard to create AI that won’t kill us, but I think that Yudkowsky thinks it’s even harder than I do, and I don’t actually know why. I get that something that is smart and wants something a lot will tend to get the thing, even if killing me is a consequence. But my intuition says that AI will have goals which lead it to kill me primarily because humans are bad at building AI to the specifications they intended, rather than because goals are inherently dangerous. The current regime of AI development, where we just kind of take random walks through the space of linear algebra until we get algorithms that do what we want, seems like an obviously good way to make something sort of aligned with us, with wild edge cases that will kill us once it generalizes. Even if we were creating our algorithms by hand, I can just look out at the world of code full of bugs and easily imagine a bug that only shows up as a misaligned goal once the AI is deployed out in the world and too smart to stop. Still, I get the feeling that I’m missing the point somehow, and that Yudkowsky would say we have a big chance of doom even if our algorithms were written by hand by programmers whose algorithms always did exactly what they intended, even when combined with their other algorithms. I’m guessing there is a counterfactual problem set I could complete that would help me truly understand why most perfect algorithms that recreate a strawberry at the molecular and cellular level destroy the planet as well. Yudkowsky has said that he’s not even sure it would be aligned if you took his brain and ran it many times faster with more memory. I’ve read enough Dath Ilan fiction to guess that he’s (at least) worried about something in the class of “human brains have some exploitable vulnerability that leads to occasional optical illusions in the current environment but leads to omnicide out of distribution,” but I’m not sure that’s right because I haven’t yet seen someone ask him that question. People keep asking him to refute terribly clever solutions which he already wrote about not working in 2007, rather than actually nailing down why he’s worried.
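To make that intuition a little more concrete, here is a toy sketch of the “sort of aligned, with wild edge cases” failure mode. It’s entirely my own construction, not anything from the podcast: the corridor environment, the features, and the numbers are all made up for illustration. A tiny policy is trained by random search on episodes where the goal always happens to sit in the same place, so the search usually returns a policy that just goes right rather than one that tracks the goal, and the difference only shows up off the training distribution.

```python
import random

CORRIDOR = 10   # positions 0..10
START = 5
MAX_STEPS = 12

def run_episode(weights, goal_pos):
    """Return 1 if the policy reaches the goal within MAX_STEPS, else 0."""
    pos = START
    for _ in range(MAX_STEPS):
        features = [1.0, pos / CORRIDOR, goal_pos / CORRIDOR]
        score = sum(w * f for w, f in zip(weights, features))
        pos += 1 if score > 0 else -1      # linear policy: right if score > 0
        pos = max(0, min(CORRIDOR, pos))
        if pos == goal_pos:
            return 1
    return 0

def train(candidates=2000):
    """Random search over weight vectors; the goal is ALWAYS at 9 during training."""
    best, best_score = None, -1
    for _ in range(candidates):
        w = [random.uniform(-1, 1) for _ in range(3)]
        score = run_episode(w, goal_pos=9)
        if score > best_score:
            best, best_score = w, score
    return best

random.seed(0)
policy = train()
print("goal at 9 (training distribution):", run_episode(policy, goal_pos=9))  # typically 1
print("goal at 1 (off distribution):     ", run_episode(policy, goal_pos=1))  # typically 0
```

Nothing stopped the search from finding weights that actually use the goal-position feature; it just had no reason to, because that feature never varied in training. That is roughly the shape of “bad at making AI to the specifications we intended” that I find easy to believe.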
If I were going to try to work out for myself why (or whether) even humans who build exactly what they intend to build get AI wrong on their first try, instead of wistfully hoping Yudkowsky explains himself better some day, I would probably follow two threads. One is instrumental convergence, which leads anything going hard enough to move toward collecting all available negentropy (or money, or power, depending on the limits of the game; hopefully I don’t have to explain this one here). I don’t actually get why almost every goal will make an AI go hard enough, but I can imagine an AI told to build as much manufacturing capability as possible going hard enough, and that’s an obvious place to point an AI, so I guess the world is already doomed. The second is to start with simple goals like paperclips or whatever and build some argument that generalizes from discrete physical goals, which are obviously lethal if you go hard enough, to complex goals like “design, but do not implement, a safe fusion reactor” that it seems obvious to point an AI at. I suppose it doesn’t matter whether I figure this out, because I’m already convinced AI will kill us if we keep doing what we’re doing, so why chase down edge cases where we die anyway along paths that humanity doesn’t seem to possess enough dignity to pursue? Somehow I find myself wanting to know anyway, and I don’t yet have the feeling of truly understanding.
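On that first thread, the piece I do find easy to make concrete is the optionality part of the intuition: for most randomly chosen goals, the first move that preserves more options (more resources, more reachable states) is the better one. Here is a tiny Monte Carlo back-of-envelope version, again entirely my own toy and not anything from the conversation; the branch sizes are arbitrary.

```python
import random

def sample_goal(n_outcomes):
    # A "goal" is just a random utility assigned to each terminal outcome.
    return [random.random() for _ in range(n_outcomes)]

random.seed(0)
trials = 100_000
narrow_outcomes = 2   # branch A: few reachable end states
broad_outcomes = 8    # branch B: many reachable end states (more "power")

prefer_broad = 0
for _ in range(trials):
    utilities = sample_goal(narrow_outcomes + broad_outcomes)
    best_narrow = max(utilities[:narrow_outcomes])
    best_broad = max(utilities[narrow_outcomes:])
    prefer_broad += best_broad > best_narrow

print(f"fraction of random goals preferring the high-option branch: "
      f"{prefer_broad / trials:.2f}")   # ~0.80, i.e. broad / (broad + narrow)
```

That’s obviously a cartoon: real goals aren’t i.i.d. uniform over outcomes, and “keep more options open” is a long way from “disassemble the biosphere for negentropy.” But it’s the shape of the argument I’d want to firm up, and the kind of shared back-of-envelope calculation Wolfram was asking for at the end.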