I disagree with your objections.

“The first argument–paperclip maximizing–is coherent in that it treats the AGI’s goal as fixed and given by a human (Paperclip Corp, in this case). But if that’s true, alignment is trivial, because the human can just give it a more sensible goal, with some kind of “make as many paperclips as you can without decreasing any human’s existence or quality of life by their own lights”, or better yet something more complicated that gets us to a utopia before any paperclips are made”
This argument is essentially addressed by this post, and has many failure modes. For example, if you specify the superintelligence’s goal as in the example you gave, its optimal solution might be to cryopreserve the brain of every human in a secure location and prevent any attempt an outside force could make at interfacing with them. You realize this, and so you specify something like “Make as many squiggles as possible whilst leaving humans in control of their future”. The intelligence is quite smart and quite general, so it can comprehend what you mean when you say “we want control of our future”, but then BayAreaAILab#928374 trains a superintelligence designed to produce squiggles without this limit, and it outcompetes the aligned intelligence, because humans are much less efficient than inscrutable matrices.

This is not even mentioning issues with inner alignment and mesa-optimizers. You start to address this with:
“AGI-risk argument responds by saying, well, paperclip-maximizing is just a toy thought experiment for people to understand. In fact, the inscrutable matrices will be maximizing a reward function, and you have no idea what that actually is, it might be some mesa-optimizer”
But I don’t feel as though your reference to Eliezer’s Twitter loss-drop fiasco and the subsequent argument regarding GPU maximization successfully refutes the claims regarding mesa-optimization. Even if GPU-maximizing mesa-optimization were intractable, what about the potentially infinite number of other mesa-optimizer configurations that could result?
“You don’t know that human brains can be hacked using VR headsets; it has never been demonstrated that it’s possible and there are common sense reasons to think it’s not. The brain is an immensely complicated, poorly-understood organ. Applying a lot of computing power to that problem is very unlikely to yield total mastery of it by shining light in someone’s eyes”
When Eliezer talks about ‘brain hacking’, I do not believe he means by dint of a virtual reality headset. Psychological manipulation is an incredibly powerful tool, and if anyone could manipulate humanity, it would be a superintelligence. Furthermore, if said intelligence models humans by simulating them (a strategy which that post argues is likely, assuming a large capabilities gap between humanity and a hypothetical superintelligence), such manipulation becomes all the more feasible.
“As I said before, I’m very confused about how you get to >90% chance of doom given the complexity of the systems we’re discussing”
The analogy of “forecasting the temperature of the coffee in 5 minutes” vs. “forecasting that, if left alone, the coffee will get cold at some point” seems relevant here. Without making claims about the intricacies of the future state of a complex system, you can make high-reliability inferences about its future trajectory in more general terms. This is how I see AI x-risk claims. If the claim were that there is a 90% chance that a superintelligence will render humanity extinct and that it will have some specific architecture x, I would agree with you, but I feel Eliezer’s forecast is general enough to be reliable.
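To make the analogy concrete, here is a minimal Python sketch of Newton’s law of cooling (my own illustrative toy; the ambient temperature, starting temperature, and candidate cooling constants are all made-up numbers). The fine-grained forecast (the temperature at exactly five minutes) swings widely with the unknown cooling constant k, while the coarse forecast (the coffee ends up at room temperature) barely moves at all.

    # Toy model of the coffee analogy (illustrative numbers only).
    # Newton's law of cooling: T(t) = ambient + (T0 - ambient) * exp(-k * t)
    import math

    AMBIENT = 20.0   # assumed room temperature, degrees C
    T0 = 90.0        # assumed starting temperature, degrees C

    def temperature(t_minutes: float, k: float) -> float:
        return AMBIENT + (T0 - AMBIENT) * math.exp(-k * t_minutes)

    for k in (0.05, 0.1, 0.2, 0.4):  # plausible cooling constants, per minute
        print(f"k={k:.2f}: T(5 min) = {temperature(5, k):5.1f} C, "
              f"T(5 hours) = {temperature(300, k):5.1f} C")

Whatever k turns out to be, the long-run forecast is essentially the same; that is the sense in which a general claim about a system’s trajectory can be far more reliable than a precise claim about its state.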
Thanks for your reply. I welcome an object-level discussion, and appreciate people reading my thoughts and showing me where they think I went wrong.
1. The hidden-complexity-of-wishes stuff is not persuasive to me in the context of an argument that AI will literally kill everyone. If we wish for it not to, there might be some problems with the outcome, but it won’t kill everyone. As for Bay Area Lab 9324 doing something stupid: by the time thousands of labs are doing this, if we have been able to successfully wish for things without triggering catastrophe, it will be relatively easy to wish for universal controls on the wishing technology.
2. “Infinite number of possible mesa-optimizers”: this feels to me like invoking an unknown unknown and then asserting that we’re all going to die; it seems to be missing some steps.
3. You’re wrong about Eliezer’s assertions about hacking; he 100% does believe it can be done by dint of a VR headset. I quote:
“—Hack a human brain—in the sense of getting the human to carry out any desired course of action, say—given a full neural wiring diagram of that human brain, and full A/V I/O with the human (eg high-resolution VR headset), unsupervised and unimpeded, over the course of a day: DEFINITE YES
—Hack a human, given a week of video footage of the human in its natural environment; plus an hour of A/V exposure with the human, unsupervised and unimpeded: YES”
4. I get the analogy of all roads leading to doom, but it’s just very obviously not like that, because the outcome depends on complex systems that are very hard to understand, and AI x-risk proponents are among the loudest voices emphasizing that opacity.
Soft upvoted your reply, but I have some objections. I will respond using the same numbering system you did, so that point 1 in my reply addresses point 1 of yours.
1. I agree with this in the context of short-term extinction (i.e. at or near the deployment of AGI), but would offer that an inability to remain competitive and a loss of control are still likely to end in extinction, just in a less cinematic and instantaneous way. Accordingly, the potential horizon for extinction-contributing outcomes expands massively. Although Yudkowsky is most renowned for hard takeoff, soft takeoff has a very differently shaped extinction-space and is (I would assume) a partial reason for his high doom estimate. Although I cannot know this for sure, I would imagine he has a >1% credence in soft takeoff. ‘Problems with the outcome’ seem highly likely to extend to extinction given time.
2. There are (probably) an infinite number of possible mesa-optimizers. I don’t see any reason to assume an upper bound on potential mesa-optimizer configurations, and yes, this is not a ‘slam dunk’ argument. Rather, starting from the notion that even slightly imperfect outcomes can extend to extinction, I was suggesting that you are trying to search an infinite space for a quark that fell out of your pocket some unknown amount of time ago whilst you were exploring said space. This can be summed up as ‘it is not probable that some mesa-optimizer selected by gradient descent will ensure a Good Outcome’.
3. This still does not mean that the only form of brain hacking is via highly immersive virtual reality. I recall the Tweet that this comment came from, and I interpreted it as a highly extreme and difficult form of brain hacking used to prove a point (the point being that if an ASI could accomplish this, it could easily accomplish psychological manipulation). Eliezer’s breaking-out-of-the-sandbox experiments circa 2010 (I believe?) are a good example of this.
4. Alternatively, you could claim some semi-arbitrary but lower extinction risk like 35%, yet the same objections can be made against a milder forecast like that. Why is assigning a 35% probability to an outcome more epistemically valid than assigning >90%? Criticizing forecasts based on their magnitude seems difficult to justify in my opinion; critiques should rely on argument alone.
I disagree with the OP’s objections, too, but that’s explicitly not the point of this post. The OP is giving us an outside take on how our communication is working, and that’s extremely valuable.
Typically, when someone says you’re not convincing them, “you’re being dumb” is itself a dumb response. If you want to convince someone of something, making the arguments clear is mostly your responsibility.