I think I agree that they may have been wrong to update as far as they did. (Credence = 50%) So maybe we don’t disagree much after all.
As for sources which provide that justification, oh, I don’t remember, I’d start by rereading Superintelligence and Yudkowsky’s old posts and try to find the relevant parts. But here’s my own summary of the argument as I understand it:
1. The goals that we imagine superintelligent AGI having, when spelled out in detail, have ALL so far been the sort that would very likely lead to existential catastrophe of the instrumental convergence variety.
2. We’ve even tried hard to imagine goals that aren’t of this sort, and so far we haven’t come up with anything. Things that seem promising, like “Place this strawberry on that plate, then do nothing else” actually don’t work when you unpack the details.
3. Therefore, we are justified in thinking that the vast majority of possible ASI goals will lead to doom via instrumental convergence.
I agree that our thinking has improved since then, with more work being done on impact measures and bounded goals and quantilizers and whatnot that makes such things seem not-totally-impossible to achieve. And of course the model of ASI as a rational agent with a well-defined goal has justly come under question also. But given the context of how people were thinking about things at the time, I feel like they would have been justified in making the “vast majority of possible goals” claim, even if they restricted themselves to more modest “wide range” claims.
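For concreteness, here is a minimal toy sketch of the quantilizer idea just mentioned, to show why it makes bounded-ish behaviour seem not-totally-impossible. This is not anyone's actual proposal; the action names and numbers are invented purely for illustration.

```python
import random

def quantilize(actions, base_probs, utility, q=0.1):
    """Toy q-quantilizer: sample from a base distribution over actions,
    restricted (approximately) to the top-q fraction of base probability
    mass as ranked by utility, instead of outright maximizing utility."""
    # Rank actions from highest to lowest utility.
    ranked = sorted(zip(actions, base_probs), key=lambda ap: utility(ap[0]), reverse=True)
    # Keep the best actions until they cover roughly q of the base probability mass.
    kept, mass = [], 0.0
    for action, p in ranked:
        kept.append((action, p))
        mass += p
        if mass >= q:
            break
    # Sample from the kept actions in proportion to their base probabilities.
    return random.choices([a for a, _ in kept], weights=[p for _, p in kept])[0]

# Invented example: "take over workshop" has huge utility but tiny base probability,
# so a pure maximizer always picks it, while this quantilizer picks it only ~2% of
# the time here, and never with probability more than base_prob / q (10% here).
actions = ["place strawberry", "wait", "ask for help", "take over workshop"]
base_probs = [0.50, 0.30, 0.19, 0.01]
utility = {"place strawberry": 1.0, "wait": 0.0, "ask for help": 0.5, "take over workshop": 100.0}.get
print(quantilize(actions, base_probs, utility, q=0.1))
```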
I don’t see how my analogy is only relevant conditional on this claim. To flip it around, you keep mentioning how AI won’t be a random draw from the space of all possible goals—why is that relevant? Very few things are random draws from the space of all possible X, yet reasoning about what’s typical in the space of possible X’s is often useful. Maybe I should have worked harder to pick a more real-world analogy than the weird loaded die one. Maybe something to do with thermodynamics or something—the space of all possible states my scrambled eggs could be in does contain states in which they spontaneously un-scramble later, but it’s a very small region of that space.
1. The goals that we imagine superintelligent AGI having, when spelled out in detail, have ALL so far been the sort that would very likely lead to existential catastrophe of the instrumental convergence variety.
2. We’ve even tried hard to imagine goals that aren’t of this sort, and so far we haven’t come up with anything. Things that seem promising, like “Place this strawberry on that plate, then do nothing else” actually don’t work when you unpack the details.
Okay, this is where we disagree. I think what “unpacking the details” actually gives you is something like: “We don’t know how to describe the goal ‘place this strawberry on that plate’ in the form of a simple utility function over states of the world which can be coded into a superintelligent expected utility maximiser in a safe way”. But this proves far too much, because I am a general intelligence, and I am perfectly capable of having the goal which you described above in a way that doesn’t lead to catastrophe—not because I’m aligned with humans, but because I’m able to have bounded goals. And I can very easily imagine an AGI having a bounded goal in the same way. I don’t know how to build a particular bounded goal into an AGI—but nobody knows how to code a simple utility function into an AGI either. So why privilege the latter type of goals over the former?
Also, in your previous comment, you give an old argument and say that, based on this, “they would have been justified in making the “vast majority of possible goals” claim”. But in the comment before that, you say “I disagree that we have no good justification for making the “vast majority” claim” in the present tense. Just to clarify: are you defending only the past tense claim, or also the present tense claim?
given the context of how people were thinking about things at the time, I feel like they would have been justified in making the “vast majority of possible goals” claim, even if they restricted themselves to more modest “wide range” claims.
Other people being wrong doesn’t provide justification for making very bold claims, so I don’t see why the context is relevant. If this is a matter of credit assignment, then I’m happy to say that making the classic arguments was very creditworthy and valuable. That doesn’t justify all subsequent inferences from them. In particular, a lack of good counterarguments at the time should not be taken as very strong evidence, since it often takes a while for good criticisms to emerge.
Again, I’m not sure we disagree that much in the grand scheme of things—I agree our thinking has improved over the past ten years, and I’m very much a fan of your more rigorous way of thinking about things.
FWIW, I disagree with this:
But this proves far too much, because I am a general intelligence, and I am perfectly capable of having the goal which you described above in a way that doesn’t lead to catastrophe—not because I’m aligned with humans, but because I’m able to have bounded goals.
There are other explanations for this phenomenon besides “I’m able to have bounded goals.” One is that you are in fact aligned with humans. Another is that you would in fact lead to catastrophe-by-the-standards-of-X if you were powerful enough and had different goals from X. For example, suppose that right after reading this comment, you find yourself transported out of your body and placed into the body of a giant robot on an alien planet. The aliens have trained you to be smarter than them and faster than them; it’s a “That Alien Message” scenario, basically. And you see that the aliens are sending you instructions… “PUT BERRY… ON PLATE… OVER THERE…” You notice that these aliens are idiots and left their work lying around the workshop, so you can easily kill them, take command of the computer, and rescue all your comrades back on Earth and whatnot, and it really doesn’t seem like this is a trick or anything, they really are that stupid… Do you put the strawberry on the plate? No.
What people discovered back then was that you think you can “very easily imagine an AGI with bounded goals,” but this is on the same level as how some people think they can “very easily imagine an AGI considering doing something bad, and then realizing that it’s bad, and then doing good things instead.” Like, yeah it’s logically possible, but when we dig into the details we realize that we have no reason to think it’s the default outcome and plenty of reason to think it’s not.
I was originally making the past tense claim, and I guess maybe now I’m making the present tense claim? Not sure, I feel like I probably shouldn’t, you are about to tear me apart, haha...
Other people being wrong can sometimes provide justification for making “bold claims” of the form “X is the default outcome.” This is because claims of that form are routinely justified on even less evidence, namely no evidence at all. Implicit in our priors about the world are bajillions of claims of that form. So if you have a prior that says AI taking over is the default outcome (because AI not taking over would involve something special like alignment or bounded goals or whatnot), then you are already justified, given that prior, in thinking that AI taking over is the default outcome. And if all the people you encounter who disagree are giving terrible arguments, then that’s a nice cherry on top which provides further evidence.
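To make the shape of that update concrete, here is a toy calculation; the numbers are invented purely for illustration and aren’t anyone’s actual credences.

```python
# Invented numbers, purely to illustrate the shape of the update described above.
prior = 0.7           # structural prior that "AI takeover is the default outcome"
likelihood_ratio = 2  # "everyone who disagrees gives terrible arguments" is ~2x likelier if the prior view is right

prior_odds = prior / (1 - prior)
posterior_odds = prior_odds * likelihood_ratio
posterior = posterior_odds / (1 + posterior_odds)
print(round(posterior, 2))  # ~0.82: the prior does most of the work; the bad counterarguments are the cherry on top
```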
I think ultimately our disagreement is not worth pursuing much here. I’m not even sure it’s a real disagreement, given that you think the classic arguments did justify updates in the right direction to some extent, etc., and I agree that people probably updated too strongly, etc. Though the bit about bounded goals was interesting, and seems worth pursuing.

Thanks for engaging with me btw!
I guess maybe now I’m making the present tense claim [that we have good justification for making the “vast majority” claim]?
I mean, on a very skeptical prior, I don’t think we have good enough justification to believe it’s more probable than not that take-over-the-world behavior will be robustly incentivized for the actual TAI we build, but I think we have somewhat more evidence for the ‘vast majority’ claim than we did before.
(And I agree with a point I expect Richard to make, which is that the power-seeking theorems apply to optimal agents, which may not look much at all like trained agents)
I also wrote about this (and received a response from Ben Garfinkel) about half a year ago.
Currently you probably have a very skeptical prior about what the surface of the farthest earth-sized planet from Earth in the Milky Way looks like. Yet you are very justified in being very confident it doesn’t look like this:

[image: a map of Earth]
Why? Because this is a very small region in the space of possibilities for earth-sized-planets-in-the-Milky-Way. And yeah, it’s true that planets are NOT drawn randomly from that space of possibilities, and it’s true that this planet is in the reference class of “Earth-sized planets in the Milky Way” and the only other member of that reference class we’ve observed so far DOES look like that… But given our priors, those facts are basically irrelevant.
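As a toy illustration of why the update is basically irrelevant (again, the numbers here are invented):

```python
# Invented numbers, just to show why the reference-class observation barely moves the needle.
prior = 1e-9             # prior mass on "the far planet's surface looks like this particular map"
likelihood_ratio = 100   # generous update from "the one other planet we've seen (Earth) does look like this"

prior_odds = prior / (1 - prior)
posterior = (prior_odds * likelihood_ratio) / (1 + prior_odds * likelihood_ratio)
print(posterior)  # ~1e-7: still utterly negligible, so those facts are "basically irrelevant" given the prior
```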
I think this is a decent metaphor for what was happening ten years ago or so with all these debates about orthogonality and instrumental convergence. People had a confused understanding of how minds and instrumental reasoning worked; then people like Yudkowsky and Bostrom became less confused by thinking about the space of possible minds and goals and whatnot, and convinced themselves and others that actually the situation is analogous to this planets example (though maybe less extreme): The burden of proof should be on whoever wants to claim that AI will be fine by default, not on whoever wants to claim it won’t be fine by default. I think they were right about this and still are right about this. Nevertheless I’m glad that we are moving away from this skeptical-priors, burden-of-proof stuff and towards more rigorous understandings. Just as I’d see it as progress if some geologists came along and said “Actually we have a pretty good idea now of how continents drift, and so we have some idea of what the probability distribution over map-images is like, and maps that look anything like this one have very low measure, even conditional on the planet being earth-sized and in the Milky Way.” But I’d see it as “confirming more rigorously what we already knew, just in case, cos you never really know for sure” progress.
The burden of proof should be on whoever wants to claim that AI will be fine by default, not on whoever wants to claim it won’t be fine by default.
I’m happy to wrap up this conversation in general, but it’s worth noting before I do that I still strongly disagree with this comment. We’ve identified a couple of interesting facts about goals, like “unbounded large-scale final goals lead to convergent instrumental goals”, but we have nowhere near a good enough understanding of the space of goal-like behaviour to say that everything apart from a “very small region” will lead to disaster. This is circular reasoning from the premise that goals are by default unbounded and consequentialist to the conclusion that it’s very hard to get bounded or non-consequentialist goals. (It would be rendered non-circular by arguments about why coherence theorems about utility functions are so important, but there’s been a lot of criticism of those arguments and no responses so far.)
OK, interesting. I agree this is a double crux. For reasons I’ve explained above, it doesn’t seem like circular reasoning to me, it doesn’t seem like I’m assuming that goals are by default unbounded and consequentialist etc. But maybe I am. I haven’t thought about this as much as you have, my views on the topic have been crystallizing throughout this conversation, so I admit there’s a good chance I’m wrong and you are right. Perhaps I/we will return to it one day, but for now, thanks again and goodbye!