[it was easier to draw some things vs. write them]
AGI destroys the world by default; a variety of deliberate actions can stop this, but there isn’t a smooth and continuous process that rolls us from 2022 to an awesome future, with no major actions or events deliberately occurring to set us on that trajectory.
This seems to conflate multiple claims. Consider the whole trajectory.
“AGI destroys the world by default” seems clear; I interpret it as “if you straightforwardly extrapolate the past trajectory, we end in catastrophe”.
It’s less clear to me what the rest means.
Option a) “trajectories like in the picture below do not exist”
(note the turn is smooth)
This seems like a very strong claim to me, and highly implausible. Still, if I understand correctly, this is what you put most weight on.
Option b) “trajectories like in a) exist, but it won’t be our trajectory, without significant deliberate efforts”
This seems plausible, although the word “deliberate” introduces some ambiguity.
One way to think about this is in terms of “steering forces” and incentive gradients. In my view it is more likely than not that with increasing power of the systems, parts of “alignment” will become more of a convergent goal for developers (e.g. because aligned systems get better performance, or alignment tools and theory help you with designing more competitive systems). I’m not sure if you would count that as “deliberate” (my guess: no). (Just to be sure: this isn’t to claim that this pull is sufficient for safety.)
In my view the steering forces can become sufficiently strong without any legible “major event”, in particular without any event that is legible as important while it is happening. (As far as I understand, you would strongly disagree.)
In contrast, a pivotal act would look more like this:
I don’t think this is a necessary or even common feature of winning trajectories.
“Complex situations don’t get resolved via phase transitions” and “heroism never makes a big difference in real life” are extremely general objections.
Sorry, but this reads like a strawman of my position. “Heroic changes are mostly not the way you improve the safety of complex systems” is a very different claim from “heroism never makes a big difference in real life”.
To convey the intuition, consider the case of a nuclear power plant. How do you make something like that safe? Basically, not by one strong intervention on one link in a causal graph, but by intervening at a large fraction of the causal graph, and by adding layered defenses that prevent failures from propagating.
Heroic acts obviously can make a big difference. In the case of the nuclear power plant, some scenarios could be saved by a team of heroic firefighters providing emergency cooling. Or, clearly, the Chernobyl disaster would have been prevented if a SWAT team had landed in the control room, shot everyone, and stopped the plant in a safe way.
My claim isn’t that this never works. The only claim is that the majority of the bits of safety originate from different types of interventions. (And I do think this is also true for AI safety.)
There is no natural force…
As is probably clear, I like the forces framing. Note that it feels quite different from the “pivotal acts” framing.
I don’t care that much whether the forces are natural or not, but whether they exist. Actually, I do think one of the more useful things to do about AI safety is:
- think about directions in which you want movement
- think about “types” of forces which may pull in that direction (where the “type” could be e.g. profit incentives from the market, cultural incentives, or instrumental technological usefulness)
- think about what sort of a system is able to exert such a force (where the type could be e.g. an individual engineer, a culture-based superagent, or even a useful math theory)
This 3D space gives you a lot of combinations. Compare, choose, and execute.
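To make the combinatorics concrete, here is a minimal Python sketch of enumerating that 3D space. The entries on the “directions” axis and the scoring function are hypothetical placeholders, not anything from this conversation; the force types and actor types are the examples from the list above.

```python
from itertools import product

# Hypothetical placeholder directions; the other two axes use the examples from the list above.
directions = ["direction 1", "direction 2", "direction 3"]
force_types = ["profit incentives from the market", "cultural incentives", "instrumental technological usefulness"]
actors = ["individual engineer", "culture-based superagent", "useful math theory"]

# Every (direction, force type, actor) triple is a candidate intervention to compare.
candidates = list(product(directions, force_types, actors))

def score(candidate):
    """Placeholder for the 'compare' step: rank candidates by leverage, tractability, etc."""
    return 0  # replace with an actual judgment or model

# 'Choose and execute': pick the most promising combinations.
for direction, force, actor in sorted(candidates, key=score, reverse=True)[:5]:
    print(f"{actor} pushes via {force} toward {direction}")
```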
At the level of abstraction “complex event”, sure, complicated stuff is often continuous in various ways. …
This isn’t what I mean. I don’t advocate for people to throw out all the details. I mostly advocate for people to project the very high-dimensional real world situation into low-dimensional representations which are continuous, as opposed to categorical.
Moreover, you (and Eliezer, and others) have a strong tendency to discretize the projections in an iterative way. Let’s say you start with “pivotal acts”. In the next step, you discretize the “power of system” dimension: “strong systems” are capable of pivotal acts, “weak systems” are not. In the next step, you use this to discretize a bunch of other dimensions, e.g. weak interpretability tools help with weak systems, but not with strong systems. And so on. The endpoint is just a few actually continuous dimensions, and a longer list of discrete labels.
To be clear: I’m very much in favour of someone trying this. (I expect this to fail, at least for now.)
But I’m also very much in favour of many people trying not to do this, and focusing more on trying different projections, or looking for the steepest local gradient-descent updates from the point where we are now.
But I think EA thus far has mostly made the opposite error, refusing to go concrete and thereby avoiding the pressure and constraint of having to actually plan, face tradeoffs, entertain unpleasant realities, etc. (...)
Sorry, but I’m confused about how the EA label landed here, and I’m a bit worried it has some properties of a red herring. I don’t know if the “you” is directed at me, at “EA” (whatever it is), or at the readers of our conversation.
I think the diagram could be better drawn with at least one axis with a scale like “potential AI cognitive capability”.
At the bottom, in the big white zone, everything is safe and nothing is amazing.
Further up the page, some big faint green “applications of AI” patches appear in which things start to be nicer in some ways. There are also some big faint red patches, many of which overlap the green, where misapplication of AI makes things worse in some ways.
As you go up the page, both the red and green regions intensify, and some of the deeper green regions dead-end into black, representing paths that can no longer be averted from extinction or other uncorrectable bad futures. Some big patches of black start to appear straight in front of white or pale green, representing cases where humanity held off from implementing AGI until they thought alignment was solved, but it went wrong before any benefits could appear.
By the time you reach the top of the page, it is almost all black. There are a few tiny spots of intense green, connected to lower parts of the page only by thin, zig-zag threads that are mostly white. Even at the top of the page, we don’t know which of those brilliant green points might actually lead to dead-ends into black further up.
That’s roughly how I see the alignment landscape: that steering to those brilliant green specks will mostly require avoiding implementing AGI.