Just to state the reigning orthodoxy among the Wise, if not among the general population: the interface between “AI developers” and “one AI” appears to be hugely more difficult and hugely more lethal than, and vastly qualitatively different from, every other interface. There’s a horrible opsec problem with respect to single defectors in the AI lab selling your code to China which then destroys the world, but this horrible opsec problem has nothing in common with the skills and art needed for the purely technical challenge of building an AGI that doesn’t destroy the world, which nobody is at all on course to solve nor has any reasonable plan for solving. There’s a political problem where Earthly governments have no clue what is going on and all such clues lie outside the Overton Window, which, if you had any plan for succeeding at the technical part, would be most reasonably addressed by going off and doing your thing, ignoring the governments; the concept of trying to get major Earth governments on board appears to me to be a proposal made either in ignorance of the reality of political feasibilities, or simply as an act of moral inveighing wholly disconnected from reality. Were such a thing possible, the skills and arts going into it would again be mostly unrelated on a technical level to the problem of building a complicated thing, probably using gradient descent, that will be very smart and will not just kill you.
This looks to me like a simply bad paradigm on which to approach things. The technical problem has no plan, and is going to kill us, and nobody knows how to make progress on it; so instead, we have people who go off and work on something that looks less unsolvable, like humans playing nice with each other inside an AI lab, or writing solemn papers about “AI governance” for politicians to ignore; and then they draw a neat graph suggesting that this more solvable problem has anything to do with the technical alignment challenge, which it does not. People who understand the technical difficulties are remaining relatively quiet because they don’t know what to say; this selects for people who don’t understand technical difficulties becoming the remaining eager workers on AI safety. I cannot, from this writeup, see anything going on here which is not simply that.
To be clear on something, my problem here is not with anybody working on AGI governance. It’s not going to work, but you can imagine something in this area that would let us die with more dignity. It’s not even possible for it to work without a technical solution nobody has, but if you can’t see anything to do about technical solutions, then getting AGI governance into a shape where it would be better placed to handle a technical miracle, if you can do that, lets us die with more dignity. I could wish that people in governance were more frank and open about acknowledging this, and when they don’t acknowledge it, I expect them to be so ignorant of the real difficulties that their governance work will also be counterproductive; but you can imagine there being somebody who understood the real difficulties and acknowledged them and knew that all they were doing, probably, was helping us die with more dignity, and who also understood the real difficulties in politics, and maybe those people would be productive. But when the exposition is drawing strained analogies between the “developers to AGI” interface and the “Earth’s pretense-of-representative-democracy governments to AGI labs” interface, I see no reason for hope; this is just somebody who understands neither kind of problem and is just going to do damage.
My guess is that an attempt to explain where I think we actually differ in “generative intuitions” will be more useful than a direct response to your conclusions, so here it is. How to read it: roughly, this is attempting to just jump past several steps of double-crux to the area where I suspect actual cruxes lie.
Continuity
In my view, your ontology for thinking about the problem is fundamentally discrete. For example, you are imagining a sharp boundary between a class of systems that are “weak, won’t kill you, but also won’t help you with alignment” and a class that is “strong—would help you with alignment, but, unfortunately, will kill you by default”. Discontinuities everywhere—“bad systems are just one sign flip away”, sudden jumps in capabilities, etc. Thinking in symbolic terms.
In my inside view, in reality, things are instead mostly continuous. Discontinuities sometimes emerge out of continuity, sure, but this is often noticeable. If you get some interpretability and oversight things right, you can slow down before hitting the abyss. Also the jumps are often not true “jumps” under closer inspection.
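As a toy numerical sketch of the “not true jumps under closer inspection” point (my own illustration, with an assumed logistic skill curve and an arbitrary 30-step task; none of the numbers come from the comment itself): a per-step skill that improves smoothly with scale can still register as an abrupt capability jump on an all-or-nothing task metric.

```python
# Toy sketch only: smooth underlying improvement vs. an apparently sudden "jump"
# on a thresholded, all-or-nothing task metric. All numbers are made up.
import math

def per_step_skill(compute: float) -> float:
    """Hypothetical per-step success probability, logistic in log10(compute)."""
    return 1 / (1 + math.exp(-(math.log10(compute) - 6)))

STEPS_PER_TASK = 30  # the task "works" only if all 30 sub-steps succeed in a row

print(f"{'compute':>8} {'per-step skill':>15} {'full-task success':>18}")
for exponent in range(3, 12):
    compute = 10.0 ** exponent
    p = per_step_skill(compute)
    # The task-level number sits near zero for most of the range, then climbs fast,
    # even though the per-step skill changes smoothly the whole time.
    print(f"{compute:>8.0e} {p:>15.3f} {p ** STEPS_PER_TASK:>18.5f}")
```

The apparent discontinuity here is an artifact of the metric, not of the underlying curve, which is the sense in which such jumps can be noticeable and monitorable before they arrive.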
I don’t think there is any practical way to reconcile this difference of intuitions—my guess is that intuitions about continuity/discreteness are quite deep-seated, and based more on how people do maths than on any specific observation about the world. In practice, for most people, the “intuition” is something like a deep net trained on a whole life of STEM reasoning: they won’t update on individual datapoints, and if they are smart, they are able to re-interpret the observations to be in line with their view. Also, I think trying to get you to share my continuous intuition is mostly futile—my hypothesis is that this is possibly the top deep crux of your disagreements with Paul, and reading the debates between you two gives me little hope of you switching to a “continuous” perspective.
I also believe that the “discrete” ontology is great for noticing problems and served you well in noticing many deep and hard problems. (I use it to spot problems sometimes too.) At the same time, it’s likely much less useful for solving the problems.
Also, if anything, the way SOTA systems look suggests mostly continuity, stochasticity, “biology”, “emergence”. Usually no proofs, no symbolically verifiable guarantees.
Things will be weird before getting extremely weird
Assuming continuity, things will get weird before getting extremely weird. This likely includes domains such as politics, geopolitics, the experience of individual humans, and so on. My impression is that you are mostly imagining just slightly modified politics, quite similar to today’s. In this context, a gradient-descending model in some datacentre hits the “core of consequentialist reasoning” and we are all soon dead. I see that this is possible, but I bet more on scenarios where we get AGI when politics is very different compared to today.
Models of politics
Actually, we also probably disagree about politics. Correct me if I’m wrong, but your “mainline” winning scenario was and still is something like this: the leading team creates an aligned AGI, this system gets a decisive strategic advantage, and it “solves” politics by forming a singleton (and preventing all other teams from developing AGI). Decisive pivotal acts, and so on.
To me, this seems like an implausible and dangerous theory of how to solve politics in the real world, under continuous takeoff. Continuity will usually mean that no one gets a decisive advantage—the most powerful AI system will still be much weaker than the “rest of the world”, and the rest of the world will fight back against takeover.
Under the “ecosystem” view, we will need to solve “ecosystem alignment”—including possible coordination of the ecosystem to prevent formation of superintelligent and unbounded agents.
(It seems likely this would benefit from decent math, similarly to how the math of MAD was instrumental in us not nuking ourselves.)
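As a rough, hedged gesture at what “the math of MAD” buys (a toy model with made-up payoffs, not a summary of the actual deterrence literature): with a credible second strike, mutual restraint becomes a pure-strategy equilibrium, while without one, restraint is never stable.

```python
# Toy sketch only: brute-force pure-strategy Nash check for a 2x2 "strike/refrain"
# game, with and without an assured second strike. All payoffs are made up.
from itertools import product

STRIKE, REFRAIN = "strike", "refrain"
STRATEGIES = (STRIKE, REFRAIN)

def payoffs(assured_retaliation: bool):
    """(row, col) payoffs for each pure-strategy profile."""
    if assured_retaliation:
        # Any strike triggers retaliation and mutual devastation.
        return {
            (STRIKE, STRIKE): (-100, -100),
            (STRIKE, REFRAIN): (-100, -100),
            (REFRAIN, STRIKE): (-100, -100),
            (REFRAIN, REFRAIN): (0, 0),
        }
    # No second strike: a successful first strike pays off for the attacker.
    return {
        (STRIKE, STRIKE): (-100, -100),
        (STRIKE, REFRAIN): (10, -100),
        (REFRAIN, STRIKE): (-100, 10),
        (REFRAIN, REFRAIN): (0, 0),
    }

def pure_nash(table):
    """Profiles where neither player gains by unilaterally deviating."""
    equilibria = []
    for r, c in product(STRATEGIES, STRATEGIES):
        row_ok = all(table[(r, c)][0] >= table[(alt, c)][0] for alt in STRATEGIES)
        col_ok = all(table[(r, c)][1] >= table[(r, alt)][1] for alt in STRATEGIES)
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria

for assured in (True, False):
    print(f"assured retaliation = {assured}: equilibria {pure_nash(payoffs(assured))}")
```

With assured retaliation, (refrain, refrain) shows up as an equilibrium; without it, no profile in which both sides refrain survives. The point is only that the stability of restraint depends on structural features one can actually compute, which is the kind of leverage the parenthetical above is gesturing at for ecosystem coordination.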
Sociology of AI safety
I think you have a strange model of which position is “quiet”. Your writing is followed passionately by many; as just the latest example, your “dying with dignity” framing got a lot of attention.
My guess is that following you too closely, which many people do, is currently net harmful. I’m sceptical that people who get caught up too much in your way of looking at the problem will make much progress. You’re a master of your way of looking at it: you’ve spent decades thinking about AI safety in this ontology, and you don’t see any promising way to solve the problem.
Conclusion
I think what you parse as “a simply bad paradigm on which to approach things” would start to make more sense if you adopted the “continuous” assumptions, and an assumption that the world would be quite weird and complex at the decisive period.
(Personally, I do understand how my conclusions would change if I adopted a much more “discrete” view, and yes, I would be much more pessimistic about both what I work on and our prospects.)
I think this comment is lumping together the following assumptions under the “continuity” label, as if there is a reason to believe that either they are all correct or all incorrect (and I don’t see why):
There is large distance in model space between models that behave very differently.
Takeoff will be slow.
It is feasible to create models that are weak enough to not pose an existential risk yet able to sufficiently help with alignment.
“I bet more on scenarios where we get AGI when politics is very different compared to today.”
I agree that just before “super transformative” ~AGI systems are first created, the world may look very different than it does today. This is one of the reasons I think Eliezer has too much credence on doom.
To briefly hop in and say something that may be useful: I had a reaction pretty similar to what Eliezer commented, and I don’t see continuity or “Things will be weird before getting extremely weird” as a crux. (I don’t know why you think he does, and don’t know what he thinks, but would guess he doesn’t think it’s a crux either)
I’ve been part of, or read, enough debates with Eliezer to have some guesses about how the argument would go, so I made the move of skipping several steps of double-crux to the area where I suspect the actual cruxes lie.
I think exploring the whole debate-tree or argument map would be quite long, so I’ll just try to gesture at how some of these things are connected, in my map.
- pivotal acts vs. pivotal processes: my take is that people’s stance on the feasibility of pivotal acts vs. processes partially depends on continuity assumptions. What do you believe about pivotal acts?
- assuming continuity, do you expect existing non-human agents to move important parts of their cognition to AI substrates?
-- if yes, do you expect large-scale regulation around that?
--- if yes, will it also be partially automated?
- different route: assuming continuity, do you expect a lot of alignment work to be done partially by AI systems, inside places like OpenAI?
-- if this is at the same time a huge topic for the whole of society, academia, and politics, would you expect the rest of the world not to try to influence this?
- different route: assuming continuity, do you expect a lot of “how different entities in the world coordinate” to be done partially by AI systems?
-- if yes, do you assume the technical features of the systems matter, e.g. multi-agent deliberation dynamics?
- assuming the world notices AI safety as a problem (it has done so much more since this post was written):
-- do you expect a large amount of attention and resources from academia and industry to be spent on AI alignment?
--- would you expect this to be somehow related to the technical problems and how we understand them?
--- e.g., do you think it makes no difference to the technical problem whether 300 or 30,000 people work on it?
---- if it makes a difference, does it make a difference how the attention is allocated?
Not sure if the double-crux between us would rest on the same cruxes, but I’m happy to try :)
The concept of “interfaces of misalignment” does not mainly point to GovAI-style research here (although it may also serve as a framing for GovAI). The concrete domains separated by the interfaces in the figure above are possibly a bit misleading in that sense:
For me, the “interfaces of misalignment” are generating intuitions about what it means to align a complex system that may not even be self-aligned—rather, you are just one aligning part of it. It is expanding not just the space of solutions, but also the space of meanings of “success”. (For example, one extra way to win-lose: consider world trajectories where our preferences are eventually preserved and propagated in a way that we would find repugnant now, but reached via a step-by-step endorsed trajectory.)
My critique of the focus on the “AI developers” and “one AI” interface in isolation is that we do not really know what the “goal of AI alignment” is, and that this focus works with a very informal and somewhat simplistic idea of what aligning AGI means (strawmannable as “not losing right away”).
While a broader picture may seem to only make the problem strictly harder (“now you have 2 problems”), it can also bring new views of the problem. In particular, new views of what we actually want and what it means to win (which one could paraphrase as a continuous and multi-dimensional winning/losing space).
Suppose someone tries to do one of these to push us more towards adequacy on this list, according to this rationale. Do you see that as:
1. Straightforwardly working on the problem
2. A respectable attempt to “die with dignity”
3. A respectable attempt to “die with dignity”, depending on more concrete details about the mental model of the person in question
4. Completely missing the point in a neutral way
5. Actively counterproductive
I don’t see a difference between 1, 2, and 3 in practice as a judgment that could reasonably be made; my general sense of that post is that it falls two-thirds of the way to 3 from 4? All it’s missing is an explicit acknowledgment that it’s just a run at death with dignity. The political parts aren’t written in a way that strikes me as naive, and there’s no attempt to blur the border between the technical problem and the political problem.
The way you phrase the last paragraph of your comment seems to imply that there’s nobody alive working on “AI governance” who attacks the problem at level 3 or above. Do you not see Thane or people with his worldview/action plan as “AI governance” people?
My model of EY doesn’t know what the real EY knows. However, there seems to be overwhelming evidence that non-AI alignment is a bottleneck and that network learning similar to what’s occurring naturally is likely to be a relevant path to developing dangerously capable AI.
For my model of EY, “halt, melt and catch fire” seems overdetermined. I notice I am confused.