This is great, these are substantive issues. I doubt we’ll resolve them just in this thread, but I think these are very worth working through.
As I say that, I remember one lingering doubt: I don’t think anyone will launch a humanity-aligned AGI even if there’s a strong consensus that you’re right and it would work. It seems like the people in charge would always prefer to launch one that’s learning their preferences, rather than the whole species’ aggregated preferences; they prefer their own, by definition.
That might be overcome by public pressure, were it not for the argument that corrigible AGI is fundamentally safer, and you get that with personal intent alignment (that is, corrigible in the Harms or Christiano sense, or instruction-following AGI in my terminology): "this guy really wants me to shut down right now, so I will" vs. humanity-aligned AGI ("humanity would only want me to shut down if I weren't maximizing its preferences, which the AGI thinks it is by definition, even if it's wrong").
So maybe the question of whether humanity is an adequately natural abstraction is not as high a priority to work through? That question seems likely to be addressed only when a human principal is good and ready to hand power over to a humanity's-values-aligned sovereign AGI that's been carefully designed with the help of a superintelligent corrigible assistant AGI.
It’s still interesting. So to give just a couple of points toward future discussions:
Yes, LLMs understand quite well what we mean. That doesn’t mean an agent with an LLM core won’t change its mind if it’s truly self-aware and learns autonomously as people do.
I agree that humanity as it exists can be pretty cleanly defined. Defining it that way for all time means that nobody is allowed to enhance their cognition, or to upload. It also means not giving moral status (or at least “voting” status) to any form of AGI or animal (except for “loaned moral status” based on humans’ collective concern for their welfare). You’ve discussed all of these issues in depth in your series AI, Alignment, and Ethics. This is not a limitation everyone will be okay with.
Leaving a set of relatively smart and well-intentioned humans in charge avoids that limitation, as well as providing a means of error-correcting if our first alignment attempt is importantly off course. It is a basin of attraction, like aiming for humanity's values, but of a very different nature: it includes a human as an active component steering the AGI's values/goals into that attractor.
But to continue on the question of whether human values can be defined adequately for long-term stability: you also need to carefully define in what situation these humans would contemplate and refine their values, because human values seem highly path-dependent in what we’ve seen so far.
A lot of that is probably just expanding on your thought when you say:
[...] (at least near AGI level — I’m less convinced at very high ASI levels where the counterfactuals about their creators involved become more extreme) [...]
It seems like if you’ve solved alignment for a sovereign AGI but not for the ASI it will become under RSI, you haven’t solved alignment in a useful way (since it might only be a couple of years before that AGI progresses to ASI and its alignment drifts under powerful reflection and autonomous learning). My hesitations are probably all at the level you’re terming ASI.
(And I should probably just start using ASI for fully general competent agentic AGI).
Yes, the terminology I'm using is AGI = roughly comparable to human capacity, possibly somewhat higher or lower in narrow areas, such that a contest between human society and a rogue AGI is an interesting contest, and may depend on who gets to pick the terrain on which it's conducted; whereas ASI = at least significantly beyond human capacity across almost all areas that matter, such that a contest between human society and a rogue ASI is a foregone conclusion.
On style of alignment: in the post I touched on the question of what happens if you have multiple ASIs aligned to the well-being of different sets of humans: my prediction was that it very likely leads to an intelligence race and then a high-tech war. This is also my concern for DWIMAC-aligned AI in the possession of different groups of humans: if the technological difference between the capabilities of different groups gets too high, we see a repeat of the events described in Guns, Germs, and Steel. That didn't happen during the Cold War because of Mutual Assured Destruction, since the technological differential between the two sides never got that big (and to the extent that the Soviet Bloc lost the Cold War, it was primarily because it started to lose the technological race). I agree that Realpolitik may initially pull us towards DWIMAC alignment: I'm concerned that that may be an x-risk in the somewhat longer term. Most likely one human-led faction pulls ahead, and then co-opts/conquers/takes over/exterminates all other factions. At the end of that you have only one faction, and if they're wise enough to realize they don't want to repeat the experience, they may move over to a well-being-of-all-humanity-aligned design. I'm arguing that we should foresee and avoid that mistake, but I agree there's a significant risk that we won't be that wise/magnanimous/sensible.
Anyway, the topic you raise is basically orthogonal to the subject of my post — the technique I outline here can be used to aim for any (philosophically and ethically self-consistent) form of alignment that we can create a large synthetic training set describing a great many examples of. In describing an example of the approach, I assumed my preferred style of alignment, but the technique is broadly applicable, including to DWIMAC alignment. The real question you're raising is: what is a/the stable convergence target for the cycle of self-improvement of aligned AIs assisting us in building better-aligned AIs, which this technique is intended to get us to the start of? That topic is pretty speculative at this point, and more the subject of my posts on the basin of convergence to alignment than this one. It's an interesting question though, and I'm thinking about it, and if I reach any interesting conclusions I'll likely write another post.
A very brief stab at this: suppose an ASI is created by a corporation. The purpose of a creation is to maximize the well-being of its creator(s) (see my basin-of-convergence posts for a justification), in this case the shareholders of the company (in proportion to their shareholding, presumably). The question then becomes to what extent it is in the interests of those shareholders for the ASI to align to the interests of other people as well. The answer to this in a multipolar world where there are several such ASIs of comparable power levels is probably that the risk of war is too high unless they all align significantly to the well-being of all humanity, and only have a preference towards their individual corporate shareholders to whatever limited extent avoids excessive conflict. Whereas in a unipolar world, the sole ASI is capable of outmaneuvering the rest of humanity and creating an oligarchy of the shareholders, and would presumably do so if it believed that that was in their interest (or under DWIMAC, if they believed it was in their interest). Ethically, humans have a strong instinctive sense of fairness, but that generally applies in situations where individual power levels are comparable and the advantages of cooperating on iterated non-zero-sum games outweigh those of winning a non-iterated zero-sum game. By definition, taking over the world for your shareholders is a non-iterated zero-sum game, except for situations where conflict can make it negative-sum.
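The iterated vs. one-shot intuition here can be made concrete with a toy model (entirely my own illustration, with hypothetical prisoner's-dilemma-style payoff numbers): against a partner who retaliates, defection only pays in the one-shot game, which is why fairness norms hold among comparable powers but break down when one side can win outright.

```python
# Toy model: iterated vs. one-shot play with standard prisoner's-dilemma-style
# payoffs (hypothetical numbers, chosen only to illustrate the argument).
# Row player's payoff for (my_move, their_move):
PAYOFF = {("C", "C"): 3, ("D", "C"): 5, ("D", "D"): 1, ("C", "D"): 0}

def total(my_moves, their_moves):
    """Sum the row player's payoff over a sequence of rounds."""
    return sum(PAYOFF[(m, t)] for m, t in zip(my_moves, their_moves))

rounds = 10

# Iterated game against a retaliating partner: steady cooperation beats
# defecting once and then being punished with mutual defection.
always_cooperate = total(["C"] * rounds, ["C"] * rounds)
defect_then_punished = total(["D"] * rounds, ["C"] + ["D"] * (rounds - 1))

# One-shot game: defection against a cooperator strictly dominates.
one_shot_defect = total(["D"], ["C"])
one_shot_cooperate = total(["C"], ["C"])

print(always_cooperate, defect_then_punished)   # cooperation wins when iterated
print(one_shot_defect, one_shot_cooperate)      # defection wins one-shot
```

Under these (assumed) payoffs, cooperating every round yields 30 versus 14 for defecting and being punished, while in the single round defection yields 5 versus 3 — which is the sense in which "taking over the world for your shareholders" sits outside the regime where fairness instincts evolved.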
I agree on pretty much every point you've raised. I agree that there's a huge danger in successful DWIMAC or alignment-to-a-person. It could well lead to catastrophic conflict. I think this deserves a lot more analysis, because the creators of AGI will probably shoot for that if there's not a much better argument against it than we've seen so far.
This was entirely off-topic for this post; I don’t know where we got off topic, but it didn’t start in my last comment. And as you say, I think the choice of alignment target is almost as important as technical alignment techniques.
On the other hand, if alignment to human values isn't a stable target, we might be better off relying on the good nature of whoever both aligns their AGI to their intent/values and wins the AGI war. It's easier to indulge one's good nature when there is nearly zero downside to doing so, because you have incontestable control over the known lightcone. Even if horrible things happened in that war, most humans would prefer a happy, flourishing group of humans to be their friend. Sociopaths are the exception, so this route does not fill me with confidence either.
I think there’s more to be worked out here.
You suggest that multiple DWIMAC AGIs with different allegiances might establish both the wisdom and a means to cooperate and split the rapidly expanding pie. I also place some guarded optimism in that possibility.
I’m not sure if I’m the best person to be thinking/speculating on issues like that: I’m pretty sure I’m a better AI engineer than I am philosopher/ethicist, and there are a lot of people more familiar with the AI policy space than I am. On the other hand, I’m pretty sure I’ve spent longer thinking about the intersection of AI and ethics/philosophy than the great majority of AI engineers have (as in fifteen years), and few of the AI policy people that I’ve read have written much on the question “if we solve the alignment problem, what should we attempt to align AI to, and what might the social and Realpolitik consequences of different choices be?” (And then there’s the complicating question of “Are there also internal/technical/stability-under-reflection/philosophical constraints on that choice?” — to which I strongly suspect the short answer is “yes”, even though I’m not a moral realist.) There was some discussion of this sort of stuff about 10–15 years ago on Less Wrong, but back then we knew a lot less about what sort of AI we were likely to be aligning, what its strengths and weaknesses would be, and how human-like vs. alien and incomprehensible an intelligence it would be (the theoretical assumptions back then on Less Wrong tended to be more around some combination of direct construction like AIXI and/or reinforcement learning, rather than SGD token-prediction from the Internet), so we have a lot more useful information now about where the hard and easy parts are likely to be, and about the sociopolitical context.
I feel the same way about being unqualified to consider the geopolitical dynamics. But I also agree that the questions of technical alignment and best alignment target are interconnected (e.g., instruction-following as target seems to make technical alignment much easier). Therefore, I think no single human being is qualified to answer the whole question. As such, I think we need collaboration with people with other expertise. Do you happen to have any references or names for people who understand geopolitics and might grapple with technical alignment questions in conjunction with them?
I agree that we have much better footing to address both the technical and alignment-target questions now than 10–15 years ago. So I think we need a new concerted effort.
Do you happen to have any references or names for people who understand geopolitics and might grapple with technical alignment questions in conjunction with them?
Also no, but I’m sure there are many such people reading Less Wrong/the Alignment Forum. Perhaps one or both of us should write posts outlining the issues, and see if we can get a discussion started?