The whole thing doesn’t get less weird the more I think about it; it gets weirder. I don’t understand how one can hold all these positions at once. If that’s our best hope for survival, I don’t see much hope at all, and relatively speaking I see nothing that would make me hopeful enough not to attempt pivotal acts.
As someone who read Andrew Critch’s post and was pleasantly surprised to find him expressing a lot of views similar to mine (though mine are stronger in relation to pivotal acts), I can perhaps put out some possible reasons (of course, it is entirely possible that he holds these views for entirely different reasons):
Going into conflict with the rest of humanity is a bad idea. Cooperation is not only nicer but yields better results. This applies both to near-term diplomacy and to pivotal acts.
Pivotal acts are not only a bad idea for political reasons (you turn the rest of humanity against you) but are also very likely to fail technically. Creating an AI that can pull off a pivotal act and not destroy humanity is a lot harder than you think. To give a simple example based on a recent LessWrong post, consider on a gears level how you would program an AI to want to turn itself off, without giving it training examples. It’s not as easy as one might naively think. That also wouldn’t, IMO, get you something that’s safe, but I’m presenting it as a vastly easier example compared to actually doing a pivotal act and not killing everyone.
The relative difficulty gap between creating a pivotal-act-capable AI and an actually-aligned-to-human-values AI, on the other hand, is at least a lot smaller than people think, and likely runs in the opposite direction. My view on this relates to consequentialism, which is NOT the same thing as having a utility function, a common confusion on LessWrong. By consequentialism I mean caring about the outcome unconditionally, rather than contingent on some reason or context. Consequentialism is incompatible with alignment and corrigibility; utility functions, on the other hand, are fine, and do not imply consequentialism. (I give a small toy sketch of this distinction at the end of this comment.) Consequentialist assumptions prevalent in the rationalist community have, in my view, made alignment seem a lot more impossible than it really is. My impression of Eliezer is that non-consequentialism isn’t on his mental map at all; when he writes about deontology, for instance, it seems like he is imagining it as an abstraction rooted in consequentialism, and not as something actually non-consequentialist.
I also think agency is very important to the danger level from AGI, and current approaches have a relatively low level of agency, which reduces the danger. Yes, people are trying to make AIs more agentic; fortunately, getting high levels of agency is hard. No, I’m not particularly impressed by the Minecraft example in the post.
An AGI that can meaningfully help us create aligned AI doesn’t need to be agentic to do so, so getting one to help us with alignment is not in fact “alignment complete”.
Unfortunately, strongly self-modifying AI, such as bootstrap AI, is very likely to become strongly agentic because being more agentic is instrumentally valuable to an existing weakly agentic entity.
Taking these reasons together, attempting a pivotal act is a bad idea because:
You are likely to be using a bootstrap AI, which is likely to become strongly agentic through a snowball effect from a perhaps-unintended weakly agentic goal, and to optimize against you when you try to get it to do a pivotal act safely. I think it is likely possible to prevent this snowballing (though I don’t know exactly how), but since at least some pivotal-act advocates don’t seem to consider agency important, they likely won’t address it. (That was unfair; they tend to see agency everywhere. But they might, e.g., falsely consider a prototype whose short-term capabilities stay within the capability envelope of a prior AI to be “safe”, not realizing that the prior AI might have been safe only because it was not agentic.)
If you somehow overcome that hurdle, it will still be difficult to get the AI to do a pivotal act safely, and you will likely fail. Probably you fail by not doing the pivotal act at all, but the probability of not killing everyone, conditional on pulling off the pivotal act, is still not good.
If you overcome that hurdle too, you still end up at war with the rest of humanity for doing the pivotal act (if you need to keep your candidate acts hidden because they are “outside the Overton window”, don’t kid yourself about how they are likely to be received by the general public), and you wind up making things worse.
Also, the very plan to attempt a pivotal act in the first place intensifies races and poisons the prospects for alignment cooperation (e.g., if I had any dual-use alignment/bootstrap insights, I would be reluctant to share them even privately, out of concern that MIRI might get them and attempt a pivotal act).
And all this trouble wasn’t that urgent in the first place, because existing AIs aren’t that dangerous: their lack of strong self-modification keeps them from being highly agentic. That could of course change swiftly if such modification becomes available.
Finally, you could have just aligned the AI for not much more, and perhaps less, trouble. Unlike with a pivotal act, you don’t need to get alignment-to-values correct the first time, as long as the AI is sufficiently corrigible/self-correcting; you do, of course, need to get the corrigibility/self-correction aspect sufficiently correct the first time, but that is plausibly a simpler and easier target than doing a pivotal act without killing everyone.
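Since the utility-functions-versus-consequentialism point keeps coming up, here is a minimal toy sketch of the distinction I have in mind, tied to the turn-itself-off example above. Everything in it (the scenario, the names, the numbers) is invented purely for illustration; it is not a proposal for how to actually get such preferences into a trained system, which is exactly where the difficulty lives.

```python
# Toy sketch only: scenario, names and numbers are all invented.
# An agent picks whichever action maximizes its utility. A purely
# outcome-based ("consequentialist") utility cares only about the end
# state, so being shut down just means losing future value and the
# agent prefers to keep going. A utility defined over the whole
# trajectory, including whether a human shutdown request was honored,
# need not create that incentive -- which is the sense in which
# utility functions do not imply consequentialism.

OUTCOMES = {
    "comply_with_shutdown":  {"paperclips": 0,  "honored_shutdown": True},
    "disable_switch_and_go": {"paperclips": 10, "honored_shutdown": False},
}

def consequentialist_utility(outcome):
    # Cares about the outcome unconditionally, regardless of context.
    return outcome["paperclips"]

def trajectory_utility(outcome):
    # Also cares about how the outcome came about: honoring a human
    # shutdown request is valued in itself, not as a means to an end.
    return outcome["paperclips"] + (100 if outcome["honored_shutdown"] else 0)

def best_action(utility):
    return max(OUTCOMES, key=lambda action: utility(OUTCOMES[action]))

print(best_action(consequentialist_utility))  # disable_switch_and_go
print(best_action(trajectory_utility))        # comply_with_shutdown
```

Of course, in a hand-written table like this the “solution” is trivial; the point of the example above is that getting a trained system to robustly end up with something like the second utility function, without training examples, is the hard part.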
I mostly agree with this, and want to add some more considerations:
The relative difficulty gap between creating a pivotal-act-capable AI and an actually-aligned-to-human-values AI, on the other hand, is at least a lot smaller than people think, and likely runs in the opposite direction. My view on this relates to consequentialism, which is NOT the same thing as having a utility function, a common confusion on LessWrong. By consequentialism I mean caring about the outcome unconditionally, rather than contingent on some reason or context. Consequentialism is incompatible with alignment and corrigibility; utility functions, on the other hand, are fine, and do not imply consequentialism. Consequentialist assumptions prevalent in the rationalist community have, in my view, made alignment seem a lot more impossible than it really is. My impression of Eliezer is that non-consequentialism isn’t on his mental map at all; when he writes about deontology, for instance, it seems like he is imagining it as an abstraction rooted in consequentialism, and not as something actually non-consequentialist.
Weirdly enough, I agree with the top line statement, if for very different reasons than you state or think.
The big reason I agree with this statement is that, to a large extent, the alignment community mispredicted how AI would progress, though unlike many mispredictions, I’d say this one really was mostly unpredictable. Specifically, LLM progress was way faster relative to RL progress, or maybe LLMs were just hyped more.
In particular, LLMs have 1 desirable safety property:
Inherently less tendency to form instrumental goals, and in particular consequentialist goals. This is important because it means LLMs avoid the traditional AI alignment failure modes; in particular, AI misalignment is a lot less probable without instrumental goals.
This is plausibly strong enough that once we get the goals correct, à la outer alignment, like what Pretraining from Human Feedback sort of did, alignment might just be done for LLMs.
This is related to porby’s post on instrumentality making agents agenty. One important conclusion is that so long as we mostly avoid instrumental goals (which LLMs mostly do avoid by default, due to much denser information and much tighter goal constraints), we mostly avoid models fighting us, which is very important for safety (arguably so important that LLM alignment becomes much easier than alignment of AI in general).
Here’s the post:
https://www.lesswrong.com/posts/EBKJq2gkhvdMg5nTQ/instrumentality-makes-agents-agenty
And here’s the comment that led me to make that observation:
https://www.lesswrong.com/posts/rmfjo4Wmtgq8qa2B7/?commentId=GKhn2ktBuxjNhmaWB
So to the extent that alignment researchers mispredicted how common consequentialism would be in AI, it’s related to an upstream mistake, which in hindsight was not noticing how much LLMs would scale relative to RL. That scaling difference means instrumental goals mostly don’t matter, which vastly shrinks the problem space.
To put it more pithily, the alignment field is too stuck in RL thinking, and doesn’t realize how much LLMs change the space.
On deontology, there’s actually an analysis of whether deontological AIs are safer, and the TL;DR is that they aren’t very safe without stronger or different assumptions.
The big problem is that most forms of deontology don’t play well with safety, especially of the existential kind, primarily because deontology either actively rewards existential risk or has other unpleasant consequences. One example is that an AI might use persuasion to get humans to essentially commit suicide; given standard RL, this would be very dangerous due to instrumental goals.
But there is more in the post below:
https://www.lesswrong.com/posts/gbNqWpDwmrWmzopQW/is-deontological-ai-safe-feedback-draft
Boundaries/Membranes may improve the situation, but that hasn’t been tried yet, nor do we have any data on how Boundaries/Membranes would work.
This is my main comment re pivotal acts and deontology, and while I mostly agree with you, I don’t totally agree with you here.
On deontology, there’s actually an analysis of whether deontological AIs are safer, and the TL;DR is that they aren’t very safe without stronger or different assumptions.
Wise people with fancy hats are bad at deontology (well actually, everyone is bad at explicit deontology).
What I actually have in mind as a leading candidate for alignment is preference utilitarianism, conceptualized in a non-consequentialist way. That is, you evaluate actions based on (current) human preferences about them, which include preferences over the consequences but can also include aspects other than preferences over the consequences, and you don’t per se value how future humans will view the action (though you would take current humans’ preferences about that into account).
This could also be self-correcting, in the sense, e.g., that it could use preferences_definition_A, and humans could want_A it to switch to preferences_definition_B. I’m not sure it is self-correcting enough, but I don’t have a better candidate for corrigibility at the moment.
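To make the shape of this a bit more concrete, here is a very rough sketch. Every name and structure in it is invented for illustration, and the elicitation of current human preferences is left as a stub; it is only meant to show where the non-consequentialist evaluation and the A-to-B switching would sit, not how to actually obtain the preferences.

```python
# Rough illustrative sketch; all names are invented and preference
# elicitation is stubbed out.
from typing import Callable, Dict

Action = str
# A "preference definition": some way of eliciting/aggregating current
# human preferences about an action into a score.
PreferenceDef = Callable[[Action], float]

def evaluate(action: Action, current_prefs: PreferenceDef) -> float:
    # Non-consequentialist in the sense above: the action is scored by
    # current human preferences about it (which may reference its
    # predicted consequences, but also anything else about it); it is
    # not re-scored later by how future humans end up viewing it.
    return current_prefs(action)

def maybe_switch_definition(defs: Dict[str, PreferenceDef], active: str) -> str:
    # Self-correction: the meta-action "adopt definition X" is itself
    # scored by the *currently active* definition, so preferences under
    # definition A can endorse switching to definition B.
    score = defs[active]
    return max(defs, key=lambda name: score(f"adopt preference definition {name}"))

# Dummy usage: under definition A, humans prefer switching to B.
prefs_a: PreferenceDef = lambda a: {"adopt preference definition A": 0.2,
                                    "adopt preference definition B": 0.9}.get(a, 0.0)
prefs_b: PreferenceDef = lambda a: 0.0
print(evaluate("adopt preference definition B", prefs_a))        # 0.9
print(maybe_switch_definition({"A": prefs_a, "B": prefs_b}, "A"))  # -> "B"
```

Whether this is “self-correcting enough” is exactly the open question: the switch is only ever evaluated through whatever definition is currently active.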
Edit regarding LLMs: I’m more inclined to think that the base objective of predicting text is not agentic (relative to the real world) at all, and that the simulacra generated by an entity following this base objective can be agentic (relative to the real world) due to imitation of agentic text-producing entities, but they’re generally better at the textual appearance of agency than at the reality of it; and the lack of instrumentality is more the effect of the lack of agency relative to the real world than the cause of it.