I mostly agree with this, and want to add some more considerations:
The relative difficulty gap between creating a pivotal-act-capable AI and an AI actually aligned to human values is, on the other hand, at least a lot smaller than people think, and likely runs in the opposite direction. My view on this relates to consequentialism, which is NOT the same thing as utility functions, contrary to a common conflation on LessWrong. By consequentialism I mean caring about the outcome unconditionally, rather than conditional on some reason or context. Consequentialism is incompatible with alignment and corrigibility; utility functions, on the other hand, are fine, and do not imply consequentialism. Consequentialist assumptions prevalent in the rationalist community have, in my view, made alignment seem a lot more impossible than it really is. My impression of Eliezer is that non-consequentialism isn't on his mental map at all; when he writes about deontology, for instance, he seems to imagine it as an abstraction rooted in consequentialism, not as something actually non-consequentialist.
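To make the utility-function-vs-consequentialism distinction concrete, here's a toy Python sketch (all names and numbers are made up for illustration): both agents maximize a utility function, but only the first is consequentialist, because the second's utility depends on the action and its context rather than solely on the outcome.

```python
# Toy illustration only: two agents that both maximize a utility function.
# All names, contexts, and payoffs below are hypothetical.

ACTIONS = ["keep_promise", "break_promise"]
CONTEXTS = ["promise_was_made", "no_promise_made"]

def outcome_of(action, context):
    """Toy world model: what happens as a result of the action."""
    return "partner_benefits" if action == "keep_promise" else "agent_benefits"

def consequentialist_utility(action, context):
    # Cares only about the resulting outcome, unconditionally.
    outcome = outcome_of(action, context)
    return {"partner_benefits": 1.0, "agent_benefits": 2.0}[outcome]

def non_consequentialist_utility(action, context):
    # Still a utility function, but it depends on the action and the context
    # (a promise was made), not only on the outcome it leads to.
    if context == "promise_was_made" and action == "break_promise":
        return -10.0
    return consequentialist_utility(action, context)

for context in CONTEXTS:
    best_c = max(ACTIONS, key=lambda a: consequentialist_utility(a, context))
    best_n = max(ACTIONS, key=lambda a: non_consequentialist_utility(a, context))
    print(context, "-> consequentialist picks", best_c, "| non-consequentialist picks", best_n)
```

Both agents are perfectly coherent maximizers; only the first one's preferences are purely about outcomes.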
Weirdly enough, I agree with the top-line statement, if for very different reasons than the ones you state or have in mind.
The big reason I agree with this statement is that, to a large extent, the alignment community mispredicted how AI would progress, though unlike many failed predictions, I'd say this one really was mostly unpredictable. Specifically, LLMs progressed much faster relative to RL, or perhaps LLM progress was simply hyped more.
In particular, LLMs have one desirable safety property:
An inherently weaker tendency to form instrumental goals, and in particular consequentialist goals. This is important because it means LLMs avoid the traditional AI alignment failure modes; misalignment is a lot less probable without instrumental goals.
This is plausibly strong enough that once we get the correct goals in place, à la outer alignment, like what Pretraining from Human Feedback sort of did, alignment might just be done for LLMs.
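To gesture at what "getting the goals in at pretraining time" could look like, here's a hedged sketch in the spirit of Pretraining from Human Feedback's conditional-training idea: training segments are tagged with control tokens according to a preference scorer, and generation is conditioned on the "good" token. The scorer, threshold, and token names below are my own illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch of conditional training in the spirit of Pretraining from
# Human Feedback: tag pretraining segments with control tokens from a
# preference scorer, then condition generation on the "good" token.
# The scorer, threshold, and token names are illustrative assumptions.

GOOD, BAD = "<|good|>", "<|bad|>"
THRESHOLD = 0.0  # hypothetical cutoff on the preference score

def preference_score(text: str) -> float:
    """Stand-in for a learned reward model or rule-based scorer."""
    return -1.0 if "toxic" in text else 1.0

def tag_segment(text: str) -> str:
    # Prepend a control token so the LM learns what preferred text looks like
    # during pretraining, rather than being steered by a separate RL loop later.
    token = GOOD if preference_score(text) >= THRESHOLD else BAD
    return token + text

corpus = ["a helpful explanation of recursion", "some toxic insult"]
tagged_corpus = [tag_segment(t) for t in corpus]
print(tagged_corpus)

# At inference time, prompts are prefixed with GOOD so the model samples from
# the "preferred" conditional distribution:
prompt = GOOD + "Explain recursion simply:"
```

The point is just that the goal content enters through the training distribution itself rather than through a separate consequentialist objective bolted on afterwards.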
This point about instrumental goals is related to porby's post on instrumentality making agents agenty. One important conclusion is that so long as we mostly avoid instrumental goals (which LLMs mostly do by default, thanks to much denser information and much tighter goal constraints), we mostly avoid models fighting you, which is very important for safety (arguably so important that aligning LLMs becomes much easier than aligning AI in general).
Here’s the post:
https://www.lesswrong.com/posts/EBKJq2gkhvdMg5nTQ/instrumentality-makes-agents-agenty
And here’s the comment that led me to make that observation:
https://www.lesswrong.com/posts/rmfjo4Wmtgq8qa2B7/?commentId=GKhn2ktBuxjNhmaWB
So to the extent that alignment researchers mispredicted how common consequentialism would be in AI, it's related to an upstream mistake: not noticing, in hindsight, how much LLMs would scale relative to RL. That scaling means instrumental goals mostly don't matter, which vastly shrinks the problem space.
To put it more pithily, the alignment field is too stuck in RL thinking, and doesn’t realize how much LLMs change the space.
On deontology, there's actually an analysis of whether deontological AI is safer, and the TL;DR is that it isn't very safe without stronger or different assumptions.
The big problem is that most forms of deontology don't play well with safety, especially of the existential kind, primarily because deontology either actively rewards taking existential risks or has other unpleasant consequences. One example: an AI may use persuasion to get humans to essentially commit suicide, and under standard RL this would be very dangerous due to instrumental goals.
But there is more in the post below:
https://www.lesswrong.com/posts/gbNqWpDwmrWmzopQW/is-deontological-ai-safe-feedback-draft
Boundaries/Membranes may improve the situation, but that hasn't yet been tried, nor do we have any data on how well Boundaries/Membranes would work.
This is my main comment re pivotal acts and deontology, and while I mostly agree with you, I don't totally agree with you here.
Wise people with fancy hats are bad at deontology (well actually, everyone is bad at explicit deontology).
What I actually have in mind as a leading candidate for alignment is preference utilitarianism, conceptualized in a non-consequentialist way. That is, you evaluate actions based on (current) human preferences about them; these include preferences over the consequences, but can also include aspects other than the consequences, and you don't per se value how future humans will view the action (though you would take current humans' preferences about that into account).
This could also be self-correcting, in the sense that e.g. it could use preferences_definition_A and humans could want_A it to switch to preferences_definition_B. I'm not sure it's self-correcting enough, but I don't have a better candidate for corrigibility at the moment.
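A toy sketch of what I mean, with completely made-up preference definitions, voters, and switch rule: actions are scored by current humans' preferences about the action itself, and the currently active definition hands over to another one if, under the current definition, people prefer the switch.

```python
# Illustrative sketch of non-consequentialist preference utilitarianism with a
# self-correction hook. All definitions, voters, and thresholds are hypothetical.

from statistics import mean

def preferences_definition_A(person, action):
    # Current preference about the action itself (process and consequences both count).
    return person["judgments_A"].get(action, 0.0)

def preferences_definition_B(person, action):
    return person["judgments_B"].get(action, 0.0)

def evaluate(action, people, definition):
    # Aggregate *current* human preferences about the action; no appeal to how
    # future humans would retrospectively judge it.
    return mean(definition(p, action) for p in people)

def maybe_switch(people, current, candidate, threshold=0.5):
    # Self-correction: switch definitions if, under the *current* definition,
    # people prefer the switch itself.
    switch_action = "switch_to_candidate_definition"
    return candidate if evaluate(switch_action, people, current) > threshold else current

people = [
    {"judgments_A": {"tell_truth": 1.0, "switch_to_candidate_definition": 0.8},
     "judgments_B": {"tell_truth": 0.9}},
]

active = preferences_definition_A
active = maybe_switch(people, active, preferences_definition_B)
print(evaluate("tell_truth", people, active))  # scored under definition B after the switch
```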
Edit regarding LLMs: I'm more inclined to think the base objective of predicting text is not agentic (relative to the real world) at all. The simulacra generated by an entity following this base objective can be agentic (relative to the real world) through imitating agentic text-producing entities, but they're generally better at the textual appearance of agency than at the reality of it; and the lack of instrumentality is more the effect of the lack of agency-relative-to-the-real-world than its cause.