On deontology, there’s actually an analysis of whether deontological AIs are safer, and the TL;DR is that they aren’t very safe without stronger or different assumptions.
Wise people with fancy hats are bad at deontology (well actually, everyone is bad at explicit deontology).
What I actually have in mind as a leading candidate for alignment is preference utilitarianism, conceptualized in a non-consequentialist way. That is, you evaluate actions based on (current) human preferences about them; these include preferences over the consequences, but can also include aspects other than the consequences, and you don’t per se value how future humans will view the action (though you would take current-human preferences about this into account).
This could also be self-correcting, in the sense that, e.g., it could use preferences_definition_A while humans want_A it to switch to preferences_definition_B. I’m not sure whether it is self-correcting enough, but I don’t have a better candidate for corrigibility at the moment.
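To make the self-correction idea a bit more concrete, here is a minimal toy sketch (not a real proposal; all names like `PreferenceUtilitarianAgent` and `maybe_self_correct` are hypothetical): an evaluator scores actions directly under the currently active preference definition, and it switches from definition A to definition B only if current preferences, as measured under A, favor the switch.

```python
# Toy sketch, not an implementation proposal: a non-consequentialist
# preference-utilitarian evaluator that scores actions by *current* human
# preferences over the actions themselves (not only their consequences),
# and can "self-correct" by switching preference definitions when current
# preferences endorse the switch.

from dataclasses import dataclass
from typing import Callable, Dict, List

Action = str
# A preference definition maps an action to a score reflecting current human
# preferences about that action, which may go beyond its consequences.
PreferenceDefinition = Callable[[Action], float]


@dataclass
class PreferenceUtilitarianAgent:
    preferences: PreferenceDefinition            # currently active definition (e.g. definition A)
    candidate_definitions: Dict[str, PreferenceDefinition]

    def evaluate(self, action: Action) -> float:
        # Evaluate the action under current preferences, rather than by
        # forecasting how future humans would later judge it.
        return self.preferences(action)

    def choose(self, actions: List[Action]) -> Action:
        return max(actions, key=self.evaluate)

    def maybe_self_correct(self, name: str) -> None:
        # Switch from definition A to definition B iff current preferences
        # (as scored under A) favor switching over keeping the status quo.
        switch = f"adopt preference definition {name}"
        keep = "keep current preference definition"
        if self.evaluate(switch) > self.evaluate(keep):
            self.preferences = self.candidate_definitions[name]
```

Whether this is corrigible enough depends entirely on how faithfully the preference definitions capture what humans currently want, which the sketch deliberately leaves abstract.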
Edit regarding LLMs: I’m more inclined to think the base objective of predicting text is not agentic (relative to the real world) at all. The simulacra generated by an entity following this base objective can be agentic relative to the real world, due to imitation of agentic text-producing entities, but they’re generally better at the textual appearance of agency than at the reality of it; and the lack of instrumentality is more the effect of the lack of agency-relative-to-the-real-world than its cause.