> the failure mode of an amoral AI system that doesn’t care about you seems both more likely and more amenable to technical safety approaches (to me at least).
It seems to me that at least some parts of this research agenda are relevant for some special cases of “the failure mode of an amoral AI system that doesn’t care about you”. A lot of contemporary AIS research assumes some kind of human-in-the-loop setup (e.g. amplification/debate, recursive reward modeling), and for such setups it seems natural to ask questions like “under what circumstances do humans interacting with an artificial agent become convinced that the agent’s commitments are credible?”. Such questions seem relevant under a very wide range of moral systems (including ones that don’t place much weight on s-risks).
> It seems to me that at least some parts of this research agenda are relevant for some special cases of “the failure mode of an amoral AI system that doesn’t care about you”.
I still wouldn’t recommend working on those parts, because they seem decidedly less impactful than other options. But as originally written, the paragraph does sound like I’m claiming that the agenda is totally useless for anything besides s-risks, which I certainly don’t believe. I’ve changed that second paragraph to:
> However, under other ethical systems (under which s-risks are worse than x-risks, but do not completely dwarf x-risks), I expect other technical safety research to be more impactful, because other approaches can more directly target the failure mode of an amoral AI system that doesn’t care about you, which seems both more likely and more amenable to technical safety approaches (to me at least). I could imagine work on this agenda being quite important for _strategy_ research, though I am far from an expert here.